Usability Testing Best Practices: An Interview with Rolf Molich

by Christine Perfetti

You may have never heard of Rolf Molich. Yet, if you’ve done any usability testing, design evaluations, or heuristic inspections, then you’ve been affected by his pioneering work.

Since entering the field in 1983, Rolf has produced some of the most impressive and forward-thinking research on effective discount usability techniques. Two of Rolf’s more renowned contributions include the co-invention of the Heuristic Inspection process with Jakob Nielsen and the more recent CUE (Comparative Usability Evaluation) studies.

The Heuristic Inspection approach turned the usability world on its head when Rolf and Jakob suggested that you could get value by having experts review interface designs. However, in recent years, Rolf has revisited his thinking on this method and is now questioning its effectiveness for all projects.

The CUE studies are the first of their kind. Usability practitioners from all over the world are asked to evaluate the same interface, using their standard practices. Rolf, along with Robin Jeffries and other collaborators (including the interface’s design teams), compared the different results, looking to see which practices were most effective at discovering and reporting usability problems.

The most famous study, CUE-2, had nine teams conduct usability tests of Microsoft’s Hotmail interface. More recently, CUE-4 had 18 evaluators (using both expert inspections and usability testing) looking at iHotelier’s Flash-based hotel reservation system.

While we’re still waiting for many of the results from the CUE-4 analysis, the CUE-2 study has changed the way we think about usability testing practice. (For example, many questions arose about how “scientific” usability practices really are when there wasn’t one problem that all nine teams reported.)

While preparing for Rolf’s full-day seminar at the User Interface 8 Conference, we had the opportunity to ask Rolf about some of his thoughts on the best practices surrounding usability testing. Here’s what we talked about:

UIE: Many critics of usability testing argue that usability testing can’t make up for a bad design. Do you agree that if a design team starts with a deeply flawed design, usability testing will diagnose many of the problems, but won’t necessarily point to a cure?

Rolf Molich: Alan Cooper has wisely said “If you want to create a beautifully cut diamond, you cannot begin with a lump of coal. No amount of chipping, chiseling, and sanding will turn that coal into a diamond.”

That said, I have helped a lot of my clients produce rather usable pieces of coal based on simple rules for how to write good error messages, re-phrase key messages, tune a local search engine, or make other kinds of quick-and-dirty last-minute changes.

Many usability practitioners believe that “eight users is enough” to find the majority of usability problems on web sites. In your experience, how many users is enough for testing?

It depends. The number of users needed for web-testing depends on the goal of the test. If you have no goal, then anything (including nothing) will do.

If your goal is to “sell” usability in your organization, then I believe 3-4 users will be sufficient. Much more important than the number of users is the sensible involvement of your project team in the test process and proper consensus-building after the test.

If your goal is to find catastrophic problems to drive an iterative development process, then 5-6 users are enough with the current state of the art.

However, if you want to find all usability problems in an interface, then a large number of users and facilitators will be necessary, as shown by the CUE studies and UIE’s research. In the CUE-2 and CUE-4 studies, tests with more than 50 users brought us large numbers of valid problems, but nowhere near an exhaustive problem list.

Since you and Jakob first started promoting Heuristic Inspections, you’ve indicated you’re not as optimistic about the technique. Where are your thoughts on that particular method today?

Heuristic inspections are cheap, simple to explain, and deceptively simple to execute. However, I don’t use this method very often and I don’t recommend it to my clients. In my opinion, the idea that anyone can conduct a useful heuristic inspection after a crash course is rubbish. The results from my studies showed that inexperienced inspectors working on their own often produce disastrous amounts of “false alarms”.

Another problem is that heuristic inspection is based solely on opinions. No one has given me a good answer to the question that I’ve heard several times from disbelieving designers: “Why are your opinions better than mine?” I think that’s an excellent question, particularly knowing that users often prove me wrong whenever my heuristic predictions are put to a real usability test.

What prompted the CUE studies?

Curiosity and the need for solid data. With the CUE studies, I wanted to offer designers and usability practitioners a summary of current, state-of-the-art usability testing practices. At the same time, I wanted to give the participating usability labs an opportunity to assess their strengths and weaknesses in the core practices of the usability profession.

What were the biggest surprises when you compared the processes and reports from each of the nine teams in CUE-2?

What surprised me most was that many of the tests did not fully live up to what I consider to be sound usability practices.

In the CUE-2 study, nine teams tested the Hotmail website. Each team had three weeks to run the study, which included recruiting their own test participants and creating their own test tasks. We imposed as few restraints on the teams as possible to ensure that the teams did the tests exactly as they would have done if they had been ordinary client projects.

Many of the teams failed to create professional test tasks that were realistic, frequently occurring, and free of hidden clues and jargon. Some teams also failed to distinguish between user data and personal opinions. Even more surprising was how unusable some of the usability teams’ reports were.

What elements were lacking in the test reports?

Above all, a good usability report must be usable. The main recommendations I give clients for creating a usable usability report are:

Keep it short.
No more than approximately 50 comments and 30 pages. It’s the job of the good usability professional to limit the comments to the ones that are really important.
Provide a one-page executive summary on page 2.
Include the top three positive comments and the top three problems. Four of the nine CUE-2 teams did not include an executive summary in their reports.
Include positive findings.
The ideal ratio between positive findings and problems is 1:1, but I have to admit that I rarely do better than 1:3. The CUE-2 teams ranged from no positive comments at all to an excellent ratio of 7:10.
Classify the comments.
Distinguish between disasters, serious problems, minor problems, positive findings, bugs, and suggestions for improving the interface. Three of the nine CUE-2 teams did not classify their comments at all. The remaining six each invented their own classification scheme.
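As a rough illustration only (not part of the interview), the four recommendations above could be turned into a simple checklist. The `Report` and `Comment` structures, the category names, and the `check_report` function are all hypothetical, chosen for this sketch; the 50-comment and 30-page limits are the approximate figures Rolf gives.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Category labels taken from the "Classify the comments" recommendation.
# The exact spellings are an assumption made for this sketch.
CATEGORIES = {"disaster", "serious", "minor", "positive", "bug", "suggestion"}


@dataclass
class Comment:
    text: str
    category: Optional[str] = None  # None means the comment was never classified


@dataclass
class Report:
    pages: int
    has_executive_summary: bool
    comments: List[Comment] = field(default_factory=list)


def check_report(report: Report, max_comments: int = 50, max_pages: int = 30) -> List[str]:
    """Return a list of warnings for a hypothetical usability report."""
    warnings = []
    if len(report.comments) > max_comments:
        warnings.append(f"more than {max_comments} comments")
    if report.pages > max_pages:
        warnings.append(f"longer than {max_pages} pages")
    if not report.has_executive_summary:
        warnings.append("no executive summary")
    if not any(c.category == "positive" for c in report.comments):
        warnings.append("no positive findings")
    unclassified = [c for c in report.comments if c.category not in CATEGORIES]
    if unclassified:
        warnings.append(f"{len(unclassified)} unclassified comment(s)")
    return warnings


# Made-up example report, for illustration only.
example = Report(
    pages=42,
    has_executive_summary=False,
    comments=[
        Comment("Search box is easy to find", "positive"),
        Comment("Error message uses jargon", "serious"),
        Comment("Login page is confusing"),
    ],
)
warnings = check_report(example)
print(warnings)
```

A checklist like this only covers the mechanical parts of the advice; deciding which comments are really important still takes professional judgment.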

Reports are of course useful, but even a perfect report is useless if it doesn’t cause beneficial changes to the user interface. For example, good communication with the development team through effective consensus building is far more important than a good test report.

In the CUE-2 study, there wasn’t a single usability problem that every team reported. The findings indicate a strong need for improvement in the usability testing process. Don’t you think your findings undermine the effectiveness of usability testing?

In my experience, usability testing is very effective for showing your colleagues what a usability problem looks like in their interface. But I think the study results indicate that usability testing is ineffective for finding all usability problems in an interface. Our results also indicate that it’s ineffective even for finding all the serious usability problems in an interface.

The CUE-2 teams reported 310 different usability problems. The most frequently reported problem was reported by seven of the nine teams. Only six problems were reported by more than half of the teams, while 232 problems (75%) were reported only once. Many of the problems that were classified as “serious” were only reported by a single team.
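The overlap figures above can be reproduced from a table of which team reported which problem. The sketch below is only an illustration: the `overlap_summary` function and the three-team data set are made up, and only the 232-of-310 figure comes from the study as quoted here.

```python
from collections import Counter
from typing import Dict, Set


def overlap_summary(reports_by_team: Dict[str, Set[str]]) -> Dict[str, float]:
    """Summarise how often each distinct problem was reported across teams."""
    # Count, for every distinct problem, how many teams reported it.
    counts = Counter(p for problems in reports_by_team.values() for p in problems)
    n_teams = len(reports_by_team)
    total = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    majority = sum(1 for c in counts.values() if c > n_teams / 2)
    return {
        "distinct_problems": total,
        "reported_once": once,
        "reported_once_share": once / total if total else 0.0,
        "reported_by_majority": majority,
        "max_teams_for_one_problem": max(counts.values(), default=0),
    }


# Hypothetical data: three teams, problem IDs P1..P5.
example = {
    "team_a": {"P1", "P2", "P3"},
    "team_b": {"P1", "P4"},
    "team_c": {"P1", "P2", "P5"},
}
summary = overlap_summary(example)
print(summary)

# The share quoted in the interview: 232 of 310 problems reported only once.
cue2_once_share = 232 / 310
print(f"{cue2_once_share:.1%}")  # about 75%
```

In the made-up data, three of the five distinct problems are reported by a single team, the same kind of long tail the CUE-2 numbers show on a larger scale.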

Even the tasks used by most or all teams produced very different results—around 70% of the findings for each of these common tasks were unique.

My main conclusion is that the assumption that all usability professionals use the same methods and get the same results in a usability test is plain wrong.

Given your findings, how can development teams confidently conclude they are fixing the *right* problems on their web sites?

It’s very simple: They can’t be sure!

But if they are humble, listen to their critics, learn from their mistakes, avoid voodoo methods, and use regular external coaching to catch bad habits, they may eventually detect enough real problems to drive the iteration forward in a useful way.

Given your results from the CUE studies, do you think usability testing will play a major role in creating usable web sites in the future?

Usability tests are spectacular. They are excellent for convincing skeptical colleagues that serious usability problems exist in an interface. But they are also inefficient and costly. We should use them mainly in an intermediate phase to establish trust with our colleagues, and then use much more cost-efficient preventive methods such as usable interface building blocks, reviews based on standards and proven guidelines, and contextual inquiry.

I hope that we will one day have huge libraries of generic interface building blocks that are thoroughly tested with real users and proven usable. I also hope that we will show how assembling such building blocks into full-blown websites by usability-conscious specialists will yield websites with a high degree of usability.

Thanks, Rolf!

Published here on July 24, 2003.