Usability testing
Usability testing is a technique used in user-centered interaction design to evaluate a product by testing it on users. It can be seen as an irreplaceable usability practice, since it gives direct input on how real users use the system.[1] It is concerned with the intuitiveness of the design and is conducted with users who have no prior exposure to the product. Such testing is paramount to the success of an end product, as a fully functioning application that confuses its users will not last long.[2] This is in contrast with usability inspection methods, where experts use various techniques to evaluate a user interface without involving users.
Usability testing focuses on measuring a human-made product's capacity to meet its intended purposes. Examples of products that commonly benefit from usability testing are food, consumer products, websites or web applications, computer interfaces, documents, and devices. Usability testing measures the usability, or ease of use, of a specific object or set of objects, whereas general human–computer interaction studies attempt to formulate universal principles.
What it is not
Simply gathering opinions on an object or a document is market research or qualitative research rather than usability testing. Usability testing usually involves systematic observation under controlled conditions to determine how well people can use the product.[3] However, qualitative research and usability testing are often used in combination to better understand users' motivations and perceptions in addition to their actions.
Rather than showing users a rough draft and asking, "Do you understand this?", usability testing involves watching people trying to use something for its intended purpose. For example, when testing instructions for assembling a toy, the test subjects should be given the instructions and a box of parts and, rather than being asked to comment on the parts and materials, they should be asked to put the toy together. Instruction phrasing, illustration quality, and the toy's design all affect the assembly process.
Methods
Setting up a usability test involves carefully creating a scenario, or a realistic situation, in which the person performs a list of tasks using the product being tested while observers watch and take notes (dynamic verification). Several other test instruments, such as scripted instructions, paper prototypes, and pre- and post-test questionnaires, are also used to gather feedback on the product being tested (static verification). For example, to test the attachment function of an e-mail program, a scenario would describe a situation where a person needs to send an e-mail attachment, and they would be asked to undertake this task. The aim is to observe how people function in a realistic manner so that developers can identify problem areas and fix them. Techniques popularly used to gather data during a usability test include the think-aloud protocol, co-discovery learning, and eye tracking.
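As an illustration of how such a scenario and its observation notes might be organized, the following Python sketch defines a task list and records observer notes against it. The structure, field names, and sample data are illustrative assumptions, not a standard test-plan format.

```python
# Minimal sketch of a test plan for the e-mail attachment scenario described
# above. All names and fields here are illustrative, not a standard format.
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    instruction: str          # what the participant is asked to do
    success_criteria: str     # what counts as completion


@dataclass
class Scenario:
    title: str
    context: str              # the realistic situation framing the tasks
    tasks: list = field(default_factory=list)
    observations: list = field(default_factory=list)  # notes taken by observers

    def log(self, task_id: str, note: str) -> None:
        """Record an observation against a specific task."""
        self.observations.append({"task": task_id, "note": note})


scenario = Scenario(
    title="Send a report to a colleague",
    context="You have finished a report and need to e-mail it to your manager.",
    tasks=[Task("T1", "Attach the file report.pdf to a new message",
                "Message contains the attachment")],
)
scenario.log("T1", "Participant looked for the attach icon in the File menu first.")
print(scenario.observations)
```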
Hallway testing
Hallway testing, also known as guerrilla usability, is a quick and cheap method of usability testing in which people, such as those passing by in the hallway, are asked to try using the product or service. This can help designers identify "brick walls" (problems so serious that users simply cannot advance) in the early stages of a new design. Anyone except the project's designers and engineers can be used as a participant; those closest to the project tend to act as "expert reviewers" rather than as representative users.
This type of testing is an example of convenience sampling and thus the results are potentially biased.
Remote usability testing
In a scenario where usability evaluators, developers and prospective users are located in different countries and time zones, conducting a traditional lab usability evaluation creates challenges both from the cost and logistical perspectives. These concerns led to research on remote usability evaluation, with the user and the evaluators separated over space and time. Remote testing, which facilitates evaluations being done in the context of the user's other tasks and technology, can be either synchronous or asynchronous. The former involves real time one-on-one communication between the evaluator and the user, while the latter involves the evaluator and user working separately.[4] Numerous tools are available to address the needs of both these approaches.
Synchronous usability testing methodologies involve video conferencing or employ remote application sharing tools such as WebEx. WebEx and GoToMeeting are the most commonly used technologies to conduct a synchronous remote usability test.[5] However, synchronous remote testing may lack the immediacy and sense of "presence" desired to support a collaborative testing process. Moreover, managing interpersonal dynamics across cultural and linguistic barriers may require approaches sensitive to the cultures involved. Other disadvantages include having reduced control over the testing environment and the distractions and interruptions experienced by the participants in their native environment.[6] One of the newer methods developed for conducting a synchronous remote usability test is by using virtual worlds.[7]
Asynchronous methodologies include automatic collection of user's click streams, user logs of critical incidents that occur while interacting with the application and subjective feedback on the interface by users.[6] Similar to an in-lab study, an asynchronous remote usability test is task-based and the platform allows researchers to capture clicks and task times. Hence, for many large companies, this allows researchers to better understand visitors' intents when visiting a website or mobile site. Additionally, this style of user testing also provides an opportunity to segment feedback by demographic, attitudinal and behavioral type. The tests are carried out in the user's own environment (rather than labs) helping further simulate real-life scenario testing. This approach also provides a vehicle to easily solicit feedback from users in remote areas quickly and with lower organizational overheads. In recent years, conducting usability testing asynchronously has also become prevalent and allows testers to provide feedback in their free time and from the comfort of their own home.
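As a rough illustration of this kind of automated data collection, the following Python sketch derives task completion and task time from a logged click stream. The event names, fields, and records are invented for the example rather than taken from any particular platform.

```python
# Minimal sketch of deriving task time and completion from a raw click stream
# collected asynchronously. Event names and fields are illustrative.
events = [
    {"user": "p01", "t": 0.0,   "event": "task_start"},
    {"user": "p01", "t": 12.4,  "event": "click", "target": "attach_button"},
    {"user": "p01", "t": 73.9,  "event": "task_success"},
    {"user": "p02", "t": 0.0,   "event": "task_start"},
    {"user": "p02", "t": 180.0, "event": "task_abandoned"},
]

def task_summary(stream, user):
    """Return (completed, seconds) for one participant's task attempt."""
    mine = [e for e in stream if e["user"] == user]
    start = next(e["t"] for e in mine if e["event"] == "task_start")
    end = next(e for e in mine if e["event"] in ("task_success", "task_abandoned"))
    return end["event"] == "task_success", end["t"] - start

for user in ("p01", "p02"):
    done, secs = task_summary(events, user)
    print(user, "completed" if done else "abandoned", f"after {secs:.0f}s")
```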
Expert review
Expert review is another general method of usability testing. As the name suggests, this method relies on bringing in experts with experience in the field (possibly from companies that specialize in usability testing) to evaluate the usability of a product.
A heuristic evaluation or usability audit is an evaluation of an interface by one or more human factors experts. Evaluators measure the usability, efficiency, and effectiveness of the interface based on usability principles, such as the 10 usability heuristics originally defined by Jakob Nielsen in 1994.[8]
Nielsen's usability heuristics, which have continued to evolve in response to user research and new devices, include:
- Visibility of system status
- Match between system and the real world
- User control and freedom
- Consistency and standards
- Error prevention
- Recognition rather than recall
- Flexibility and efficiency of use
- Aesthetic and minimalist design
- Help users recognize, diagnose, and recover from errors
- Help and documentation
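In practice, the output of a heuristic evaluation is a list of findings, each tied to a violated heuristic and usually given a severity rating; Nielsen's commonly used scale runs from 0 (not a usability problem) to 4 (usability catastrophe). The following Python sketch shows one possible way to record and prioritize such findings; the structure and sample findings are illustrative, not a standard report format.

```python
# Minimal sketch of recording heuristic-evaluation findings with a 0-4
# severity scale (0 = not a problem, 4 = usability catastrophe).
# The structure and sample findings are illustrative.
from collections import defaultdict

findings = [
    {"heuristic": "Visibility of system status", "severity": 3,
     "note": "No progress indicator while the report uploads."},
    {"heuristic": "Error prevention", "severity": 2,
     "note": "Delete button sits directly next to Save with no confirmation."},
    {"heuristic": "Visibility of system status", "severity": 1,
     "note": "Saved state only shown in the window title."},
]

# Group findings by heuristic and sort by worst severity to prioritize fixes.
by_heuristic = defaultdict(list)
for f in findings:
    by_heuristic[f["heuristic"]].append(f)

for heuristic, items in sorted(by_heuristic.items(),
                               key=lambda kv: -max(f["severity"] for f in kv[1])):
    print(heuristic)
    for f in sorted(items, key=lambda f: -f["severity"]):
        print(f"  [{f['severity']}] {f['note']}")
```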
Automated expert review
Similar to expert reviews, automated expert reviews provide usability testing through the use of programs that are given rules for good design and heuristics. Although an automated review might not provide as much detail and insight as a review by people, it can be finished more quickly and more consistently. The idea of creating surrogate users for usability testing is an ambitious direction for the artificial intelligence community.
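As a minimal illustration of the idea, the sketch below implements a single automated rule (flagging images without alternative text) using Python's standard HTML parser. Real automated review tools apply far larger rule sets covering layout, contrast, labeling, and more; this example only conveys the general approach, and the sample page is invented.

```python
# Minimal sketch of an automated review rule, assuming HTML input and a single
# heuristic-style check (images should carry alternative text).
from html.parser import HTMLParser


class MissingAltChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "img" and "alt" not in dict(attrs):
            self.problems.append("img without alt text at line "
                                 f"{self.getpos()[0]}")


page = """
<html><body>
  <img src="logo.png" alt="Company logo">
  <img src="chart.png">
</body></html>
"""

checker = MissingAltChecker()
checker.feed(page)
print(checker.problems)   # -> ['img without alt text at line 4']
```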
A/B testing
In web development and marketing, A/B testing or split testing is an experimental approach to web design (especially user experience design), which aims to identify changes to web pages that increase or maximize an outcome of interest (e.g., click-through rate for a banner advertisement). As the name implies, two versions (A and B) are compared, which are identical except for one variation that might impact a user's behavior. Version A might be the one currently used, while version B is modified in some respect. For instance, on an e-commerce website the purchase funnel is typically a good candidate for A/B testing, as even marginal improvements in drop-off rates can represent a significant gain in sales. Significant improvements can be seen through testing elements like copy text, layouts, images and colors.
Areas typically improved through A/B testing include algorithms, visuals, and workflow processes.[9]
Multivariate testing or bucket testing is similar to A/B testing but tests more than two versions at the same time.
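For illustration, the following Python sketch compares the conversion rates of two variants with a two-proportion z-test, using invented visitor and conversion counts. Real experiments also require a pre-planned sample size and stopping rule; this shows only the final comparison step.

```python
# Minimal sketch of analysing an A/B test with a two-proportion z-test,
# using made-up conversion counts.
from math import sqrt, erfc

visitors_a, conversions_a = 10_000, 420   # current page
visitors_b, conversions_b = 10_000, 480   # variant with a modified checkout step

p_a = conversions_a / visitors_a
p_b = conversions_b / visitors_b
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)

# Standard error under the pooled null hypothesis, then z and two-sided p-value.
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
p_value = erfc(abs(z) / sqrt(2))

print(f"A: {p_a:.2%}  B: {p_b:.2%}  z = {z:.2f}  p = {p_value:.3f}")
```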
Number of participants
In the early 1990s, Jakob Nielsen, at that time a researcher at Sun Microsystems, popularized the concept of using numerous small usability tests—typically with only five participants each—at various stages of the development process. His argument is that, once it is found that two or three people are totally confused by the home page, little is gained by watching more people suffer through the same flawed design. "Elaborate usability tests are a waste of resources. The best results come from testing no more than five users and running as many small tests as you can afford."[10]
The claim that "five users is enough" was later described by a mathematical model[11] which states the proportion of uncovered problems U as

U = 1 - (1 - p)^n

where p is the probability of one subject identifying a specific problem and n is the number of subjects (or test sessions). As the number of subjects increases, the curve rises asymptotically towards the full set of existing problems.
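For illustration, the short Python sketch below evaluates the formula for several sample sizes using the commonly cited average problem-discovery rate of about 31% per participant; the exact value of p varies from study to study.

```python
# Minimal sketch of the Nielsen/Landauer curve U = 1 - (1 - p)^n,
# using the commonly cited p = 0.31 per participant (p varies by study).
p = 0.31
for n in (1, 2, 3, 5, 10, 15):
    uncovered = 1 - (1 - p) ** n
    print(f"n = {n:2d}: {uncovered:.0%} of problems found")
```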
In later research Nielsen's claim has been questioned using both empirical evidence[12] and more advanced mathematical models.[13] Two key challenges to this assertion are:
- Since usability is related to the specific set of users, such a small sample size is unlikely to be representative of the total population, so the data from such a small sample is more likely to reflect the sample group than the population it is meant to represent.
- Not every usability problem is equally easy to detect. Intractable problems slow down the overall process, and under these circumstances the discovery of problems progresses far more slowly than the Nielsen/Landauer formula predicts.[14]
Nielsen does not advocate stopping after a single test with five users; his point is that testing with five users, fixing the problems they uncover, and then testing the revised site with five different users is a better use of limited resources than running a single usability test with 10 users. In practice, the tests are run once or twice per week during the entire development cycle, using three to five test subjects per round, and with the results delivered within 24 hours to the designers. The number of users actually tested over the course of the project can thus easily reach 50 to 100 people. Research shows that user testing conducted by organisations most commonly involves the recruitment of 5-10 participants.[15]
In the early stage, when users are most likely to immediately encounter problems that stop them in their tracks, almost anyone of normal intelligence can be used as a test subject. In stage two, testers will recruit test subjects across a broad spectrum of abilities. For example, in one study, experienced users showed no problem using any design, from the first to the last, while naive users and self-identified power users both failed repeatedly.[16] Later on, as the design smooths out, users should be recruited from the target population.
When the method is applied to a sufficient number of people over the course of a project, the objections raised above are addressed: the sample size ceases to be small, and usability problems that arise only with occasional users are found. The value of the method lies in the fact that specific design problems, once encountered, are never seen again because they are immediately eliminated, while the parts that appear successful are tested over and over. Although the initial problems in the design may be tested by only five users, when the method is properly applied the parts of the design that worked in that initial test will go on to be tested by 50 to 100 people.
Example
A 1982 Apple Computer manual for developers advised on usability testing:[17]
- "Select the target audience. Begin your human interface design by identifying your target audience. Are you writing for businesspeople or children?"
- Determine how much target users know about Apple computers, and the subject matter of the software.
- Steps 1 and 2 permit designing the user interface to suit the target audience's needs. Tax-preparation software written for accountants might assume that its users know nothing about computers but are experts on the tax code, while such software written for consumers might assume that its users know nothing about taxes but are familiar with the basics of Apple computers.
Apple advised developers, "You should begin testing as soon as possible, using drafted friends, relatives, and new employees":[17]
Our testing method is as follows. We set up a room with five to six computer systems. We schedule two to three groups of five to six users at a time to try out the systems (often without their knowing that it is the software rather than the system that we are testing). We have two of the designers in the room. Any fewer, and they miss a lot of what is going on. Any more and the users feel as though there is always someone breathing down their necks.
Designers must watch people use the program in person, because[17]
Ninety-five percent of the stumbling blocks are found by watching the body language of the users. Watch for squinting eyes, hunched shoulders, shaking heads, and deep, heart-felt sighs. When a user hits a snag, he will assume it is "on account of he is not too bright": he will not report it; he will hide it ... Do not make assumptions about why a user became confused. Ask him. You will often be surprised to learn what the user thought the program was doing at the time he got lost.
Education
Usability testing has been a formal subject of academic instruction in different disciplines.[18] Usability testing is important to composition studies and online writing instruction (OWI).[19] Scholar Collin Bjork argues that usability testing is "necessary but insufficient for developing effective OWI, unless it is also coupled with the theories of digital rhetoric."[20]
Survey research
Survey products include paper and digital surveys, forms, and instruments that can be completed or used by the survey respondent alone or with a data collector. Usability testing is most often done on web surveys and focuses on how people interact with the survey, such as navigating the survey, entering survey responses, and finding help information. Usability testing complements traditional survey pretesting methods such as cognitive pretesting (how people understand the products), pilot testing (how the survey procedures will work), and expert review by a subject matter expert in survey methodology.[21]
In translated survey products, usability testing has shown that "cultural fitness" must be considered at the sentence and word levels and in the design of data entry and navigation,[22] and that presenting translations and visual cues for common functionalities (tabs, hyperlinks, drop-down menus, and URLs) helps to improve the user experience.[23]
See also
- Commercial eye tracking
- Component-based usability testing
- Crowdsourced testing
- Diary studies
- Don't Make Me Think
- Educational technology
- Heuristic evaluation
- ISO 9241
- RITE Method
- Software performance testing
- Software testing
- System usability scale (SUS)
- Test method
- Tree testing
- Universal usability
- Usability goals
- Usability of web authentication systems
References
- ^ Nielsen, J. (1994). Usability Engineering. Academic Press Inc. p. 165.
- ^ Mejs, Monika (2019-06-27). "Usability Testing: the Key to Design Validation". Mood Up team - software house. Retrieved 2019-09-11.
- ^ Dennis G. Jerz (July 19, 2000). "Usability Testing: What Is It?". Jerz's Literacy Weblog. Retrieved June 29, 2016.
- ^ Andreasen, Morten Sieker; Nielsen, Henrik Villemann; Schrøder, Simon Ormholt; Stage, Jan (2007). "What happened to remote usability testing?". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. p. 1405. doi:10.1145/1240624.1240838. ISBN 978-1-59593-593-9. S2CID 12388042.
- ^ Dabney Gough; Holly Phillips (2003-06-09). "Remote Online Usability Testing: Why, How, and When to Use It". Archived from the original on December 15, 2005.
- ^ a b Dray, Susan; Siegel, David (March 2004). "Remote possibilities?: international usability testing at a distance". Interactions. 11 (2): 10–17. doi:10.1145/971258.971264. S2CID 682010.
- ^ Chalil Madathil, Kapil; Greenstein, Joel S. (2011). "Synchronous remote usability testing". Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 2225–2234. doi:10.1145/1978942.1979267. ISBN 978-1-4503-0228-9. S2CID 14077658.
- ^ "Heuristic Evaluation". Usability First. Retrieved April 9, 2013.
- ^ Quin, Federico; Weyns, Danny; Galster, Matthias; Silva, Camila Costa (2023-08-09), A/B Testing: A Systematic Literature Review, arXiv, doi:10.48550/arXiv.2308.04929, arXiv:2308.04929, retrieved 2025-10-30
- ^ "Usability Testing with 5 Users (Jakob Nielsen's Alertbox)". useit.com. 2000-03-13.; references Nielsen, Jakob; Landauer, Thomas K. (1993). "A mathematical model of the finding of usability problems". Proceedings of the SIGCHI conference on Human factors in computing systems. pp. 206–213. doi:10.1145/169059.169166. ISBN 978-0-89791-575-5. S2CID 207177537.
- ^ Virzi, R. A. (1992). "Refining the Test Phase of Usability Evaluation: How Many Subjects is Enough?". Human Factors. 34 (4): 457–468. doi:10.1177/001872089203400407. S2CID 59748299.
- ^ Spool, Jared; Schroeder, Will (2001). Testing web sites: five users is nowhere near enough. CHI '01 extended abstracts on Human factors in computing systems. p. 285. doi:10.1145/634067.634236. S2CID 8038786.
- ^ Caulton, D. A. (2001). "Relaxing the homogeneity assumption in usability testing". Behaviour & Information Technology. 20 (1): 1–7. doi:10.1080/01449290010020648. S2CID 62751921.
- ^ Schmettow, Martin (1 September 2008). "Heterogeneity in the Usability Evaluation Process". Electronic Workshops in Computing. doi:10.14236/ewic/HCI2008.9.
- ^ "Results of the 2020 User Testing Industry Report". www.userfountain.com. Retrieved 2020-06-04.
- ^ Bruce Tognazzini. "Maximizing Windows".
- ^ a b c Meyers, Joe; Tognazzini, Bruce (1982). Apple IIe Design Guidelines (PDF). Apple Computer. pp. 11–13, 15.
- ^ Breuch, Lee-Ann M. Kastman; Zachry, Mark; Spinuzzi, Clay (April 2001). "Usability Instruction in Technical Communication Programs: New Directions in Curriculum Development". Journal of Business and Technical Communication. 15 (2): 223–240. doi:10.1177/105065190101500204. S2CID 61365767.
- ^ Miller-Cochran, Susan K.; Rodrigo, Rochelle L. (January 2006). "Determining effective distance learning designs through usability testing". Computers and Composition. 23 (1): 91–107. doi:10.1016/j.compcom.2005.12.002.
- ^ Bjork, Collin (September 2018). "Integrating Usability Testing with Digital Rhetoric in OWI". Computers and Composition. 49: 4–13. doi:10.1016/j.compcom.2018.05.009. S2CID 196160668.
- ^ Geisen, Emily; Bergstrom, Jennifer Romano (2017). Usability Testing for Survey Research. Cambridge: Elsevier MK Morgan Kaufmann Publishers. ISBN 978-0-12-803656-3.
- ^ Wang, Lin; Sha, Mandy (2017-06-01). "Cultural Fitness in the Usability of U.S. Census Internet Survey in Chinese Language". Survey Practice. 10 (3): 1–8. doi:10.29115/SP-2017-0018.
- ^ Sha, Mandy; Hsieh, Y. Patrick; Goerman, Patricia L. (2018-07-25). "Translation and visual cues: Towards creating a road map for limited English speakers to access translated Internet surveys in the United States". Translation & Interpreting. 10 (2): 142–158. doi:10.12807/ti.110202.2018.a10. ISSN 1836-9324.
Usability testing
Definition and Fundamentals
Core Definition
Usability testing is an empirical method used to evaluate how users interact with a product or interface by observing real users as they perform representative tasks, aiming to identify usability issues and inform design improvements.[1] This approach relies on direct observation to gather data on user behavior, rather than relying solely on expert analysis or self-reported feedback, ensuring findings are grounded in actual performance.[7] The core components of usability testing include representative users who reflect the target audience, predefined tasks that simulate real-world usage scenarios, a controlled or naturalistic environment to mimic the intended context, and metrics focused on effectiveness (accuracy and completeness of task outcomes), efficiency (resources expended to achieve goals), and satisfaction (users' subjective comfort and acceptability). These elements, as defined in ISO 9241-11, provide a structured framework for assessing whether a product can be used to achieve specified goals within a given context.[8] The term "usability testing" emerged in the early 1980s within the field of human-computer interaction (HCI), building on foundational work like John Bennett's 1979 exploration of usability's commercial impact and methods such as the think-aloud protocol introduced by Ericsson and Simon in 1980.[5] Basic metrics commonly employed include task completion rates (percentage of users successfully finishing tasks), time on task (duration required to complete activities), and error rates (frequency of mistakes or deviations), which quantify performance and highlight areas needing refinement.[9] Usability testing plays a vital role in the broader user experience (UX) design process by validating designs iteratively.[10]
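As an illustration of how these basic metrics are computed, the following Python sketch tabulates completion rate, mean time on task, and error rate from a handful of session records; the record format and numbers are invented for the example.

```python
# Minimal sketch of computing the basic metrics named above from per-session
# records. The record format is illustrative, not a standard one.
from statistics import mean

sessions = [
    {"completed": True,  "seconds": 148, "errors": 1},
    {"completed": True,  "seconds": 102, "errors": 0},
    {"completed": False, "seconds": 305, "errors": 4},
    {"completed": True,  "seconds": 181, "errors": 2},
    {"completed": True,  "seconds": 126, "errors": 0},
]

completion_rate = sum(s["completed"] for s in sessions) / len(sessions)
time_on_task = mean(s["seconds"] for s in sessions)
error_rate = mean(s["errors"] for s in sessions)

print(f"task completion rate: {completion_rate:.0%}")
print(f"mean time on task:    {time_on_task:.0f} s")
print(f"errors per session:   {error_rate:.1f}")
```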
Key Principles and Goals
Usability testing is fundamentally user-centered, emphasizing the direct involvement of target users to ensure designs align with their needs, behaviors, and contexts rather than relying solely on designer assumptions.[1] This principle prioritizes empirical evidence from real users over theoretical speculation, fostering products that are intuitive and accessible.[11] Observation of real users as they interact with prototypes or systems forms the core component of this approach.[1] A key principle is iterative testing, conducted repeatedly across design stages to incorporate feedback and refine interfaces progressively, thereby minimizing major overhauls later.[12] During sessions, the think-aloud protocol encourages participants to verbalize their thoughts in real time, uncovering cognitive processes, confusions, and decision-making paths that might otherwise remain hidden.[13] To maintain objectivity, facilitators adhere to the principle of avoiding leading questions, which could bias responses and skew insights into genuine user experiences. The primary goals of usability testing are to identify pain points—such as confusing navigation or frustrating interactions—that hinder user tasks, validate design assumptions against actual behavior, and provide actionable data to inform iterative improvements.[14] These objectives ensure that products evolve to better meet user expectations, enhancing overall adoption and success.[15] According to the ISO 9241-11 standard, usability is measured across three dimensions: effectiveness, which assesses the accuracy and completeness of goal achievement by specified users; efficiency, which evaluates the resources (like time or effort) expended relative to those goals; and satisfaction, which gauges user comfort, acceptability, and positive attitudes toward the system. By detecting and addressing issues early in the development process, usability testing plays a crucial role in reducing long-term costs, as fixing problems post-launch can be 100 times more expensive than during initial design phases, with documented returns on investment often exceeding 100:1.[16]
Distinctions from Related Practices
What Usability Testing Is Not
Usability testing is not a one-time market research activity but an ongoing process of empirical evaluation integrated into the product development lifecycle to iteratively identify and address user experience issues.[17] Unlike market research, which often involves polling or surveys to gauge broad consumer opinions and preferences for strategic planning, usability testing relies on direct observation of user behaviors during task performance to reveal practical interaction problems.[18] This distinction ensures that usability testing supports continuous refinement rather than serving as a singular checkpoint for market validation.[19] Usability testing does not primarily focus on aesthetics or subjective preference polling but on assessing functional usability—such as task effectiveness, efficiency, and error rates—through observed user interactions.[1] While visual appeal can influence perceptions of usability via the aesthetic-usability effect, where attractive designs are deemed easier to use even if functionally flawed, testing prioritizes measurable performance over stylistic judgments.[20] Preference polling, by contrast, captures what users like or dislike without evaluating how well they can accomplish goals, making it unsuitable for uncovering core usability barriers.[18] A key boundary is that usability testing differs from focus groups, which collect attitudinal data through group discussions on needs, feelings, and opinions rather than behavioral evidence of product use.[18] In focus groups, participants react to concepts or demos in a social setting, often leading to groupthink or hypothetical responses that do not reflect real-world task execution.[18] Usability testing, however, involves individual users performing realistic tasks on prototypes or live systems under observation, emphasizing empirical data over verbal feedback to pinpoint interaction failures.[1] Usability testing is also distinct from beta testing, which occurs late in development with a wider audience to detect bugs, compatibility issues, and overall viability in real environments rather than preemptively evaluating iterative design usability.[21] While beta testing gathers broad feedback on a near-final product to inform minor adjustments before full launch, it lacks the controlled, task-focused structure of usability testing, which is conducted earlier and repeatedly during development to optimize user interfaces from the outset.[21] Finally, usability testing is not a substitute for accessibility testing, although the two can overlap in promoting inclusive experiences.[22] Accessibility testing specifically verifies compliance with standards like WCAG to ensure usability for people with disabilities, such as through screen reader compatibility or keyboard navigation, whereas general usability testing targets broader ease-of-use without guaranteeing accommodations for diverse abilities.[22] Relying solely on usability testing risks overlooking barriers for marginalized users, necessitating dedicated accessibility evaluations alongside it.[22]
Comparisons with Other UX Evaluation Methods
Usability testing stands out from surveys in user experience (UX) evaluation by emphasizing direct observation of user behavior during interactions with a product or interface, rather than relying on self-reported attitudes or recollections. Surveys, being attitudinal methods, are efficient for gathering large-scale feedback on user preferences, satisfaction, or perceived ease of use, but they are prone to biases such as social desirability or inaccurate recall, which can obscure actual usage patterns.[23] In contrast, usability testing uncovers discrepancies between what users say they do and what they actually do, enabling the identification of friction points like confusing navigation that might not surface in questionnaire responses.[24] This behavioral approach, often involving think-aloud protocols, provides richer, context-specific insights into task completion challenges.[1] Compared to web analytics, usability testing delivers qualitative depth to complement the quantitative breadth of analytics tools, which track metrics such as page views, bounce rates, and time on task across vast user populations but offer no explanatory context for those behaviors. Analytics excel at revealing aggregate trends, like high drop-off rates on a checkout page, yet fail to explain underlying causes, such as unclear labeling or cognitive overload.[25] Usability testing, through moderated sessions, elucidates these "why" questions by capturing real-time user struggles and successes, though it typically involves smaller sample sizes and thus requires triangulation with analytics for broader validation.[23] This distinction highlights usability testing's role in exploratory phases, where understanding user intent and errors is paramount, versus analytics' strength in ongoing performance monitoring. Unlike A/B testing, which compares two or more design variants by measuring objective outcomes like conversion rates or click-throughs in live environments to determine relative effectiveness, usability testing focuses on diagnosing systemic usability issues rather than pitting options against each other. A/B testing is particularly valuable for optimizing specific elements, such as button colors, by exposing changes to large audiences and isolating variables for statistical significance, but it often misses deeper problems like overall workflow inefficiencies that affect long-term engagement.[26] Usability testing, by contrast, reveals why a design fails through iterative observation, informing holistic improvements that can yield larger gains in user satisfaction and efficiency.[27] These methods are not mutually exclusive and can be integrated to enhance UX evaluation; for example, administering surveys immediately after a usability testing session allows researchers to quantify attitudinal metrics, such as perceived usefulness via standardized scales like the System Usability Scale (SUS), while building on the behavioral data already collected.[28] This hybrid approach leverages the strengths of each—behavioral observation for diagnosis and self-reports for validation—leading to more robust insights without the limitations of relying on a single technique.[29]
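As an illustration of the attitudinal side of such a hybrid approach, the following Python sketch scores a single SUS questionnaire response using the standard scoring rule (odd items contribute the rating minus one, even items five minus the rating, and the sum is scaled by 2.5); the example ratings are invented.

```python
# Minimal sketch of scoring one System Usability Scale (SUS) response.
# Items are rated 1-5; odd items contribute (rating - 1), even items (5 - rating),
# and the sum is multiplied by 2.5 to give a 0-100 score.
def sus_score(ratings):
    if len(ratings) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = 0
    for i, r in enumerate(ratings, start=1):
        total += (r - 1) if i % 2 else (5 - r)
    return total * 2.5

# Example response from one participant (illustrative numbers).
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))   # -> 85.0
```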
Historical Development
Origins in Human-Computer Interaction
Usability testing emerged as a core practice within human-computer interaction (HCI) during the 1970s and 1980s, driven by pioneers who emphasized empirical evaluation of user interfaces to improve system effectiveness. Ben Shneiderman, through his early experimental studies on programmer behavior and interface design at the University of Maryland, advocated for direct observation of users to identify usability issues, laying groundwork in works like his 1977 investigations into flowchart utility and command languages. Similarly, Don Norman, at the University of California San Diego and later Apple, integrated cognitive models into interface evaluation, promoting user-centered approaches that tested how mental models aligned with system behaviors during the late 1970s and early 1980s. These efforts shifted HCI from theoretical speculation to practical, user-involved assessment, influenced by the rapid proliferation of personal computing. The methodological foundations of usability testing drew heavily from cognitive psychology and ergonomics, adapting experimental techniques to evaluate human-system interactions. Cognitive psychology contributed protocols like think-aloud methods, inspired by Ericsson and Simon's 1980 work on verbal protocols, which allowed real-time observation of user thought processes during tasks. Ergonomics, or human factors engineering, provided iterative testing cycles, as seen in Al-Awar et al.'s 1981 study on tutorials for first-time computer users, where user trials led to rapid redesigns based on error rates and task completion times.[30] A seminal example was the lab-based user studies at Xerox PARC during the development of the Xerox Star workstation from 1976 to 1982, where human factors experiments—such as selection scheme tests—refined mouse interactions and icon designs through controlled observations and qualitative feedback.[31] The establishment of formal usability labs in the 1980s marked a professionalization of these practices, with IBM leading the way through dedicated facilities at its T.J. Watson Research Center. John Gould and colleagues implemented early lab setups for empirical testing, as detailed in their 1983 CHI paper, which outlined principles like early user involvement and iterative prototyping based on observed performance metrics from 1980 onward.[32] These labs facilitated systematic data collection via video recordings and performance logging, influencing industry standards for evaluating interfaces like text editors and full-screen systems.[33] A pivotal standardization came with Jakob Nielsen's 1993 book Usability Engineering, which synthesized these origins into a comprehensive framework for integrating testing into software development lifecycles, emphasizing discount methods and quantitative metrics like success rates from small user samples. This work built on the decade's empirical foundations to make usability testing accessible beyond research labs.
Evolution and Modern Influences
In the 2000s, usability testing underwent significant adaptation to accommodate the rapid proliferation of web-based applications and mobile devices, driven by the need for faster development cycles in dynamic digital environments. As internet technologies accelerated web development—often compressing timelines to mere months—practitioners shifted toward iterative, "quick and clean" testing methods using prototypes to evaluate user-centered design early and frequently.[34] This era also saw the rise of testing for mobile interfaces, such as PDAs and cell phones, which emphasized real-world conditions like multitasking and small screens, moving beyond traditional lab settings to more naturalistic simulations.[34] Concurrently, the adoption of agile development methodologies in the early 2000s addressed limitations of sequential processes like waterfall, enabling usability testing to integrate into short sprints through discount engineering techniques that prioritized rapid qualitative feedback.[35] Around 2010, the widespread availability of high-speed internet and advanced screen-sharing tools catalyzed the proliferation of remote usability testing, allowing researchers to reach diverse, global participants without the constraints of physical labs. This shift was particularly impactful for web and software evaluation, as tools emerged in the mid-2000s to facilitate synchronous and asynchronous sessions, capturing real-time behaviors in users' natural environments.[36] By debunking early myths about distractions and data quality, remote methods gained traction for their cost-efficiency and ability to simulate authentic usage contexts, complementing in-lab approaches for broader validation.[37] Key milestones in this evolution include the foundational work of the Nielsen Norman Group, established in 1998, which popularized discount usability practices and empirical testing principles that influenced iterative methods across industries by the 2000s.[4] The launch of UserTesting.com in 2007 marked a pivotal advancement in remote testing accessibility, providing on-demand platforms that connected organizations with global user networks for video-based feedback, ultimately serving thousands of enterprises and capturing millions of testing minutes annually.[38] Entering the 2020s, usability testing has increasingly incorporated artificial intelligence and automation to enhance scalability and issue detection, with machine learning and large language models automating behavioral analysis and predictive insights from user interactions. A systematic literature review of 155 publications from 2014 to 2024 (as of April 2024) highlights a surge in AI applications for automated usability evaluation, particularly for detecting issues and assessing affective states, though most remain at the prototype stage with a focus on desktop and mobile devices.[39] This integration promises more efficient, data-driven reviews while building on core human-computer interaction principles of empirical user focus.
Core Methods and Approaches
Moderated and In-Person Testing
Moderated and in-person usability testing involves a facilitator guiding participants through tasks in a face-to-face setting, typically within a controlled environment to observe user interactions directly.[1] This approach emphasizes interactive facilitation, where the moderator can adjust the session dynamically based on participant responses.[40] The setup for such testing often utilizes a dedicated usability lab divided into two rooms: a user testing room and an adjacent observation room separated by a one-way mirror.[41] In the user room, the participant interacts with the product on a testing laptop equipped with screen-recording software, a webcam to capture facial expressions, and sometimes multiple cameras for different angles, including overhead views for activities like card sorting.[1][41] The facilitator may sit beside the participant—often to the right for right-handed users—or communicate via a loudspeaker from the observation room, while observers in the second room view the session live through the mirror or duplicated screens on external monitors.[1][41] Elements like a lavaliere microphone ensure clear audio capture, and simple additions such as a plant help create a less clinical atmosphere.[41] During the process, the moderator introduces the session, explains the think-aloud protocol—where participants verbalize their thoughts and actions in real time—and assigns realistic tasks, such as troubleshooting an error message on a device.[13][1] The participant performs these tasks while narrating their reasoning, allowing the moderator to probe for clarification with follow-up questions like "What are you thinking right now?" without leading the user.[13] This verbalization reveals cognitive processes, frustrations, and decision-making, while the moderator notes behaviors and ensures the session stays on track, typically lasting 30-60 minutes per participant.[13][1] Key advantages include the ability to provide real-time clarification and intervention, enabling deeper insights into user motivations that might otherwise go unnoticed.[40] In-person observation also captures non-verbal cues, such as body language and facial expressions, which help interpret emotional responses and hesitation more accurately than remote methods.[40] These elements contribute to richer qualitative data, making it particularly effective for exploratory studies.[1] A common variant is hallway testing, an informal adaptation where the moderator recruits nearby colleagues or passersby for quick, low-fidelity sessions in non-lab settings like office hallways or cafes.[42] This guerrilla-style approach prioritizes speed and accessibility, often involving 3-5 participants to identify major usability issues early in design iterations.[43]
Remote and Unmoderated Testing
Remote unmoderated usability testing involves participants completing predefined tasks on digital products independently, without real-time interaction from a researcher, using specialized software to deliver instructions, record sessions, and collect data asynchronously.[44] This approach evolved from traditional in-person methods to facilitate testing across diverse locations and schedules.[40] Participants receive pre-recorded or scripted tasks via the platform, follow automated prompts for think-aloud narration or responses, and submit recordings upon completion, allowing researchers to review qualitative videos and quantitative metrics such as task success rates later.[44] The process typically follows a structured sequence: first, defining study goals and participant criteria; second, selecting appropriate software; third, crafting clear task descriptions and questions; fourth, piloting the test to refine elements; fifth, recruiting suitable users from panels or custom sources; and sixth, analyzing the aggregated results for insights into user behavior and pain points.[44] Common tools include platforms like UserZoom, which supports screen capture, task recording, and integration with prototyping tools such as Miro, and Lookback, which enables voice and screen recording with recruitment via third-party panels like User Interviews.[45] These platforms automate data collection, including timestamped notes and auto-transcripts, to streamline asynchronous submissions without requiring live facilitation.[45] Key advantages of remote unmoderated testing include enhanced scalability, as multiple participants can engage simultaneously on their own timelines, enabling studies with dozens or hundreds of users in hours rather than days.[44] It promotes geographic diversity by allowing recruitment from global populations without travel constraints, reflecting varied user contexts more authentically.[40] Post-2010s advancements in accessible tools have driven cost savings, eliminating expenses for facilities, travel, and scheduling coordinators, making it a viable option for resource-limited teams.[40] However, challenges arise from the absence of real-time intervention, as researchers cannot clarify ambiguities or adapt tasks mid-session, potentially leading to misinterpreted instructions or incomplete data.[44] Technical issues, such as software incompatibilities, poor recording quality, or participant device limitations, can further compromise results without on-the-fly troubleshooting.[44] Additionally, participants may exhibit lower engagement, resulting in less nuanced behavioral insights compared to moderated formats, particularly for complex or exploratory tasks.[44]
Expert-Based and Automated Reviews
Expert-based reviews in usability testing involve experienced practitioners applying established principles to inspect interfaces without direct user involvement, serving as efficient supplements to user-centered methods. These approaches, such as heuristic evaluation and cognitive walkthroughs, leverage expert knowledge to identify potential usability issues early in the design process. Automated reviews, on the other hand, use software tools to scan for violations of standards, providing quick, scalable feedback on aspects like accessibility and performance that influence usability. Together, these methods enable rapid iteration but are best combined with empirical user testing for validation. Heuristic evaluation is an informal usability inspection technique where multiple experts independently assess an interface against a predefined set of heuristics to uncover problems. Developed by Jakob Nielsen and Rolf Molich in 1990, the method typically involves 3-5 evaluators reviewing the design and listing violations, with severity ratings assigned to prioritize fixes. The process is cost-effective and can detect about 75% of usability issues when using 5 evaluators, though it risks missing issues unique to novice users. Nielsen refined the heuristics in 1994 into 10 general principles based on factor analysis of 249 usability problems, enhancing their applicability across interfaces. These heuristics include:
- Visibility of system status: The system should always keep users informed about what is happening through appropriate feedback.[46]
- Match between system and the real world: The system should speak the users' language, with words, phrases, and concepts familiar to the user.[46]
- User control and freedom: Users often choose system functions by mistake and will need a clearly marked "emergency exit" to leave the unwanted state.[46]
- Consistency and standards: Users should not have to wonder whether different words, situations, or actions mean the same thing.[46]
- Error prevention: Even better than good error messages is a careful design which prevents a problem from occurring in the first place.[46]
- Recognition rather than recall: Minimize the user's memory load by making objects, actions, and options visible.[46]
- Flexibility and efficiency of use: Accelerators—unseen by the novice user—may often speed up the interaction for the expert user.[46]
- Aesthetic and minimalist design: Dialogues should not contain information which is irrelevant or rarely needed.[46]
- Help users recognize, diagnose, and recover from errors: Error messages should be expressed in plain language, precisely indicate the problem, and constructively suggest a solution.[46]
- Help and documentation: Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation.[46]
