Evaluating User Interfaces - University of Florida
Evaluating User Interfaces Chapter 4 Introduction Why evaluate? Designers become too entranced What I like Sunk cost fallacy Experienced designers know extensive testing is required How do you test? A web site? Air traffic control system?
How much would you budget for testing? When do you test? Are you required to test? (e.g. military, government, safety) What does testing not do? Guarantee perfection Hard to finish testing Difficult to test unusual situations Military attack Heavy load (e.g. voting) Simulate accurate situations E.g. driving games, military games, medical sims
Expert Review Colleagues or Customers Ask for opinions Considerations: What is an expert? User or designer? Half day to week Heuristic Evaluation Give Expert heuristic, ask them to evaluate Eight Golden Rules
Specific to application area Box 4.1 Heuristics for gaming (Pinelle 2008) Provide consistent responses to users actions Allow users to customize video and audio setting, difficulty, and game speed Provide users with information on game status. Guidelines Review Interface is checked against organizational guidelines. Military Government Security Education
Consistency Inspection Verify consistency across family of interfaces Check terminology, fonts, color, layout, i/o formats Look at documentation and online help Also can be used in conjunction with software tools Cognitive Walkthrough Experts simulate being users going through the interface Tasks are ordered by frequency Good for interfaces that can be learned by exploratory
browsing (Wharton 1994) [novices] Usually walkthrough by themselves, then report their experiences (written, video) to designers' meeting Useful if application is geared for a group the designers might not be familiar with: Military, Assistive Technologies Metaphors of Human Thinking (MOT) Experts consider metaphors for five aspects of human thinking Habit Stream of thought Awareness and Associations Relation between utterances and thought
Knowing Appears better than cognitive walkthrough and heuristic evaluation Formal Usability Inspection Experts hold courtroom-style meeting Each side gives arguments (in an adversarial format) There is a judge or moderator Extensive and expensive Good for novice designers and managers Expert Reviews
Can be conducted at any time in the design process Focus on being comprehensive rather than being specific on improvements Example review recommendations Change log in procedure (from 3 to 5 minutes, because users were busy) Reordering sequence of displays, removing nonessential actions, providing feedback. Also come up with features for future releases Expert Review Placed in situation similar to user Take training courses
Read documentation Take tutorials Try the interface in a realistic work environment (complete with noise and distractions) Bird's-eye view Studying a full set of printed screens laid on the floor or pinned to the walls See topics such as consistency Software tools WebTango Usability Testing and Labs
1980s, testing was a luxury (but deadlines crept up) Usability testing was an incentive for deadlines Fewer project overlays Sped up projects Cost savings Rubin and Chisnell 2008, Sherman 2006, Dumas and Redish 1999
Labs are different than academia Less general theory More practical studies Usability Labs IBM early leader Microsoft next (>25 labs) Now hundreds of companies From http://www.ergosign.de/ Staff
Expertise in testing (psych, hci, comp sci) 10 to 15 projects per year Meet with UI architect to plan testing (Figure 4.2) Participate in early task analysis and design reviews T 2-6 weeks, creates study design and test plan E.g. Who are participants? Beta testers, current customers, in company staff, advertising T -1 week, pilot test (1-3 participants)
Participants Labs categorize users based on: Computing background Experience with task Motivation Education Ability with the language used in the interface Controls for Physical concerns (e.g. eyesight, handedness, age) Experimental conditions (e.g. time of day, physical surroundings, noise, temperature, distractions) Recording Participants
Logging is important, yet tedious Software to help (Live Logger, Morae, Spectator) Powerful to see people use your interface New approaches: eye tracking IRB items Focus users on interface Tell them the task, duration Thinking Aloud Concurrent think aloud Invite users to think aloud Nothing they say is wrong Don't interrupt, let the user talk
Spontaneous, encourages positive suggestions Can be done in teams of participants Retrospective think aloud Asks people afterwards what they were thinking Issues with accuracy Does not interrupt users (timings are more accurate) Types of Usability Testing Paper mockups and prototyping Inexpensive, rapid, very productive Low fidelity is sometimes
better (Snyder, 2003) Mythical Man Month Prototype to throw away http://expressionflow.com/wp-content/uploads/2007/05/paper-mock-up.png http://user.meduni-graz.at/andreas.holzinger/holzinger/papers%20en/ Types of Usability Testing Discount usability testing Test early and often (with 3 to 6 testers) Pros: Most serious problems can be found with 6 testers. Good for formative evaluation (early) Cons: Complex systems can't be tested this way. Not good for summative evaluation (late)
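The claim that a handful of testers finds most serious problems is often explained with a simple independence model: if each tester finds a given problem with probability p, then n testers are expected to find a share 1 - (1 - p)^n of the problems. The sketch below assumes that model. The default p = 0.31 is an often-quoted average from Nielsen and Landauer's work, not a figure from these slides, and the function name is hypothetical.

```python
def proportion_found(n_testers: int, p_single: float = 0.31) -> float:
    """Expected share of usability problems found by n independent testers.

    Assumes every tester finds any given problem with the same probability
    p_single, independently of the others. Real projects vary widely, so
    treat the default of 0.31 as an illustrative assumption.
    """
    if n_testers < 0:
        raise ValueError("n_testers must be non-negative")
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    return 1.0 - (1.0 - p_single) ** n_testers


if __name__ == "__main__":
    # Show how quickly the expected coverage grows for 1 to 6 testers.
    for n in range(1, 7):
        print(f"{n} testers: {proportion_found(n):.0%} of problems expected")
```

Under these assumptions the curve flattens quickly, which is the usual argument for testing early and often with small groups. The model says nothing about complex systems where p is much lower, which matches the "Cons" listed above.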
Competitive usability testing Compare against prior or competitors' versions Experimenter bias, be careful to not prime the user Within-subjects is preferred Types of Usability Testing Universal usability testing Test with highly diverse Users (experience levels, ability, etc.)
Platforms (mac, pc, linux) Hardware (old (how old is old?) -> latest) Networks (dial-up -> broadband) Field tests and portable labs Tests UI in realistic environments Beta tests Types of Usability Testing Remote usability testing (via web) Recruited via online communities, email Large n Difficulty in logging, validating data Software can help (NetMeeting, WebEx,
Sametime) Can You Break this Test Challenge testers to break a system Games, security, public displays (MOSI) Limitations Focuses on first-time users Limited coverage of interface features Emergency (military, medical, mission-critical) Rarely used features Difficult to simulate realistic conditions Testing mobile devices
Signal strength Batteries User focus Yet formal studies of usability testing have identified Cost savings Return on investment (Sherman 2006, Bias and Mayhew 2005) Formal usability test reports Survey Instruments Questionnaires Paper or online (e.g. surveymonkey.com) Easy to grasp for many people
The power of many can be shown 80% of the 500 users who tried the system liked Option A 3 out of the 4 experts like Option B Success depends on Clear goals in advance Focused items Designing survey questions Ideally Based on existing questions Reviewed by colleagues Pilot tested
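One way to make "the power of many" concrete is to put a confidence interval around each proportion. The sketch below uses the Wilson score interval, which is a standard choice for proportions but is an assumption here (the slides do not name a method). It compares the two examples above: 400 of 500 users (80%) versus 3 of 4 experts.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    users_low, users_high = wilson_interval(400, 500)   # 80% of 500 users
    experts_low, experts_high = wilson_interval(3, 4)   # 3 of 4 experts
    print(f"Users liking Option A:   {users_low:.2f} to {users_high:.2f}")
    print(f"Experts liking Option B: {experts_low:.2f} to {experts_high:.2f}")
```

With these inputs the survey interval is narrow (about 0.76 to 0.83), while the expert interval is very wide (about 0.30 to 0.95). The expert opinions may still be more informed, but the numbers alone carry far less statistical weight.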
Direct activities are better than gathering statistics Fosters unexpected discoveries Important to pre-test questions Understandability Bias Likert Scales Most common methodology Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree 5, 7, 9-point scales Examples
Improves my performance in book searching and buying Enables me to search and buy books faster Makes it easier to search for and purchase books What does 1.5 mean? Most Used Likert-scales Questionnaire for User Interaction Satisfaction E.g. questions How long have you worked on this system? Learning to operate Difficult 1 2 3 4 5 6 7 8 9 Easy
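The question "What does 1.5 mean?" points at a common pitfall: Likert responses are ordered categories, so an average can land between two labels and hide how the answers are spread. A minimal sketch of one way to summarize a single item follows. It reports the full distribution and the median next to the mean. The function name and the sample responses are hypothetical.

```python
from collections import Counter
from statistics import mean, median


def summarize_likert(responses: list[int], points: int = 9) -> dict:
    """Summarize one Likert item scored from 1 to `points`."""
    if not responses:
        raise ValueError("no responses to summarize")
    if any(r not in range(1, points + 1) for r in responses):
        raise ValueError(f"responses must be integers from 1 to {points}")
    counts = Counter(responses)
    # List every scale level, including the ones nobody chose.
    distribution = {level: counts.get(level, 0) for level in range(1, points + 1)}
    return {
        "n": len(responses),
        "distribution": distribution,
        "median": median(responses),
        "mean": mean(responses),
    }


if __name__ == "__main__":
    # Hypothetical answers to a 9-point "Learning to operate" item
    # (1 = Difficult, 9 = Easy).
    summary = summarize_likert([9, 8, 8, 7, 2, 1, 9, 8])
    print(summary)
```

In this made-up sample the mean is pulled down by two very low ratings, while the distribution shows that most participants answered near the "Easy" end. Reporting the distribution makes that split visible.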
System Usability Scale (SUS) Brooke 1996 Post-Study System Usability Questionnaire Computer System Usability Questionnaire Software Usability Measurement Inventory Website Analysis and Measurement Inventory Mobile Phone Usability Questionnaire Questionnaire websites
Gary Perlman's website Jurek Kirakowski's website Validity, Reliability Bipolar Semantically Anchored Coleman and Williges (1985) Pleasant versus Irritating Hostile 1 2 3 4 5 6 7 Friendly If needed, take existing questionnaires and alter them slightly for your application Acceptance Tests
Set goals for performance Objective Measurable Examples Mean time between failures (e.g. MOSI) Test cases Response time requirements Readability (including documentation and help)
Satisfaction Comprehensibility Let's discuss We want the software to be user friendly. How could we rephrase it? Use a metric such as Shneiderman's goals for interface design
Time for users to learn specific function Speed of Task performance Rate of Errors User retention Subjective satisfaction Examples (page 155 in book) Test A The participants will be
35 adults (25-45 years old) Native speakers with no disabilities Hired from an employment agency Moderate web-use experience (1-5 hours/week) for at least one year >30 of the 35 should complete the benchmark tests within 30 minutes Test B The participants will be 10 older adults 55-65 10 adult users with varying motor, visual, and auditory disabilities 10 adult users who are recent immigrants and use English as a second language Test C
Ten participants will be recalled after one week Carry out new set of benchmark tests In 20 minutes, at least 8 should be able to complete tasks Acceptance Tests By completing the acceptance tests Can be part of contractual fulfillment Demonstrate objectivity Different than usability tests More adversarial Neutral party should conduct them Ex. Video game and smartphone companies
App Store, Microsoft, Nintendo, Sony Evaluation during use Evaluation methods after a product has been released Interviews with individual users Get very detailed on specific concerns Costly and time-consuming Focus group discussions Patterns of usage Certain people can dominate or sway opinion Targeted focus groups Case Study
45 min interviews with 66 of the 4300 users of an internal message system Happy with: legibility, convenience, online access Concerns with: reliability, confusion, and accessibility 42 enhancements that differed from what designers thought they should implement. How would you change the system architecture for suggested changes? Could you change your projects easily? Continuous Logging The system itself logs user usage Video game example Other examples
Track frequency of errors (gives an ordered list of what to address via tutorials, training, text changes, etc.) Speed of performance Track which features are used and which are not Web Analytics Privacy? What gets logged? Opt-in/out? What about companies? Online and Telephone Help Users enjoy having people ready to help (realtime chat online or via telephone) E.g. Netflix has 8.4 million customers, how many telephone customer service reps? 375
Expensive, but higher customer satisfaction Cheaper versions are bug report systems Windows, Chrome, Bugzilla Automated Evaluation Software for evaluation Low level: Spelling, term concordance Metrics: number of displays, tabs, widgets, links E.g. Tullis's Display Analysis Program (1988) Inputs: alphanumeric screen designs Output ex.: Upper-case letters: 77%, the percentage of upper-case letters is high. Consider using more lower-case letters, since text printed in normal
upper and lower case letters is read about 13% faster than all upper case. World Wide Web Consortium Markup Validation US NIST Web Metrics Testbed Section 508 for accessibility New research areas: Evaluation of mobile platforms
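The upper-case metric quoted above is simple enough to sketch. The code below is not Tullis's Display Analysis Program, only a minimal illustration of the same kind of low-level check on an alphanumeric screen design. The function names and the 50% warning threshold are assumptions chosen for the example.

```python
def uppercase_percentage(screen_text: str) -> float:
    """Percentage of alphabetic characters on a screen that are upper case."""
    letters = [ch for ch in screen_text if ch.isalpha()]
    if not letters:
        return 0.0  # nothing to measure on a screen without letters
    upper = sum(1 for ch in letters if ch.isupper())
    return 100.0 * upper / len(letters)


def uppercase_advice(screen_text: str, threshold: float = 50.0) -> str:
    """Return a short report; `threshold` is a hypothetical cut-off."""
    pct = uppercase_percentage(screen_text)
    if pct > threshold:
        return (f"Upper-case letters: {pct:.0f}%, the percentage of upper-case "
                "letters is high. Consider using more lower-case letters.")
    return f"Upper-case letters: {pct:.0f}%."


if __name__ == "__main__":
    # Hypothetical screen design used only for illustration.
    screen = "ENTER PATIENT NAME:\nPRESS F1 FOR help"
    print(uppercase_advice(screen))
```

Checks like this are cheap to run on every screen in a design, but they only flag candidates for review. Whether a given screen is actually hard to read still needs a human judgment or a user test.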