# ASSESSMENT QUALITY

## ITEM QUALITY

- Item difficulty
- Item discrimination
- Item scoring

## ITEM DIFFICULTY (1)

Difficulty index for dichotomous (0/1) items:

Difficulty (p) = (Number of students responding correctly to the item) / (Total number of students responding to the item)

- Higher values of p indicate item easiness (p = .80: 80% of students answered the item correctly)

- Lower values of p indicate item difficulty (p = .20: only 20% of students answered the item correctly)
- On a four-option multiple-choice test, a p value of .25 would be expected by chance (due to guessing)

## ITEM DIFFICULTY (2)

Difficulty index for rubric-scored (polytomous) items:

Difficulty = (Sum of student scores on the item for all students) / (Total number of students responding to the item)

i.e. the average student score for the item
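Both difficulty indices are simple ratios, so they are easy to compute directly. A minimal sketch in Python (the function names are my own, not from the slides):

```python
def dichotomous_difficulty(responses):
    """p-value: proportion of students answering a 0/1 item correctly."""
    return sum(responses) / len(responses)

def polytomous_difficulty(scores, max_points=None):
    """Average student score on a rubric-scored item.
    Dividing by max_points rescales to a 0-1, p-value-like index."""
    average = sum(scores) / len(scores)
    return average / max_points if max_points else average

# Ten students on a 0/1 item, eight correct:
print(dichotomous_difficulty([1, 1, 1, 1, 1, 1, 1, 1, 0, 0]))    # 0.8
# Ten students on a 4-point rubric item:
print(polytomous_difficulty([4, 4, 3, 4, 3, 4, 4, 3, 4, 2]))     # 3.5
print(polytomous_difficulty([4, 4, 3, 4, 3, 4, 4, 3, 4, 2], 4))  # 0.875
```

The last call shows the rescaling mentioned below: dividing the 4-point average (3.5) by the points possible gives a value comparable to a dichotomous p.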

- Higher values indicate item easiness: on a 4-point rubric-scored item, a difficulty of 3.5 means most students achieved a high score
- Lower values indicate item difficulty: on a 4-point rubric-scored item, a difficulty of 1.5 means most students scored low
- Can be made comparable to p-values for dichotomous items by dividing by the total number of points possible

## WHAT ITEM DIFFICULTY TELLS US

I evaluated the difficulty of the items in my assessment by finding the percentage of students who passed each item. I found that most of my items had a p value (or difficulty index) between .2 and .8, which means that they were neither too easy nor too difficult. Additionally, the items that were lower on the learning progression had higher values, meaning that more students passed the easier items, and the items that were higher on the learning progression had lower values, meaning that fewer students passed the more difficult items. This shows me that the sequence in my learning progression was fairly accurate in reflecting increasing levels of sophistication as you move up the progression.

## ITEM DISCRIMINATION (1)

Discrimination index: the relationship between students' total test scores and their performance on a particular item.

| Type of Item | Proportion of Correct Responses on Total Test |
|---|---|
| Positive discriminator | High scorers > low scorers |
| Negative discriminator | High scorers < low scorers |
| Nondiscriminator | High scorers = low scorers |

## ITEM DISCRIMINATION (2)

Computing an item's discrimination:

1. Order the test papers from high to low by total score
2. Choose roughly the top 25% and the bottom 25% of these papers (e.g. if you have 25 students, you will choose about the top and bottom 6 students)
3. Calculate a p-value (item difficulty; see previous slides) for each of the high and low groups
4. Subtract p(low) from p(high) to obtain each item's discrimination index (D): D = p(high) - p(low)

## ITEM DISCRIMINATION (3)

Guidelines for evaluating the discriminating efficiency of items (Ebel & Frisbie, 1991):

| Discrimination Index | Item Evaluation |
|---|---|
| .40 and above | Very good items |
| .30 - .39 | Reasonably good items, but possibly subject to improvement |
| .20 - .29 | Marginal items, usually needing improvement |
| .19 and below | Poor items, to be rejected or improved by revision |

## WHAT ITEM DISCRIMINATION TELLS US

I evaluated how well each item on my assessment differentiated between high- and low-performing students by finding the difference in item difficulty between high- and low-performing groups.
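The four-step computation above can be sketched in Python (a hypothetical helper written for these notes, assuming each student's item scores are stored in a dict):

```python
def discrimination_index(students, item, fraction=0.25):
    """D = p(high) - p(low) for one 0/1-scored item.

    students: list of dicts mapping item name -> 0/1 score."""
    # 1. Order the papers from high to low by total score
    ranked = sorted(students, key=lambda s: sum(s.values()), reverse=True)
    # 2. Take roughly the top and bottom 25%
    k = max(1, round(len(ranked) * fraction))
    high, low = ranked[:k], ranked[-k:]
    # 3. Item difficulty (p) within each group
    def p(group):
        return sum(s[item] for s in group) / len(group)
    # 4. D = p(high) - p(low)
    return p(high) - p(low)

students = [
    {"q1": 1, "q2": 1, "q3": 1},  # total 3
    {"q1": 1, "q2": 1, "q3": 0},  # total 2
    {"q1": 0, "q2": 1, "q3": 0},  # total 1
    {"q1": 0, "q2": 0, "q3": 0},  # total 0
]
print(discrimination_index(students, "q1"))  # 1.0: a strong positive discriminator
```

A D near zero would mark q1 as a nondiscriminator, and a negative D as a negative discriminator, matching the table above.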

A good item should have an item discrimination value of .40 or above, which means that the item is useful in differentiating between students who are doing well overall and students who are having trouble.

## RELIABLE ITEM SCORING

- Selection of a set of responses
  - Random samples of student work, or
  - Deliberately selected from high-, medium-, and low-achieving groups of students
- Score by multiple people
- Check for exact matches, differences of one score, and differences of two or more
- Flag disagreements
- Discuss the results
- Evaluation of the effectiveness of scoring rubrics
- Training for additional scoring
- Evaluation of inter-rater reliability

## ASSESSMENT QUALITY

According to the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education Standards, assessment quality includes:

- Reliability
- Validity
- Fairness
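The double-scoring check described above (exact matches, one-score differences, larger gaps, flagged disagreements) can be sketched as a small helper; this is a hypothetical illustration, assuming two raters scored the same ordered set of responses:

```python
from collections import Counter

def rater_agreement(scores_a, scores_b):
    """Tally exact matches, one-point differences, and differences of
    two or more between two raters; flag the large disagreements."""
    counts = Counter()
    flagged = []
    for i, (a, b) in enumerate(zip(scores_a, scores_b)):
        gap = abs(a - b)
        if gap == 0:
            counts["exact"] += 1
        elif gap == 1:
            counts["off_by_one"] += 1
        else:
            counts["off_by_two_plus"] += 1
            flagged.append(i)  # response indices to discuss with the raters
    return counts, flagged

counts, flagged = rater_agreement([4, 3, 2, 4, 1], [4, 2, 4, 4, 1])
print(dict(counts), flagged)  # {'exact': 3, 'off_by_one': 1, 'off_by_two_plus': 1} [2]
```

The flagged indices are the responses worth discussing before evaluating the rubric or retraining scorers.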

## RELIABILITY

What is it? "Could the results be replicated if the same individuals were tested again under similar circumstances?" Reliability is the "consistency" or "repeatability" of students' assessment results or scores.

Factors influencing consistency:

- Content, or the particular sample of tasks included on the assessment (e.g. item selection, different test forms)
- Occasion on which the assessment is taken (e.g. test time, student's condition, classroom environment)

## VALIDITY

Definition: Does the test measure what it was designed to measure?

Five sources of validity evidence (AERA, APA, NCME, 2014):

- Based on instrument content (content validity)
- Based on response processes
- Based on internal structure (construct validity)
- Based on relationships with external variables
- Based on consequences

## EVIDENCE OF VALIDITY #1

Evidence based on instrument content:

- Do the items appear to measure the intended content?
- Are there items that measure all aspects of the content?
- Can come from expert judgments of items

- Often based on a test blueprint
  - Are all items associated with a part of the blueprint? Grain size may be larger or smaller (e.g. problem solving vs. finding patterns or explaining the correspondence between tables and graphs)
  - Are all intended sections represented by one or more items?

## LEARNING PROGRESSION-BASED TEST BLUEPRINT

[Blueprint grid: rows are LP levels 1-5; columns are cognitive-process levels (e.g. Remember) at rigor levels L1-L3]

- One chooses the Learning Progression (LP) levels that reflect the instructional plan for the unit
- One chooses the appropriate levels of cognitive rigor for the age and ability of students
- Each LP level is represented by the appropriate number of items
- Each chosen level of cognitive rigor is represented by the appropriate number of items
- Note: Not all LP levels will include all levels of cognitive rigor

## WHAT CONTENT VALIDITY TELLS US

I evaluated the content validity of my assessment using an assessment blueprint to make sure that all intended standards are measured at the intended level of complexity by my items, which provides me confidence that the items are an accurate representation of the concepts in the learning progression.

## EVIDENCE OF VALIDITY #2

Evidence based on response processes:

- Does the test format influence the results?
  - E.g. is a mathematical problem description so complex that reading level is also a factor in responses?
  - E.g. effects of test format: visual clutter or a difficult-to-read font; make sure to give enough space for students to write complete answers
- Often based on interviews of selected students
  - Cognitive interviews or think-alouds, in which students describe what they are thinking as they answer the items
  - Exit interviews or questionnaires, in which students reflect on their experience after completing the assessment

## TWO METHODS TO OBTAIN INFORMATION ABOUT RESPONSE PROCESSES

- Think-alouds: observing students who talk through their responses
- Exit interviews: asking students to reprise their performance after taking the instrument, and asking them about their experiences

## THINK-ALOUD RESULTS

## EXIT INTERVIEWS

Example questions:

- About how long did it take you to answer each question? Question _____: ___________ minutes Question _____: ___________ minutes
- What parts of the test are confusing?
- What makes the test hard to understand or answer?
- Did you go back and change any of your answers at any point during the test? If so, why?
- Why didn't you write shorter explanations?
- Why didn't you write longer explanations?
- If you were writing this test, how would you change it to make it a better test?

## WHAT RESPONSE PROCESSES TELL US

I evaluated the validity of my assessment using think-aloud interviews of several students, which provides me confidence that the students' responses to the items reflect what they know and can do, and were not influenced by the format, test conditions, or wording of the items, nor by misunderstanding of the questions.

## EVIDENCE OF VALIDITY #3

Evidence based on internal structure:

- Do the items show the expected difficulty order (i.e. are the items intended to be easy actually easy, and those intended to be difficult actually difficult)?
- Is each item consistent with the overall purpose of the assessment?
- Should all the items be added into a single score, or should they be kept as separate subscores?

- E.g. arithmetic accuracy and problem solving

## LOOKING AT INTERNAL STRUCTURE

Are item difficulties as expected?

| Difficulty | What I expected | What happened |
|---|---|---|
| Easy | 1, 3, 5 | 1, 2, 3 |
| Medium | 2, 4, 6, 8 | 4, 5, 6, 8 |
| Hard | 7, 9, 10 | 7, 9, 10 |

Actual difficulty order was similar to expectations.

Reminder: Difficulty = (Sum of student scores on the item for all students) / (Total number of students responding to the item)

## WHAT INTERNAL STRUCTURE TELLS US

I evaluated the validity of my assessment by comparing expected to actual item difficulty, which provides me confidence that my items were performing as I expected. Item difficulty was unlikely to have been influenced much by unintended factors (e.g. complex non-mathematical vocabulary for a math item).
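A comparison like the expected-vs-actual table can be scripted. The p-values and cut points below are hypothetical, chosen only to mirror the pattern in the table (items 2 and 5 land in unexpected bands); choose cut points that suit your own test:

```python
# Hypothetical item p-values (not data from the slides)
p_values = {1: .82, 2: .75, 3: .79, 4: .58, 5: .55,
            6: .52, 7: .31, 8: .48, 9: .28, 10: .22}

# Expected difficulty bands from the blueprint
expected = {1: "easy", 2: "medium", 3: "easy", 4: "medium", 5: "easy",
            6: "medium", 7: "hard", 8: "medium", 9: "hard", 10: "hard"}

def band(p, easy_cut=.70, hard_cut=.40):
    """Illustrative cut points only."""
    return "easy" if p >= easy_cut else ("medium" if p >= hard_cut else "hard")

mismatches = [item for item, p in p_values.items()
              if band(p) != expected[item]]
print(mismatches)  # [2, 5]: items whose actual band differs from expectation
```

Items flagged this way are candidates for a closer look at wording, placement in the learning progression, or unintended sources of difficulty.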

The next three slides relate to validity evidence that is most important in large-scale testing and less easily measured in a single school classroom. We will not be discussing means of collecting these types of validity evidence.

## EVIDENCE OF VALIDITY #4

Evidence based on relations to external variables:

- Does the test relate to external criteria it is expected to predict (e.g. other tests of similar content)?
- Is it less strongly related to criteria it would not be expected to predict? E.g. does a test of vocabulary relate more strongly to a reading test than to an arithmetic test? Does an emotional intelligence test relate more strongly to a personality assessment than to a test of academic skills?

## EVIDENCE OF VALIDITY #5

Evidence as to whether the assessment predicts as it should, e.g.:

- Does a college entrance exam actually predict success in college?
- Does an occupational skills test actually measure skills that will be needed on the job in question?

## EVIDENCE OF VALIDITY #6

Evidence based on consequences:

- Is the consequence of using the assessment results as expected?
- E.g. content represented on the test changing what is taught in classrooms, in undesirable ways. For example, not including geometry-based items in elementary school assessments might result in teachers choosing not to teach these concepts.
- Instructional consequences should be positive if the assessment method is valid and appropriate.

## BALANCING RELIABILITY AND VALIDITY

## FAIRNESS

- Consistency and unbiasedness
- Students' outcomes must not be influenced by the particular rater who scored their work
- Items must not unintentionally favor or disadvantage students from specific groups

A fair test provides scores that:

- Are interpreted and used appropriately for specific purposes
- Do not have adverse consequences as a result of the way they are interpreted/used

## EVIDENCE FOR FAIRNESS

- Reliable item scoring (see RELIABLE ITEM SCORING above) also gives us evidence for fairness
- Lack of item bias (also called differential item functioning) provides evidence for fairness
- Differential item functioning: Do two groups of

students, with otherwise equal proficiency, perform differently on an item?

- E.g. do girls and boys perform differently on an essay written in response to a novel about basketball players?
- E.g. do English language learners perform differently than native English speakers on math story problems?

## AN EXAMPLE OF ASSESSING ITEM FAIRNESS

| Item | % boys passing | % girls passing |
|---|---|---|
| 1 | .50 | .49 |
| 2 | .67 | .69 |
| 3 | .30 | .59 |
| 4 | .44 | .45 |
| 5 | .39 | .37 |
| 6 | .85 | .89 |
| 7 | .95 | .95 |
| 8 | .73 | .76 |
| 9 | .62 | .59 |
| 10 | .77 | .76 |

Many more girls than boys passed item 3; this item shows evidence of differential functioning by gender.
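Using the passing rates in the table, a crude screen can flag items with large group gaps. This is only a first pass, not a formal DIF statistic: a proper analysis (e.g. Mantel-Haenszel) conditions on overall proficiency, and the .10 threshold here is an arbitrary illustration:

```python
# Passing rates for items 1-10, from the table above
boys  = [.50, .67, .30, .44, .39, .85, .95, .73, .62, .77]
girls = [.49, .69, .59, .45, .37, .89, .95, .76, .59, .76]

flagged = [item for item, (b, g) in enumerate(zip(boys, girls), start=1)
           if abs(b - g) > .10]
print(flagged)  # [3]: only item 3 shows a large gender gap
```

Every other item differs by .04 or less between groups, which is why only item 3 is flagged for review.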

Groups don't always have to perform the same (e.g. ELL students may perform 10% worse than native speakers on all vocabulary items); as long as the difference between the groups is consistent, there is no differential functioning.

## WHAT FAIRNESS TELLS US

I evaluated the functioning of my items for both boys and girls, and found that all of my items performed equivalently for both groups. This gives me confidence that my test is fair with respect to gender.

## BIBLIOGRAPHY

- Nitko, A. J., & Brookhart, S. (2007). Educational assessment of students. Upper Saddle River, NJ: Pearson Education, Inc.
- McMillan, J. H. (2007). Classroom assessment: Principles and practice for effective standards-based instruction (4th ed.). Boston: Pearson-Allyn & Bacon.
- Oregon Department of Education. (2014, June). Assessment guidance.
- Popham, W. J. (2014). Criterion-referenced measurement: A half-century wasted? Paper presented at the Annual Meeting of the National Council on Measurement in Education, Philadelphia, PA.
- Popham, W. J. (2014). Classroom assessment: What teachers need to know. San Francisco, CA: Pearson.
- Russell, M. K., & Airasian, P. W. (2012). Classroom assessment: Concepts and applications. New York, NY: McGraw-Hill.
- Stevens, D., & Levi, A. (2005). Introduction to rubrics: An assessment tool to save grading time, convey effective feedback, and promote student learning. Sterling: Stylus Publishing, LLC.
- Wihardini, D. (2010). Assessment development II. Unpublished manuscript. Research and Development Department, Binus Business School, Jakarta, Indonesia.
- Wilson, M. (2005). Constructing measures: An item response modeling approach. New York: Psychology Press, Taylor & Francis Group.
- Wilson, M., & Sloane, K. (2000). From principles to practice: An embedded assessment system. Applied Measurement in Education, 13(2), 181-208.

## CREATIVE COMMONS LICENSE

Assessment Quality PPT by the Oregon Department of Education and Berkeley Evaluation and Assessment Research Center is licensed under a CC BY-NC-SA 4.0 license.

You are free to:

- Share: copy and redistribute the material in any medium or format
- Adapt: remix, transform, and build upon the material

Under the following terms:

- Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
- NonCommercial: You may not use the material for commercial purposes.
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Oregon Department of Education welcomes editing of these resources and would greatly appreciate being able to learn from the changes made. To share an edited version of this resource, please contact Cristen McLean, cristen.mclean@state.or.us.
