Lecture 2: Standards & Specifications for Assessments & Items
Jerome De Lisle, 2007; updated 2010

The Process of Test Development

After noting that test development begins with items, Embretson & Gorin (2001) pointed out that traditionally, item design is viewed primarily as an art and item specifications are vague: test developers "employ item specifications that often contain only general content considerations or vaguely described processing levels" (p. 343).

Test development is the process of creating an assessment in any format or mode. It requires 12 discrete steps falling into four categories: design, development, administration, and scoring & evaluation.

1. Overall plan
2. Content definition
3. Test specifications
4. Item development
5. Test design & assembly
6. Test production
7. Test administration
8. Scoring test responses
9. Standard setting
10. Score reporting
11. Item banking
12. Writing the technical report

Standards & Specifications: Definitions

STANDARD is a guideline or documentation that reflects

agreements on products, practices, or operations by nationally or internationally recognized industrial, professional, trade associations or governmental bodies. Note: This concept applies to formal, approved standards, as contrasted to de facto standards and proprietary standards,

which are exceptions to this concept.

STANDARD (second definition): An exact value, a physical entity, or an abstract concept, established and defined by authority, custom, or common consent to serve as a reference, model, or rule in measuring quantities or qualities, establishing practices or procedures, or evaluating results.

Types of Standards: Standards Governing Assessment Practice

The Standards for Educational and Psychological Testing govern assessment

practice and were developed jointly by the following organizations:

- American Educational Research Association (AERA)
- American Psychological Association (APA)
- National Council on Measurement in Education (NCME)

Earlier Work

There have been five earlier documents from these three sponsoring organizations providing standards to guide the development and use of educational and psychological tests:

- APA (1954) Technical Recommendations for Psychological Tests and Diagnostic Techniques
- AERA, NCME (1955) Technical Recommendations for Achievement Tests
- APA, AERA, NCME (1966) Standards for Educational and Psychological Tests & Manuals
- APA, AERA, NCME (1974) Standards for Educational and Psychological Tests
- APA, AERA, NCME (1985) Standards for Educational and Psychological Testing

The Latest Standards

The new Standards reflect changes in US federal law and measurement trends affecting validity; testing of individuals with disabilities or different linguistic backgrounds; and new types of tests as well as new uses of existing tests (APA Web site).

The 1999 document contains 264 standards (as opposed to 180 in the 1985 version). Most standards have explanatory text, as do the chapters in the text.

The Standards are written for the professional and for the educated layperson and address professional and technical issues of test development and use in education, psychology, and employment.

Revision of the 1999 Standards

The 1999 Standards are in the process of final revision and will be discussed at the 2011 AERA/NCME Meeting.

The Latest Standards: Organization & Content

Part I: Test Construction, Evaluation, & Documentation
Part II: Fairness in Testing
Part III: Testing Applications

Part I: Test Construction, Evaluation, and Documentation
- Validity
- Reliability and Errors of Measurement
- Test Development and Revision
- Scales, Norms, and Score Comparability
- Test Administration, Scoring, and Reporting
- Supporting Documentation for Tests

Part II: Fairness in Testing
- Fairness in Testing and Test Use
- The Rights and Responsibilities of Test Takers
- Testing Individuals of Diverse Linguistic Backgrounds
- Testing Individuals with Disabilities

Part III: Testing Applications
- The Responsibilities of Test Users
- Psychological Testing and Assessment
- Educational Testing and Assessment
- Testing in Employment and Credentialing
- Testing in Program Evaluation and Public Policy

Standards & Test Development
- Overall Plan: 1.1/3.2/3.9
- Content Definition: 1.6/3.2/3.11
- Test Specifications: 1.6/3.2/3.3-4
- Item Development: 3.6/3.7/3.17/7.2
- Test Design & Assembly: 3.7/3.8
- Test Production: NONE
- Test Administration: 3.18/3.19/3.20/3.21
- Scoring Test Responses: 3.6/3.22
- Standard Setting: 4.10/4.11/4.19/4.20
- Score Reporting: 8.13/11.6/11.12/11.15/13.19/15.10/15.11
- Item Banking: 6.4
- Writing Technical Report: 3.1/6.5

Other Current Standards for Assessment Practice

CODE OF FAIR TESTING PRACTICES IN EDUCATION

Prepared by the Joint Committee on Testing Practices. The Code of Fair Testing Practices in Education (Code) is a guide for professionals in fulfilling their obligation to provide and use tests that are fair to all test takers regardless of age, gender, disability, race, ethnicity, national origin, religion, sexual orientation,

linguistic background, or other personal characteristics.

2014 Revision of the Standards: Differences Between the 1999 Version and the New Version

In response to comments received at various stages of the revision process, the new version of the Standards includes updated material on such topics as educational accountability and technological advances in testing, and a reworking of chapters concerning workplace testing and credentialing.

The overall organization of the revised Standards is also different from that of the 1999 edition. The new version is separated into Foundations, Operations, and Testing Applications sections. The Foundations section focuses on fundamental testing issues such as validity, reliability, and fairness. The Operations section deals with operational testing issues such as test design and development; test administration, scoring, and reporting; and supporting documentation for tests. The Testing Applications section details specific applications in testing such as workplace testing and credentialing; educational testing and assessment; and the use of tests for program evaluation, policy studies, and accountability.

While several chapters in the 1999 edition addressed fairness, the new edition both expands that material and integrates it into a single foundations chapter. The joint committee decided on this approach after concluding that the issue of fairness was so fundamental to testing practice that it should be considered as a foundation of testing along with validity and reliability.

Members of the 2014 Joint Committee. (Top row, left to right):

Brian Gong, Laurie Wise, Fritz Drasgow, Michael Kolen, Denny Way, Paul Sackett, Frank Worrell. (Bottom row, left to right): Antonio E. Puente, Laura Hamilton, Barbara Plake, Joan Herman, Linda Cook, Nancy Tippins, Jo-Ida Hansen, Michael Kane.

Other Current Standards for Assessment Practice: CODE OF FAIR TESTING PRACTICES IN

EDUCATION Fairness is a primary consideration in all aspects of testing. Careful standardization of tests and administration conditions helps to ensure that all test takers are given a comparable opportunity to demonstrate what they know and how they can perform in the area being tested. Fairness implies that every test taker has the opportunity to prepare for the test and is informed about the general nature

and content of the test, as appropriate to the purpose of the test. Fairness also extends to the accurate reporting of individual and group test results. Fairness is not an isolated concept, but must be considered in all aspects of the testing process. Your Assignment

Writing a Purpose & Rationale for Your Test

The RATIONALE is (1) the fundamental reasons for the exercise, the basis of the project; (2) an exposition of principles or reasons. The PURPOSE is the object toward which one strives or for which something exists; an aim or a goal.

Writing a Purpose & Rationale for Your Test/Assessment: Questions to Ask
- What will the test do: formative, summative, or diagnostic?
- Why are you using these particular design specifications and items?
- Why did you make the choices you did in the design specifications, development, and implementation?
- How is the assessment aligned with the classroom system identified?

[Sample slides]

Specifications

Specifications are a detailed, exact statement of particulars, especially a statement prescribing materials,

dimensions, and quality of work for something to be built, installed, or manufactured.

Test Specifications

The 1999 Standards (Standard 1.6) encourage test writers to prepare a blueprint of the test, which would indicate the numbers of items or tasks and the content area being sampled. This device helps ensure content validity (appropriate and adequately sampled content).

Sample Question

A table of specifications (test blueprint) is most likely a component of the evidence for which ONE of the following aspects of test validity?
(A) content
(B) criterion
(C) predictive
(D) concurrent
(E) consequential

Sample Question

A medical student challenges his final-year examination results on the grounds that the examination measured knowledge of content that was not taught, necessary, or critical to good practice. What source of evidence must be considered by the judicial system in coming to a decision?
(A) item specifications
(B) test specifications
(C) performance standards
(D) definition of minimum competence
(E) scoring scheme and rubrics

Item & Test Specifications

What is a table of specifications for? A table of specifications or test blueprint is a basic step in planning an assessment.

It focuses not only on the chosen content, but also on the level of the questions written, ensuring alignment with course objectives. The table of specifications includes both content and process dimensions and describes what the student must know (rows) and what he or she is to do with that knowledge (columns). It may be constructed as a matrix, but other designs are possible. One important component of the table is the specification of the cognitive domain: Standards 13.3 and 1.8 emphasize that achievement tests must specify this information.

[SEA Sample slide]

Specifying the Cognitive Domain

The table of specifications is a useful tool

in item generation because it can reduce the tendency of writers to construct low-level questions for written papers. There should also be significant planning for performance assessments.

There is no need to use Bloom's taxonomy exclusively in specifying the cognitive domain. There are multiple classifications, the best (most widely applicable) being the three-component classification: knowledge, comprehension, and problem-solving. The problem-solving category (strategic and extended thinking) better represents

reality because it is difficult to distinguish tasks that elicit higher-order skills (above application). See Haladyna (2004). Example on next page for 8th Grade math.

Sample table of specifications (selection-type questions):

| # | Objectives / Content area | Knowledge | Comprehension | Application | % of test |
| 1 | Know the advantages & disadvantages of the major selection types of questions | 10 Q x 1 pt | | | 20% |
| 2 | Be able to differentiate between well and poorly written selection-type questions | 10 Q x 1 pt | 5 Q x 2 pts | | 40% |
| 3 | Be able to construct appropriate selection-type questions using the guidelines and rules that were presented in class | | | 4 Q x 5 pts | 40% |
| TOTAL | | 20 Q, 20 pts | 5 Q, 10 pts | 4 Q, 20 pts | 29 Q, 50 pts |

Sample assessment blueprint (English language):

Domain: Access to Information from Spoken Texts (20%)
- Relevant benchmarks: understand main ideas and supporting details in a text and use this knowledge as needed; locate relevant information for a specific purpose
- Number of tasks: 2
- Level of tasks: sentence level; text level
- Possible text types: advertisement (approximately one minute); conversation; instructions or directions; interview (approximately two minutes); messages; speech
- Possible item types / assessment criteria: chart; matching; multiple-choice; open-ended; sentence completion; sequence comprehension

Domain: Access to Information from Written Texts (60%)
- Relevant benchmarks: understand main ideas and supporting details in a text; identify explicit opinions and feelings; extract information from visual data, such as timetables and graphs; extract relevant information for specific purposes; understand structure and conventions of different text types
- Number of tasks: 3
- Level of tasks: sentence level; paragraph level; text level (approximately 300 words)
- Possible text types: advertisement; book jacket; comic strip; email; graph; instructions; message; notice; postcard; review; short expository text; timetable
- Possible item types / assessment criteria: matching; multiple-choice; open-ended; sentence completion; table or chart comprehension

Domain: Written Presentation (20%)
- Relevant benchmarks: describe people, places, things and events; react to the content of something read or seen; produce a short piece of coherent writing that conveys personal experiences
- Number of tasks: 2
- Level of tasks: paragraph level (25-35 words); paragraph level (50-70 words)
- Possible text types: description; giving opinion; friendly letter; message; note; postcard; report

Domain: Oral Social Interaction (to be administered in a random sample of classes)
- Relevant benchmarks: answer questions about familiar topics; express feelings, likes and dislikes; engage in short conversations
- Number of tasks: 1
- Possible task types: conversation (3-4 minutes); picture description (2-3 minutes)
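To make the arithmetic in a blueprint concrete, here is a minimal sketch (in Python) of a table of specifications as a data structure whose totals and weights can be checked automatically. The objective names are paraphrased from the selection-type example above; the structure itself is an illustrative assumption, not a prescribed format.

```python
# A minimal sketch of a table of specifications as a data structure,
# so the blueprint's totals and weights can be checked automatically.
# Objective names, question counts, and points echo the sample table above.

BLUEPRINT = {
    # objective: {cognitive_level: (number_of_questions, points_per_question)}
    "Know advantages & disadvantages of selection-type questions": {
        "knowledge": (10, 1)},
    "Differentiate well and poorly written selection-type questions": {
        "knowledge": (10, 1), "comprehension": (5, 2)},
    "Construct appropriate selection-type questions": {
        "application": (4, 5)},
}

def blueprint_totals(blueprint):
    """Compute questions, points, and percentage weight per objective."""
    grand_points = sum(n * p for cells in blueprint.values()
                       for n, p in cells.values())
    for objective, cells in blueprint.items():
        questions = sum(n for n, _ in cells.values())
        points = sum(n * p for n, p in cells.values())
        print(f"{objective}: {questions} Q, {points} pts "
              f"({100 * points / grand_points:.0f}% of test)")
    grand_questions = sum(n for cells in blueprint.values()
                          for n, _ in cells.values())
    print(f"TOTAL: {grand_questions} Q, {grand_points} pts")

blueprint_totals(BLUEPRINT)   # TOTAL: 29 Q, 50 pts
```

Running the sketch reproduces the totals row of the table above (29 questions, 50 points), which is exactly the kind of consistency check a blueprint makes possible.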

Item Specifications

While the test blueprint is a useful guide, the process can be made more explicit through item specifications. Item specifications are detailed and precise descriptions of what each item or task will measure on the test, how that item or task will be presented, and the range of allowable content.

Why item specifications? Some tasks and items are not well thought out, and students might respond in unexpected or unpredictable ways, as revealed by think-alouds administered after the examinations. Writing item specifications encourages the test constructor to think more deeply about what they intend to assess/measure and to explore various [alternative] ways of doing that.

The Role of Item Specs
- Serve as a guide to item writers
- Help in the preparation of item banks
- Tell teachers and students what is important to learn

The Anatomy of Item Specs
- A general description of what is to be measured (topic & objective)
- A description/set of rules for stating the problem
- A description or set of rules for how the student might respond
- A sample item
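As a sketch of how these four components might be captured in a structured form, consider the record below. The field names and sample content are illustrative assumptions, not a prescribed format.

```python
# A sketch of an item specification as a structured record.
# Field names and the sample content are illustrative only.

from dataclasses import dataclass

@dataclass
class ItemSpec:
    general_description: str  # what is to be measured (topic & objective)
    stimulus_rules: str       # rules for stating the problem
    response_rules: str       # rules for how the student might respond
    sample_item: str          # a model item that conforms to the rules

spec = ItemSpec(
    general_description="Identify the main idea of a short expository paragraph.",
    stimulus_rules="One paragraph of 60-90 words at grade-level vocabulary; "
                   "the stem asks only for the main idea.",
    response_rules="Four options: one correct summary; distractors are "
                   "details, over-generalizations, or contradictions.",
    sample_item="What is the main idea of the paragraph above? (A)-(D)",
)
print(spec.general_description)
```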

Bloom's taxonomy verb lists:

Knowledge: define, identify, indicate, know, label, list, memorize, name, recall, record, relate, repeat, select, underline

Comprehension: classify, describe, discuss, explain, express, identify, locate, paraphrase, recognize, report, restate, review, suggest, summarize, tell, translate

Application: apply, compute, construct, demonstrate, dramatize, employ, give examples, illustrate, interpret, investigate, operate, organize, practice, predict, schedule, shop, sketch, translate, use

Analysis: analyze, appraise, calculate, categorize, compare, contrast, criticize, debate, determine, diagram, differentiate, distinguish, examine, experiment, inspect, inventory, question, relate, solve

Synthesis: arrange, assemble, collect, compose, construct, create, design, formulate, manage, organize, perform, plan, prepare, produce, propose, set up

Evaluation: appraise, assess, choose, compare, contrast, decide, estimate, evaluate, grade, judge, measure, rate, revise, score, select, value
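The verb lists above can also be used mechanically, for example to flag the likely cognitive level of a draft item stem. Below is a minimal sketch with an abbreviated verb table; "likely" is deliberate, since a verb alone does not determine the cognitive demand of a task.

```python
# A sketch: use (abbreviated) Bloom verb lists to flag the likely
# cognitive level of an item stem. Verbs are taken from the lists above.

BLOOM_VERBS = {
    "knowledge": {"define", "identify", "label", "list", "name", "recall"},
    "comprehension": {"classify", "describe", "explain", "paraphrase", "summarize"},
    "application": {"apply", "compute", "demonstrate", "illustrate", "use"},
    "analysis": {"analyze", "compare", "contrast", "differentiate", "distinguish"},
    "synthesis": {"compose", "create", "design", "formulate", "plan"},
    "evaluation": {"appraise", "assess", "evaluate", "judge", "rate"},
}

def likely_levels(stem: str):
    """Return the Bloom levels whose verbs appear in an item stem."""
    words = set(stem.lower().split())
    return [level for level, verbs in BLOOM_VERBS.items() if words & verbs]

print(likely_levels("Compare and contrast the two selection formats."))
# -> ['analysis']
```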

How Good Is the Assessment? Standards for Judging the Quality of Assessments

How do we evaluate assessments? What standards do we use? These standards are found in the psychometric literature and documented in the Standards for Educational and Psychological Testing (current version released 2014).

The most important standard is validity, which assumes that the test is reliable or consistent. The third traditional standard is usability. Fairness is increasingly seen as one of the big three alongside validity and reliability (see the 2014 Standards).

Reliability

Reliability means consistency across administrations and users. Reliability is related to the number of items in a traditional test, but traditional indices, like Cronbach's alpha, may have little meaning for performance tasks, which are usually not generalizable.
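For a traditional item-scored test, Cronbach's alpha can be computed directly from an examinee-by-item score matrix. A minimal sketch, using invented toy data:

```python
# A minimal sketch of Cronbach's alpha for an item-scored test:
# alpha = k/(k-1) * (1 - sum(item variances) / variance of total scores)

def cronbach_alpha(scores):
    """scores: list of examinee rows, each a list of per-item scores."""
    k = len(scores[0])   # number of items

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Toy data: 4 examinees x 3 items, scored 0/1.
data = [[1, 1, 1], [1, 0, 1], [0, 0, 1], [0, 0, 0]]
print(round(cronbach_alpha(data), 2))   # -> 0.75
```

The k/(k-1) factor in the formula is one way to see the point above: other things being equal, alpha rises as the number of items increases.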

Reliability (definition): Test reliability refers to the degree to which a test is consistent and stable in measuring what it is intended to measure. Most simply put, a test is reliable if it is consistent within itself and across time.

Usability

Usability means practicability: the ability to easily use and implement the assessment. Cumbersome assessments are difficult to manage and create invalidity during administration.

Validity as the Most Important Standard?

The concept of validity has changed significantly over time, and current conceptualizations are difficult for many to grasp. Debates about test validity continue.

Current Conceptions of Validity

Validity is about what the score from a test means: how we intend to interpret those meanings and whether those interpretations are appropriate and in line with our original purpose.

Thinkers on Validity

[1982 photo] Michael Kane: an argument-based approach to validity.

The understanding of validity as a static, all-or-nothing attribute is no longer in vogue.

The "trinity" doctrine (construct, content, criterion) is also no longer accepted. Instead, some believe in the unitary concept of validity, in which construct is the whole of validity and different aspects are types of evidence for the whole.

The 1999 Standards on Validity

"The degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests" (AERA/APA/NCME, 1999). This new definition differs from the classic one, which talks about the test measuring what it is intended to measure; the focus here is on the test scores.

Changing Conceptions

The most important influences on the definitions of validity in the 1985 and 1999 standards come from the work of Lee J. Cronbach and Samuel Messick. Future work is likely to be based on the conceptualization of Michael Kane and his argument-based approach to validity and validation.

Messick's Definition

"An integrated evaluative judgment of the degree to which empirical evidence and theoretical rationale support the adequacy and appropriateness of inferences and actions based on test scores and other modes of assessment."

Understanding Samuel Messick

To understand Messick's writings, we must appreciate that a test can never directly measure a construct; we can only infer about that construct from the results on the test. Therefore we must collect evidence, as in a court of law, to prove the validity of our meanings and inferences from the test score.

Changing Conceptions of Validity: Unitary Concept
- [1940] The test
- [1966] Use of the test and test score
- [current] Inferences & decisions based on the score; consequences of use

Changing Conceptions of Validity
- [1950] Criterion-based validity
- [1960s] Content-based validity
- Construct validity
- Validity as argument: a statement of claims, an overall evaluation of intended interpretations and uses of scores, with construct validity as the whole

Validity Evidence?

Evidence for or against a degree of validity may come from:
- The structure of the test
- The content of the test
- The responses to the test
- Variables that are supposed to be related (or not) to the test: convergent (measure the same construct); discriminant (measure a different construct); criterion (an external criterion of importance), whether predictive or concurrent
- The consequences of test score use
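Convergent, discriminant, and criterion evidence are often summarized as correlations between the test score and other measures. A minimal sketch, with invented toy scores; the variable names and data are illustrative only:

```python
# A sketch of convergent/discriminant/criterion evidence as correlations.
# All score vectors below are invented toy data, not real results.

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

new_test   = [55, 62, 70, 78, 85, 91]        # the test being validated
same_trait = [50, 60, 72, 75, 88, 90]        # established measure, same construct
diff_trait = [80, 40, 75, 52, 66, 58]        # measure of an unrelated construct
later_gpa  = [2.1, 2.4, 2.9, 3.0, 3.5, 3.8]  # external criterion, one year later

print(f"convergent   r = {pearson_r(new_test, same_trait):+.2f}")  # expect high
print(f"discriminant r = {pearson_r(new_test, diff_trait):+.2f}")  # expect low
print(f"predictive   r = {pearson_r(new_test, later_gpa):+.2f}")   # criterion evidence
```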

Accumulating & Integrating Evidence: Components of a Validity Argument

Expanded Quality Criteria for Performance Assessments

- Cognitive complexity: the assessment task calls for complex intellectual activity such as problem solving, critical thinking, and reasoning
- Content quality: the assessment calls for students to demonstrate their knowledge of challenging and important subject matter
- Meaningfulness: the assessment tasks are worth students' time and students understand their value
- Language appropriateness: the language demands are clear and appropriate to the assessment tasks and to students
- Transfer and generalizability: successful performance on the assessment task allows valid generalizations about achievement to be made; indicates ability to successfully perform other tasks
- Fairness: student performance is measured in a way that does not give advantage to factors irrelevant to school learning; scoring schemes are similarly equitable
- Reliability: answers to assessment questions can be consistently trusted to represent what students know
- Consequences: the assessment has the desired effects on children, teachers, and the educational system

See http://www.gower.k12.il.us/Staff/ASSESS/4_ch2.htm and http://www.ed.gov/pubs/IASA/newsletters/assess/pt1.html

Sample Question

What is the current meaning of validity? Explain how the concept of validity has

changed over the last fifty years.

"It is important to remember, however, that no test is valid for all purposes. Indeed, tests vary in their intended uses and in their ability to provide meaningful assessments of student learning. Therefore, while the goal of using large-scale testing to measure and improve student and school system performance is laudable, it is also critical that such tests are sound, are scored properly, and are used appropriately.

Some public officials and educational administrators are increasingly calling for the use of tests to make high-stakes decisions, such as whether a student will move on to the next grade level or receive a diploma. School officials using such tests must ensure that students are tested on a curriculum they have had a fair opportunity to learn, so that certain subgroups of students, such as racial and ethnic minority students or students with a disability or limited English proficiency, are not systematically excluded or disadvantaged by the test or the test-taking conditions. Furthermore, high-stakes decisions should not be made on the basis of a single test score, because a single test can only provide a 'snapshot' of student achievement and may not accurately reflect an entire year's worth of student progress and achievement." (APA web site)

Sample Question

Based on the above, comment on some likely validity issues in (a) the use of national achievement tests and (b) placement tests at Eleven Plus. Suggest workable strategies that might ameliorate the issues identified above.

Many samples from:

- http://title3.sde.state.ok.us/studentassessment/testingmaterials.htm
- http://nces.ed.gov/TIMSS/Educators.asp
- Problem-solving questions at http://www.csd99.k12.il.us/north/library/links2004/math/problemsolving.htm

Presentation, Assembly, & Administration

There are different presentation rules for the different performance and written examinations. Some constructed-response items require that the item developer

specify the number of lines for the response. Context-dependent MCQs should be completed on the same page. Double columns are useful. All figures and tables should be

identified. Diagrams for all formats should be clear, well formatted, and applicable.

Task 1

Goal: Determine the pattern of gumballs that come out of the broken gumball machine.
Role: You work for the Chicklet Gumball Company as a vending machine technician.
Audience: The manager of Walmart in Waterford, Connecticut.
Situation: The gumball machine has been giving out extra gumballs with each additional penny. Children are excited about this and swarm around the machine in hopes of getting extra gumballs for their money. This consistent crowd is causing a safety hazard, so the manager would like you to repair the broken machine. The broken machine is also getting to be very expensive.
Product Performance and Purpose: Before you can reprogram the machine, you need to determine the pattern in which it is dispensing gumballs and record it on a chart. Write a rule or an equation to describe the pattern. Once you figure out the pattern, you need to test it out by seeing what would happen if a person puts in 5 pennies, 10 pennies, or even 100 pennies!

Standards and Criteria for Success: Scoring Rubric

Score of 3: Meets or exceeds the expectations of the task. Demonstrates a high level of understanding. The student achieves correct solutions to each part of the task. The math language and representations are accurate and appropriate. The student creates a rule for finding any number of pennies. All work is shown and labeled.
Score of 2: Partially meets the expectations of the task. Demonstrates some understanding. The student achieves correct solutions to most parts of the task. Work is clear and labeled. Some math language is used.
Score of 1: Does not meet the objectives of the task. Demonstrates poor or incorrect understanding. A partial solution is achieved. The math representation is labeled, but is not organized to show what strategy was used for solving the task.
Score of 0: Shows no understanding of the problem or how to arrive at a solution. There are no solutions shown. No math language is used. Math language and representations are inaccurate and inappropriate.
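As a worked illustration of the kind of rule students might produce, here is a sketch under one hypothetical reading of the task: the broken machine dispenses one extra gumball with each additional penny (1, then 2 more, then 3 more, ...), so the cumulative total is the triangular number n(n+1)/2. The actual pattern is left open by the task itself.

```python
# A worked sketch of the gumball task under an assumed (hypothetical) rule:
# the nth penny dispenses n gumballs, so totals are triangular numbers.

def total_gumballs(pennies: int) -> int:
    """Cumulative gumballs after n pennies under the assumed rule: n(n+1)/2."""
    return pennies * (pennies + 1) // 2

for n in (1, 2, 3, 5, 10, 100):
    print(f"{n:>3} pennies -> {total_gumballs(n)} gumballs")
# 5 pennies -> 15, 10 pennies -> 55, 100 pennies -> 5050
```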
