Evidence-based Evaluation of English Learners: A Contemporary Approach to Testing
Oklahoma School Psychologists Association, Tulsa, OK
March 30, 2018
Samuel O. Ortiz, Ph.D., St. John's University

The Testing of Bilinguals: Early Influences and a Lasting Legacy

It was believed that:
- speaking English and familiarity with and knowledge of U.S. culture had no bearing on intelligence test performance
- intelligence was genetic, innate, static, immutable, and largely unalterable by experience, opportunity, or environment
- being bilingual resulted in a mental handicap that was measured by poor performance on intelligence tests, which thus substantiated its detrimental influence

Much of this language and legacy thinking remains embedded in present-day tests: current classification labels (Very Superior, Superior, High Average, Average, Low Average, Borderline, Deficient) evolved from earlier terms such as Precocious, Normal, Borderline, Moron, Imbecile, and Idiot.

The Testing of Bilinguals: Early Influences and a Lasting Legacy
- H. H. Goddard and the menace of the "feeble-minded": the testing of newly arrived immigrants at Ellis Island
- Lewis Terman and the Stanford-Binet: America gives birth to the IQ test of inherited intelligence
- Robert Yerkes and mass mental testing: emergence of the bilingual/ethnic-minority handicap

Early Research Foundations for ELL Evaluation

[Figure: Mean Mental Age (MA) from the Binet Scales in a non-native English-speaking sample, from Yerkes's data as analyzed by C. C. Brigham (Yerkes, 1921). Mean Stanford-Binet Mental Age declines across successive immigration cohorts: 13.74, 13.50, 13.08, 12.53, 11.70, 11.29.]

- Average score for native English speakers on Beta = 101.6 (Very Superior; Grade A)
- Average score for non-native English speakers on Beta = 77.8 (Average; Grade C)

Bilingualism and Testing

Interpretation: New immigrants are inferior

"Instead of considering that our curve indicates a growth of intelligence with increasing length of residence, we are forced to take the reverse of the picture and accept the hypothesis that the curve indicates a gradual deterioration in the class of immigrants examined in the army, who came to this country in each succeeding 5 year period since 1902… The average intelligence of succeeding waves of immigration has become progressively lower." - Brigham, 1923

The NASP Practice Model: Nondiscriminatory Assessment Options

Options: Records Review; Work Samples; Portfolios; Interviews (parent/teacher); Observations (classroom); Progress-Monitoring Data; Curriculum-Based Measurement; Dynamic Assessment; State-mandated Test Scores; Language Proficiency Test Scores

ADVANTAGES
- Measures what the student has actually been taught
- Permits examination of learning and progress
- Information is more relevant to instruction/intervention
- Relies on a wide and diverse range of data/information
- Avoids dealing with test validity issues

DISADVANTAGES
- Standard for true peer comparison remains problematic
- Not many training programs teach dynamic assessment
- Eligibility criteria may be difficult to demonstrate
- Difficult to ascertain "average" unless local norms can be used

- Requires significant foundational knowledge of issues

Individualized Test Data

ADVANTAGES
- Measures what the student can actually do compared to others
- Eligibility criteria are easier to demonstrate with numbers

DISADVANTAGES
- Standard for true peer comparison remains problematic
- Requires significant foundational knowledge of issues

Nondiscriminatory Assessment Framework
I. Assess for the purpose of intervention
II. Assess initially with authentic and alternative procedures
III. Assess and evaluate the learning ecology
IV. Assess and evaluate language proficiency
V. Assess and evaluate opportunity for learning
VI. Assess and evaluate relevant cultural and linguistic factors
VII. Evaluate, revise, and re-test hypotheses
VIII. Determine the need for and language(s) of formal assessment
IX. Reduce potential bias in traditional assessment practices
X. Support conclusions via data convergence and multiple indicators

Pre-referral procedures (I.-VIII.) address concerns regarding fairness and equity in the assessment process.
Post-referral procedures (IX.-X.)

Addresses possible bias in the use of test scores.

The Provision of School Psychological Services to Bilingual Students

This document, adopted in 2015, represents the very first official position by NASP on school psychology services to bilingual students. It serves as official policy of NASP and is applicable to ALL school psychologists, whether or not they are bilingual themselves.

Fundamental Requirements for Evaluation

According to the NASP Position Statement: "NASP promotes the standards set by the Individuals with Disabilities Education Improvement Act (IDEA, 2004) that require the use of reliable and valid assessment tools and procedures" (p. 2; emphasis added). NASP (2015). Position Statement: The Provision of School Psychological Services to Bilingual Students. Retrieved from http://www.nasponline.org/x32086.xml

What's the Problem with Tests and Testing with ELs?

For native English speakers, growth of cognitive abilities and knowledge acquisition is tied closely to age and assumes normal educational experiences. Thus, age-based norms effectively control for variation in development and provide an appropriate basis for comparison. However, this is not true for English learners, who may neither live in a mainstream culture nor benefit to an equivalent degree from

formal education as native English speakers.

Development Varies by Experience, Not Necessarily by Race or Ethnicity

"The key consideration in distinguishing between a difference and a disorder is whether the child's performance differs significantly from peers with similar experiences." (p. 105) - Wolfram, Adger & Christian, 1999

For ELs, the Problem is Test Score Validity

NO BIAS has generally been found in:
- Items (content, novelty)
- Structure (sequence, order, difficulty)
- Reliability (measurement error/accuracy)
- Factor structure (theoretical structure, relationship of variables to each other)
- Predictive validity (correlation with academic success or achievement)
- Differential item functioning (DIF is not often found)

POTENTIAL BIAS remains in:
- Construct validity (nature and specificity of the intended/measured constructs): even when the intended variable is measured, inferences and interpretation may not be valid if comparability in development is lacking
- Interpretive invalidity: it can undermine the validity of evaluative judgments and the meaning assigned to scores

"As long as tests do not at least sample in equal degree a state of saturation [assimilation of fundamental experiences and activities] that is equal for the norm children and the particular bilingual child, it cannot be assumed that the test is a valid one for the child." - Sanchez, 1934

Test Score Validity and Defensible Interpretation Requires True Peer Comparison

Example of potential construct invalidity: A test designed to measure visual processing (Gv) in ELs must avoid overreliance on language ability (Gc), or else measurement of visual processing may be confounded with language ability. Compare the instruction "Using these blocks, arrange them together in the correct manner so they are identical to this picture" with the far more language-laden "After putting a blue block on top of a purple one, put the green block on the blue one."

Example of potential interpretive invalidity: A test designed to measure English language ability (Gc) is valid for ELs' ability in English, but poor performance cannot be ascribed to a potential disability unless developmental differences in English have been controlled.

Main Threats to Test Score Validity for ELLs

Acculturative knowledge acquisition, not race or ethnicity:
"When a child's general background experiences differ from those of the children on whom a test was standardized, then the use of the norms of that test as an index for evaluating that child's current performance or for predicting future performances may be inappropriate." - Salvia & Ysseldyke, 1991

Developmental language proficiency, not language dominance:
"Most studies compare the performance of students from different ethnic groups…rather than ELL and non-ELL children within those ethnic groups.…A major difficulty with all of these studies is that the category Hispanic includes students from diverse cultural backgrounds with markedly different English-language skills.…This reinforces the need to separate the influences of ethnicity and ELL status on observed score differences." - Lohman, Korb & Lakin, 2008

Processes and Procedures for Addressing Test Score Validity

IX. REDUCE BIAS IN TRADITIONAL TESTING PRACTICES

Exactly how is evidence-based, nondiscriminatory assessment conducted, and to what extent is there any research to support the use of any of these methods as being capable of establishing sufficient validity of the obtained results?

- Modified Methods of Evaluation: modified and altered assessment
- Nonverbal Methods of Evaluation: language-reduced assessment
- Dominant Language Evaluation (L1): native language assessment
- Dominant Language Evaluation (L2): English language assessment

Processes and Procedures for Addressing Test Score Validity

Modified and Altered Assessment

According to the NASP Position Statement: "monolingual school psychologists will require training in the use of interpreters in all aspects of the assessment process, as well as an awareness of the complexity of issues that may be associated with reliance on interpreters" (p. 2).

ISSUES IN MODIFIED METHODS OF EVALUATION

Modified and Altered Assessment:
- Use of a translator/interpreter for administration helps overcome the language barrier, but it is also a violation of standardization and undermines score validity, even when the interpreter is highly trained and experienced; tests are not usually normed in this manner.
- In efforts to help the examinee perform to the best of his/her ability, any process involving "testing the limits" in which there is alteration or modification of test items or content, mediation of task concepts prior to administration, repetition of instructions, acceptance of responses in either language, or elimination/modification of time constraints, etc., violates standardization even when permitted by the test publisher, except in cases where separate norms for such altered administration are provided.
- Any alteration of the testing process violates standardization and effectively invalidates the scores, which precludes interpretation or the assignment of meaning by undermining the psychometric properties of the test.
- Alterations or modifications are perhaps most useful in deriving qualitative information: observing behavior, evaluating learning propensity, evaluating developmental capabilities, analyzing errors, etc.
- A recommended procedure would be to administer tests in a standardized manner first, which will potentially allow for later interpretation, and then consider any modifications or alterations that will further inform the referral questions.
- Because the violation of the standardized test protocol introduces error into the testing process, it cannot be determined to what extent the procedures aided or hindered performance, and thus the results cannot be defended as valid.

Processes and Procedures for Addressing Test Score Validity

Nonverbal Assessment

According to the NASP Position Statement: "the use of nonverbal tools or native language instruments are not automatic guarantees of reliable and valid data. Nonverbal tests rely on some form of effective communication between examiner and examinee, and may be as culturally loaded as verbal tests, thus limiting the validity of evaluation results" (p. 2).

ISSUES IN NONVERBAL METHODS OF EVALUATION

Language-Reduced Assessment (nonverbal testing): use of language-reduced (or "nonverbal") tests is helpful in overcoming the language obstacle; however:
- It is impossible to administer a test without some type of communication occurring between examinee and examiner; this is the purpose of gestures/pantomime.
- Some tests remain very culturally embedded; they do not become culture-free simply because language is not required for responding.
- Construct underrepresentation is common, especially on tests that measure fluid reasoning (Gf); when viewed within the context of CHC theory, some batteries measure a narrower range of broad cognitive abilities/processes, particularly those related to verbal academic skills such as reading and writing (e.g., Ga and Gc) and mathematics (Gq).
- All nonverbal tests are subject to the same problems with norms and cultural content as verbal tests; that is, they do not control for differences in acculturation and language proficiency, which may still affect performance, albeit less than with verbal tests.
- Language-reduced tests are helpful in evaluation of diverse individuals and may provide better estimates of true functioning in certain areas, but they are not a whole or completely satisfactory solution with respect to fairness, and they provide no mechanism for establishing whether the obtained test results are valid or not.

Processes and Procedures for Addressing Test Score Validity

Native Language Assessment (L1)

According to the NASP Position Statement: NASP supports the rights of bilingual students who are referred for a psychoeducational evaluation to be assessed in their native languages

when such evaluation will provide the most useful data to inform interventions… Furthermore, the norms for native language tests may not represent the types of ELLs typically found in U.S. schools, and very limited research exists on how U.S. bilingual students perform on tests in their native language as opposed to English (p. 2).

Processes and Procedures for Addressing Test Score Validity

ISSUES IN DOMINANT LANGUAGE EVALUATION: Native Language

Native Language Assessment (L1): generally refers to the assessment of bilinguals by a bilingual psychologist who has determined that the examinee is more proficient (dominant) in their native language than in English.
- Being dominant in the native language does not imply age-appropriate development in that language, or that formal instruction has been in the native language, or that both the development and the formal instruction have remained uninterrupted in that language.
- Although the bilingual psychologist is able to conduct assessment activities in the native language, this option is not directly available to the monolingual psychologist.
- Native language assessment is a relatively new idea and an unexplored research area, so there is very little empirical support to guide appropriate activities or upon which to base standards of practice or evaluate test performance.
- Whether a test evaluates only in the native language or some combination of the native language and English (i.e., presumably bilingual), the norm samples may not provide adequate representation, or any at all, on the critical variables (language proficiency and acculturative experiences); bilinguals in the U.S.
are not the same as monolinguals elsewhere.
- Without a research base, there is no way to evaluate the validity of the obtained test results, and any subsequent interpretations would be specious and amount to no more than a guess.

ELL Test Performance: Esparza Brown Study

Comparison of Order of Means for WJ III and Bateria III Classifications*

WJ III Classifications           Bateria III Classifications (NLD)   Bateria III Classifications (ELD)
Mean  Subtest                    Mean  Subtest                       Mean  Subtest
98    Gv Visual Processing       111   Ga Auditory Processing        107   Ga Auditory Processing
95    Gs Processing Speed        102   Gv Visual Processing          103   Gv Visual Processing
95    Gsm Short-Term Memory      99    Gs Processing Speed           95    Gs Processing Speed
92    Gf Fluid Reasoning         95    Gf Fluid Reasoning            95    Gf Fluid Reasoning
89    Ga Auditory Processing     90    Glr Long-Term Memory          82    Gsm Short-Term Memory
89    Glr Long-Term Memory       88    Gsm Short-Term Memory         77    Glr Long-Term Memory
85    Gc Crystallized Knowledge  85    Gc Crystallized Knowledge     73    Gc Crystallized Knowledge

*Source: Esparza Brown, J. (2008). The use and interpretation of the Bateria III with U.S. Bilinguals. Unpublished dissertation, Portland State University, Portland, OR.

ELL Test Performance: Esparza Brown Study

[Figure: Comparison of Bateria III cluster means (GIA, Ga, Gv, Gs, Gf, Gsm, Glr, Gc) for ELLs receiving native language instruction versus English language instruction, plotted against the norm sample; vertical axis spans standard scores of roughly 60 to 110.]

*Source: Esparza Brown, J. (2008). The use and interpretation of the Bateria III with U.S. Bilinguals. Unpublished dissertation, Portland State University, Portland, OR.

Processes and Procedures for Addressing Test Score Validity

English Language Assessment (L2)

According to the NASP Position Statement: "monolingual, English-speaking school psychologists will likely conduct the vast majority of evaluations with bilingual students. Therefore, proper training in the requisite knowledge and skills for culturally and linguistically responsive assessment is necessary for all school psychologists" (p. 2; emphasis added).

Processes and Procedures for Addressing Test Score Validity

ISSUES IN DOMINANT LANGUAGE EVALUATION: English

English Language Assessment (L2): generally refers to the assessment of bilinguals by a monolingual psychologist who has determined that the examinee is more proficient (dominant) in English than in their native language, or without regard to the native language at all.
- Being dominant in English does not imply age-appropriate development in that language, or that formal instruction has been in that language, or that both the development and the formal instruction have remained uninterrupted in that language.
- Does not require that the evaluator speak the language of the child, but does require competency, training, and knowledge in nondiscriminatory assessment, including the manner in which cultural and linguistic factors affect test performance.
- Evaluation conducted in English is a very old idea and a well-explored research area, so there is a great deal of empirical support to guide appropriate activities and upon which to base standards of practice and evaluate test performance.
- The greatest concern when testing in English is that the norm samples of the tests may not provide adequate representation, or any at all, on the critical variables (language proficiency and acculturative experiences); dominant English-speaking ELLs in the U.S. are not the same as monolingual English speakers in the U.S.
- With an extensive research base, the validity of the obtained test results may be evaluated (e.g., via use of the Culture-Language Interpretive Matrix), permitting defensible interpretation and assignment of meaning to the results.

Current Approaches Fail to Establish Test Score Validity

Evaluation criteria considered for each method:
- Norm sample representative of bilingual development
- Measures a wider range of school-related abilities
- Does not require the evaluator to be bilingual
- Adheres to the test's standardized protocol
- Substantial research base on bilingual performance
- Sufficient to identify or diagnose disability
- Accounts for variation in bilingual development
- Most likely to yield reliable and valid data and information
- Provides extensive data regarding development

Methods evaluated: Modified or Altered Assessment; Language-Reduced Assessment; Dominant Language Assessment in L1 (native only); Dominant Language Assessment in L2 (English only).

All approaches are limited in some manner when addressing test score validity: none is sufficient to diagnose a disability, none accounts for variation in bilingual development, none automatically yields reliable and valid results, and none provides extensive data regarding cognitive and school-based learning and development.

Test Score Validity and Defensible Interpretation Requires True Peer Comparison

For native English speakers, growth of language-related abilities is tied closely to age because the process of learning a language begins at birth and is fostered by formal schooling. Thus, age-based norms effectively control for variation in development and provide an appropriate basis for comparison. However, this is not true for English learners, who may begin learning English at various points after birth and who may receive vastly different types of formal education from each other.

Development Varies by Exposure to English, Not Dominance

"It is unlikely that a second-grade English learner at the early intermediate phase of language development is going to have the same achievement profile as the native English-speaking classmate sitting next to her. The norms established to measure fluency, for instance, are not able to account for the language development differences between the two girls. A second analysis of the student's progress compared to linguistically similar students is warranted." (p. 40) - Fisher & Fry, 2012

The validity of an interpretation regarding disability requires an unbiased standard for comparison. Compared to this group, Chaseito's score is at the 9th percentile rank. Compared to this group, Panchito's score is at the 1st percentile rank.

[Figure: RED LINE = distribution of scores for native English student performance, a normal curve spanning -3SD to +3SD with percentile ranks of <1, 2, 16, 50, 84, 98, and >99 at each SD marker.]

Using an inappropriate comparison group makes it appear, incorrectly, that both Chaseito and Panchito may have some type of disability.

The validity of an interpretation regarding disability requires an unbiased standard for comparison.
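The percentile ranks at each SD marker follow directly from the normal curve, and so does the shift in meaning when the comparison group changes. A minimal sketch (assuming the usual standard score metric of mean 100 and SD 15; the score of 85 and the true-peer mean of 85 are hypothetical illustration values, not data from the slides):

```python
import math

def percentile_rank(score, mean=100.0, sd=15.0):
    """Percentile rank of a standard score under a normal distribution."""
    z = (score - mean) / sd
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Against native-English norms (mean 100), a score of 85 sits 1 SD below
# the mean, i.e., at roughly the 16th percentile.
print(round(percentile_rank(85)))              # 16
# Against a hypothetical true-peer mean of 85, the same score is average.
print(round(percentile_rank(85, mean=85.0)))   # 50
```

This is the same arithmetic behind the <1, 2, 16, 50, 84, 98, and >99 labels at -3SD through +3SD: an unchanged performance can move from "well below average" to "average" solely by switching to a comparison distribution that reflects true peers.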

Compared to this group, Panchito's score is still likely to be low even if he is receiving L1 instruction. Compared to this group, Chaseito's score is still likely to be low even if he is receiving L1 instruction.

[Figure: GREEN LINE = distribution of scores for native Spanish student performance, a normal curve spanning -3SD to +3SD with percentile ranks of <1, 2, 16, 50, 84, 98, and >99 at each SD marker.]

Use of a native-language group remains an inappropriate comparison and continues to make it appear, incorrectly, that both Chaseito and Panchito have some type of disability.

The validity of an interpretation regarding disability requires an unbiased standard for comparison. Compared to a true peer group, Chaseito's score is at the 46th percentile rank; compared to a true peer group, Panchito's score is at the 9th percentile rank.

[Figure: BLUE = distribution of scores for ELL student performance; PURPLE = distribution of scores for native English or native Spanish student performance. Two overlapping normal curves, each spanning -3SD to +3SD with percentile ranks of <1, 2, 16, 50, 84, 98, and >99, with the ELL distribution shifted lower.]

Use of a true peer group provides a nondiscriminatory comparison and suggests that Chaseito's performance is average and that only Panchito might have some type of disability.

The validity of an interpretation regarding disability requires an unbiased standard for comparison. Whatever method or approach may be employed in the evaluation of ELLs, the fundamental obstacle to nondiscriminatory interpretation rests on the degree to which the examiner is able to defend the claims of test score construct validity that are being used to support diagnostic conclusions. This idea is captured by, and commonly referred to as, the question of: DIFFERENCE vs. DISORDER?

Simply absolving oneself of the responsibility of establishing test score validity, for example via wording such as "all scores should be interpreted with extreme caution," does not in any way provide a defensible argument regarding the validity of obtained test results and does not permit valid diagnostic inferences or conclusions to be drawn from them. The only manner in which test score validity can be evaluated or established to a degree that permits valid and defensible diagnostic inferences with ELLs is to use a comparison standard that represents true peers.

Evidence-Based Assessment

According to the APA Task Force on Evidence-Based Practice in Psychology (EBPP), evidence-based practice is defined as "the integration of the best available research with clinical expertise in the context of patient characteristics, culture, and preferences" (p. 273). Evidence-based practice within the context of psychoeducational evaluation has never gone much beyond an over-reliance on the validity of standardized tests. But without inherently fair norm samples, the only recourse for individual practitioners is to apply

research on the use of standardized tests with English learners. This becomes, in effect, evidence-based assessment.

Source: American Psychological Association (2006). Evidence-Based Practice in Psychology. American Psychologist, pp. 271-285.

Summary of Research on the Test Performance of English Language Learners

Research conducted over the past 100 years on ELLs who are non-disabled, of average ability, possess moderate to high proficiency in English, and are tested in English has resulted in two robust and ubiquitous findings:
1. Native English speakers perform better than English learners at the broad ability level (e.g., FSIQ) on standardized, norm-referenced tests of intelligence and general cognitive ability.
2. English learners tend to perform significantly better on nonverbal-type tests than they do on verbal tests (e.g., PIQ vs. VIQ).

So what explains these findings? Early explanations relied on genetic differences attributed to race, even when data strongly indicated that the test performance of ELLs was moderated by the degree to which a given test relied on or required age- or grade-expected development in English and the acquisition of incidental acculturative knowledge.

Research Foundations for EL Evaluation

Principle 1: Native English speakers perform better than ELs at the broad ability level.

[Figure: Mean FSIQ by group sample (vertical axis roughly 75 to 105) for four groups: the 2013 Styck & Watkins non-EL standardization sample, the 2014 Styck & Watkins non-EL referred-but-not-eligible sample, the 2013 Styck & Watkins EL (with disability) sample, and the 2014 Styck & Watkins EL (with disability) sample.]

Styck, K. M. & Watkins, M. W. (2013). Diagnostic Utility of the Culture-Language Interpretive Matrix for the Wechsler Intelligence Scales for Children-Fourth Edition Among Referred Students. School Psychology Review, 42(4), 367-382.

Research Foundations for EL Evaluation

Principle 2: ELs perform better on nonverbal tests than verbal tests.

[Figure: Mean WISC-IV index scores (VCI, PRI, WMI, PSI; vertical axis roughly 80 to 100) for the same four non-EL and EL group samples.]

Styck, K. M. & Watkins, M. W. (2013). Diagnostic Utility of the Culture-Language Interpretive Matrix for the Wechsler Intelligence Scales for Children-Fourth Edition Among Referred Students. School Psychology Review, 42(4), 367-382.

Research Foundations for ELL Evaluation

Historical and contemporary research has tended to ignore the fact that ELLs do not perform at the same level on ALL nonverbal tests any more than they perform at the same level on ALL verbal tests. Instead, it appears that test performance of ELLs forms not a dichotomy but a continuum: a linear, not dichotomous, attenuation of performance. This means a third principle is evident in the body of research on ELLs but has not been well understood or utilized in understanding test performance:

3. Test performance of ELLs is moderated by the degree to which a given test relies on or requires age- or grade-expected English language development and the acquisition of incidental acculturative knowledge.

Research Foundations for ELL Evaluation

EL test performance is a linear, continuous pattern, not a dichotomy. The more a test requires age-based developmental language proficiency and acculturative knowledge, the greater the effect on test performance.

[Figure: Expected EL performance plotted against the developmental linguistic and knowledge requirements of a test (low, moderate, high). For ELs, tests that require little or no age-based acquisition of language and cultural knowledge yield scores at or close to the normative mean (SS = 100); tests that require full or high levels of age-based acquisition of language and cultural knowledge yield scores much lower than the normative mean (declining through roughly SS = 95, 90, 85, 80).]

Research Foundations for ELL Evaluation

Principle 3: ELL performance is moderated by linguistic/acculturative variables.

[Figure: Mean WISC-IV subtest scores (vertical axis roughly 80 to 100) for the same non-EL and EL group samples, across subtests ordered by language demand: Picture Concepts, Matrix Reasoning, Symbol Search, Block Design, Coding, Comprehension, Letter-Number Sequencing, Similarities, Digit Span, Vocabulary.]

Styck, K. M. & Watkins, M. W. (2013). Diagnostic Utility of the Culture-Language Interpretive Matrix for the Wechsler Intelligence Scales for Children-Fourth Edition Among Referred Students. School Psychology Review, 42(4), 367-382.

Research Foundations for ELL Evaluation

Principle 3: ELL performance is moderated by linguistic/acculturative variables.

Mean subtest scaled scores (SS) for four samples:

Subtest              Hispanic Group   Hispanic Group               ESL Group        Bilingual Group
                     (Mercer, 1972)   (Vukovich & Figueroa, 1982)  (Cummins, 1982)  (Nieves-Brull, 2006)
Information          7.5              7.8                          5.1              7.2
Vocabulary           8.0              8.3                          6.1              7.5
Similarities         7.6              8.8                          6.4              8.2
Comprehension        7.8              9.0                          6.7              8.0
Digit Span           8.3              8.5                          7.3              *
Arithmetic           8.7              9.4                          7.4              7.8
Picture Arrangement  9.0              10.3                         8.0              9.2
Block Design         9.5              10.8                         8.0              9.4
Object Assembly      9.6              10.7                         8.4              9.3
Picture Completion   9.7              9.9                          8.7              9.5
Coding               9.6              10.9                         8.9              9.6

*Data for this subtest were not reported in the study.

Research Foundations for ELL Evaluation

Principle 3: ELL performance is moderated by linguistic/acculturative variables.

[Figure: Mean subtest scores across WISC subtests for the 2013 Styck & Watkins (n=86), 2014 Styck & Watkins (n=69), and 1972 Mercer (n=690) samples (additional reported ns: 66, avg. 222, 328); in each panel, scores decline as subtest language demands increase, from subtests such as Picture Completion, Coding, and Block Design down to Vocabulary, Comprehension, Similarities, and Information.]

Research Foundations for EL Evaluation

Principle 3: EL performance is moderated by linguistic/acculturative variables.

[Figure: Cognitive test scores arranged by linguistic demand, from Tier 1 (lowest language demands) to Tier 5 (highest language demands).]

*Source: Cormier, D.C., McGrew, K.S. & Ysseldyke, J. E. (2014). The Influences of Linguistic Demand and Cultural Loading on Cognitive Test Scores. Journal of Psychoeducational

Assessment, 32(7), 610-623.

Research Foundations for EL Evaluation

Principle 3: EL performance is moderated by linguistic/acculturative variables.

Mean WJ III GIA across the four levels of English language proficiency on the New York State ESL Achievement Test (NYSESLAT):

NYSESLAT Level   Mean WJ III GIA
Proficient       101.0
Advanced         89.55
Intermediate     82.29
Beginner         71.75

The difference is about 1 SD (15 points) on average, but can be as much as 2 SD (30 points).

Source: Sotelo-Dynega, M., Ortiz, S.O., Flanagan, D.P., Chaplin, W. (2013). English Language Proficiency and Test Performance: Evaluation of bilinguals with the Woodcock-Johnson III Tests of Cognitive Ability. Psychology in the Schools, Vol 50(8), pp. 781-797.

Research Foundations for ELL Evaluation

Principle 3: ELL performance is moderated by linguistic/acculturative variables.

Domain-specific scores across the seven WJ III subtests (SR [Gv], VM [Gs], NR [Gsm], SB [Ga], VAL [Glr], CF [Gf], VC [Gc]) by NYSESLAT proficiency level (Proficient, Advanced, Intermediate, Beginner): the less the developmental proficiency compared to monolingual native English speakers, the more test performance drops as a function of the linguistic demands of the tests administered.

Source: Sotelo-Dynega, M., Ortiz, S.O., Flanagan, D.P., Chaplin, W. (2013). English Language Proficiency and Test Performance: Evaluation of bilinguals with the Woodcock-Johnson III Tests of Cognitive Ability. Psychology in the Schools, Vol 50(8), pp. 781-797.

Research Foundations for ELL Evaluation

Principle 3: ELL performance is moderated by linguistic/acculturative variables.

Mean subtest scores across the four WASI subtests and four WMLS-R subtests according to language proficiency level: the less the developmental proficiency compared to monolingual native English speakers, the more test performance drops as a function of the linguistic demands of the tests administered.
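The size of these proficiency-related gaps can be checked with quick arithmetic on the GIA means reported for the NYSESLAT proficiency levels (a sketch; the 15-point SD of the standard score metric is the assumption here):

```python
# Mean WJ III GIA by NYSESLAT proficiency level (Sotelo-Dynega et al., 2013)
gia_means = {
    "Proficient": 101.0,
    "Advanced": 89.55,
    "Intermediate": 82.29,
    "Beginner": 71.75,
}
NORM_MEAN, SD = 100.0, 15.0

# Distance of each group from the normative mean, in SD units
# (positive = below the norm mean).
gaps_sd = {level: (NORM_MEAN - m) / SD for level, m in gia_means.items()}
for level, gap in gaps_sd.items():
    print(f"{level}: GIA {gia_means[level]} ({gap:.2f} SD below 100)")
```

The Beginner group sits close to 2 SD (about 30 points) below the normative mean, while the middle groups fall roughly 1 SD below, consistent with the note that the gap averages about 1 SD and can reach 2 SD.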

BD LWI Low Proficiency ANA DICT Intermediate Prof. SIM VOC PIC High Proficiency Source: Dynda, A. M. (2008). The relation between language proficiency and IQ test performance. Unpublished manuscript. St. Johns University, NY. Summary of the Foundational Research Principles of the Culture-Language Interpretive Matrix Principle 1: EL and non-ELs perform differently at the broad ability level on tests of cognitive ability. Principle 2: ELs perform better on nonverbal tests than they do on verbal tests. Principle 3: EL performance on both verbal and nonverbal tests is moderated by linguistic Because theand basic research variables. principles underlying the C-LIM are well supported, it acculturative

means that use of the C-LIM is valid and renders it an example of evidence-based practice. This does not mean, however, that it cannot be improved. Productive research on EL test performance can assist in making any necessary adjustments to the order of the means as arranged in the C-LIM. Likewise, as new tests come out, new research is needed to determine the relative level of EL performance as compared to other tests with established values of expected average performance. Ultimately, only research that focuses on stratifying samples by relevant variables such as language proficiency, length and type of English and native language instruction, and developmental issues related to age and grade of first exposure to English, will serve useful in furthering knowledge in this area and assist in establishing appropriate expectations of test performance for specific populations of ELs. Fairness in Determining Average Performance Typical average Range for Non-EL SS=100 SS=100 Typical average SS=99 SS=97 Range for ELLs SS=100 SS=95 SS=100 SS=100 SS=92 SS=90 SS=100 SS=89 SS=100 SS=100

SS=86 SS=83 SS=100 SS=80 16 80 2 70 <1 Tests with low language demands Tests with mod language demands Vocabulary 90 Similarities 50 Information 100 Arithmetic

86 Letter-Number Seq. 110 Digit Span 98 Block Design 120 Picture Concepts 99+ Matrix Reasoning 130 Tests with high language demands The Culture-Language Interpretive Matrix (C-LIM) Important Facts for Use and Practice The C-LIM is not a test, scale, measure, or mechanism for making diagnoses. It is a visual representation of current and previous research on the test performance of English learners arranged by mean values to permit examination of the combined influence of acculturative knowledge acquisition and limited English proficiency and its impact on test score validity. The C-LIM is not a language proficiency measure and will not distinguish native English speakers from English learners with high, native-like English proficiency and is not designed to determine if someone is or is not an English learner. Moreover, the C-LIM is not for use with individuals who are native English speakers. The C-LIM is not designed or intended for diagnosing any particular disability but rather as a tool to

assist clinicians in making decisions regarding whether ability test scores should be viewed as indications of actual disability or as a reflection of differences in language proficiency and acculturative knowledge acquisition. The primary purpose of the C-LIM is to assist evaluators in ruling out cultural and linguistic influences as exclusionary factors that may have undermined the validity of test scores, particularly in evaluations of SLD or other cognitive-based disorders. Being able to make this determination is the main hurdle in evaluation of ELLs, and the C-LIM's purpose is to provide an evidence-based method that assists clinicians in interpreting test score data in a nondiscriminatory manner.

Resources for Evaluation of English Learners
The following documents may be freely downloaded at the respective URLs. Note that the information contained in the packets is Copyright Ortiz, Flanagan, & Alfonso and may not be published elsewhere without permission. However, permission is hereby granted for reproduction and use for personal, not-for-profit, educational purposes only.
General C-LIM web site with full file listing: http://facpub.stjohns.edu/~ortizs/CLIM/
Culture-Language Interpretive Matrix, Non-Automated Version (Excel): http://facpub.stjohns.edu/~ortizs/CLIM/CLIM-Basic.xls
Culture-Language Interpretive Matrix, Tutorial on Instruction and Interpretation (PowerPoint): http://facpub.stjohns.edu/~ortizs/CLIM/CLIM-Instructions.ppt
Culture-Language Interpretive Matrix, General (Word): http://facpub.stjohns.edu/~ortizs/CLIM/CLIM-General.doc
Culture-Language Test Classifications Reference List, Complete (Word): http://facpub.stjohns.edu/~ortizs/CLIM/CLTC-Reference-List.doc
Culture-Language Interpretive Matrix, Sample Validity Statements (Word): http://facpub.stjohns.edu/~ortizs/CLIM/CLIM-Interpretive-Statements.doc
Sample Report Using C-LIM, Case of Carlos, Identified as SLD (Word): http://facpub.stjohns.edu/~ortizs/CLIM/Sample-Report-Carlos-Yes-LD.doc
Sample Report Using C-LIM, Case of Maria, Not Identified as SLD (Word): http://facpub.stjohns.edu/~ortizs/CLIM/Sample-Report-Maria-No-LD.doc

The Culture-Language Interpretive Matrix (C-LIM)
Addressing test score validity for ELLs: Translation of Research into Practice

1. The use of various traditional methods for evaluating ELLs, including testing in the dominant language, modified testing, nonverbal testing, or testing in the native language, does not ensure valid results and provides no mechanism for determining whether results are valid, let alone what they might mean or signify.
2. The pattern of ELL test performance, when tests are administered in English, has been established by research and is predictable, based on the examinee's degree of English language proficiency and acculturative experiences/opportunities as compared to native English speakers.
3. The use of research on ELL test performance, when tests are administered in English, provides the only current method for applying evidence to determine the extent to which obtained results are likely valid (a minimal or only contributory influence of cultural and linguistic factors), possibly valid (a minimal or contributory influence of cultural and linguistic factors, but requiring additional evidence from native language evaluation), or likely invalid (a primary influence of cultural and linguistic factors).
4. The principles of ELL test performance as established by research are the foundations upon which the C-LIM is based, and they serve as a de facto norm sample for the purposes of comparing test results of individual ELLs to the performance of a group of average ELLs, with a specific focus on the attenuating influence of cultural and linguistic factors.

The Culture-Language Interpretive Matrix (C-LIM)
GENERAL RULES AND GUIDANCE FOR EVALUATION OF TEST SCORE VALIDITY
There are two basic criteria that, when both are met, provide evidence to suggest that test performance reflects the primary influence of cultural and linguistic factors and not actual ability, or lack thereof. These criteria are:
1.
There exists a general, overall pattern of decline in the scores from left to right and diagonally across the matrix, where performance is highest on the less linguistically demanding/culturally loaded tests (low/low cells) and performance is lowest on the more linguistically demanding/culturally loaded tests (high/high cells); and
2. The magnitude of the aggregate test scores across the matrix for all cells falls within or above the expected range of difference (shaded area around the line) determined to be most representative of the examinee's background and development relative to the sample on whom the test was normed.
When both criteria are observed, it may be concluded that the test scores are

likely to have been influenced primarily by the presence of cultural/linguistic variables, and therefore are not likely to be valid and should not be interpreted.

Application of Research as Foundations for the Cultural and Linguistic Classification of Tests and Culture-Language Interpretive Matrix
PATTERN OF EXPECTED PERFORMANCE FOR ENGLISH LANGUAGE LEARNERS
Performance is least affected (minimal or no effect of culture and language differences) where the degree of cultural loading and the degree of linguistic demand are both low. Moving from low to high linguistic demand produces an increasing effect of language difference; moving from low to high cultural loading produces an increasing effect of cultural difference. PERFORMANCE MOST AFFECTED

(LARGE COMBINED EFFECT OF CULTURE & LANGUAGE DIFFERENCES)

Application of Research as Foundations for the Cultural and Linguistic Classification of Tests and Culture-Language Interpretive Matrix
PATTERN OF EXPECTED PERFORMANCE FOR ENGLISH LANGUAGE LEARNERS
Cells are tiered from 1 (highest mean subtest scores, closest to the mean) to 5 (lowest mean subtest scores, farthest from the mean):

Degree of Cultural Loading \ Degree of Linguistic Demand: Low | Moderate | High
Low: 1 | 2 | 3
Moderate: 2 | 3 | 4
High: 3 | 4 | 5

Research Foundations for ELL Evaluation
Principle 3: ELL performance is moderated by linguistic/acculturative variables

Subtest (scale scores) | Hispanic Group (Mercer, 1972) | Hispanic Group (Vukovich & Figueroa, 1982) | ESL Group (Cummins, 1982) | Bilingual Group (Nieves-Brull, 2006) | Grand Mean | C-LIM Tier
Information | 7.5 | 8.0 | 7.6 | 7.8 | 85 | 5
Vocabulary | 8.3 | 8.7 | 7.8 | 8.3 | 87 | 5
Similarities | 8.8 | 9.0 | 8.5 | 9.4 | 89 | 4
Comprehension | 5.1 | 6.1 | 6.4 | 6.7 | 89 | 4
Digit Span | 7.3 | 7.4 | 7.2 | 7.5 | 90 | 3
Arithmetic | 8.2 | 8.0 | * | 7.8 | 92 | 3
Picture Arrangement | 9.0 | 10.3 | 8.0 | 9.2 | 96 | 3
Block Design | 9.5 | 9.6 | 9.7 | 9.6 | 97 | 2
Object Assembly | 10.8 | 10.7 | 9.9 | 10.9 | 98 | 2
Picture Completion | 8.0 | 8.4 | 8.7 | 8.9 | 97 | 1
Coding | 9.4 | 9.3 | 9.5 | 9.6 | 99 | 1
*Data for this subtest were not reported in the study.

Application of Research as Foundations for the Cultural and Linguistic Classification of Tests and Culture-Language Interpretive Matrix
SAMPLE OF RESEARCH-BASED MEANS REGARDING EXPECTED PERFORMANCE FOR ENGLISH LEARNERS

Degree of Cultural Loading \ Degree of Linguistic Demand: Low | Moderate | High
Low: SS = 99 (Tier 1) | SS = 97 (Tier 2) | SS = 91 (Tier 3)
Moderate: SS = 97 (Tier 2) | SS = 91 (Tier 3) | SS = 89 (Tier 4)
High: SS = 91 (Tier 3) | SS = 89 (Tier 4) | SS = 85 (Tier 5)

Because research is conducted with highly proficient ELs, these values represent performance only for slightly different individuals. Those with less English proficiency will score proportionally lower.

Application of Research as Foundations for the Cultural and Linguistic Classification of Tests and Culture-Language Interpretive Matrix
RESEARCH-BASED MEANS REGARDING EXPECTED PERFORMANCE FOR ELs BY DEGREE OF DIFFERENCE

Cell (Cultural Loading/Linguistic Demand) | Slightly Different | Moderately Different | Markedly Different
Low/Low: 3-5 points | 5-7 points | 7-10 points
Low/Moderate: 5-7 points | 7-10 points | 10-15 points
Low/High: 7-10 points | 10-15 points | 15-20 points
Moderate/Low: 5-7 points | 7-10 points | 10-15 points
Moderate/Moderate: 7-10 points | 10-15 points | 15-20 points
Moderate/High: 10-15 points | 15-20 points | 20-25 points
High/Low: 7-10 points | 10-15 points | 15-20 points
High/Moderate: 10-15 points | 15-20 points | 20-25 points
High/High: 15-20 points | 20-25 points | 25-35 points

Slightly Different: Includes individuals with very high levels of English language proficiency (e.g., CALP) and high acculturation, but still not entirely comparable to mainstream U.S. English speakers. Examples include individuals who are third generation in the U.S., have well-educated/higher-SES parents, have attended a dual-language program for at least 6-7 years, or demonstrate native or near native-like proficiency in English conversation and solid literacy skills. (Not a common category.)
Moderately Different: Includes individuals with moderate to higher levels of English language proficiency (e.g., advanced BICS/emerging CALP) and typical EL acculturative learning experiences. Examples include individuals who were born in or came early to the U.S. with limited-English-speaking parents, usually from low to very low SES, with parents having low or limited literacy even in their own language, and who generally received formal education only or primarily in English since starting school.
Markedly Different: Includes individuals with low to very low levels of English language proficiency (e.g., early BICS) or very limited acculturative learning experiences due to unusual influences on development. Examples include extremely low and limited parental SES and education, recent arrival in the U.S. or residence in the U.S. of 3 years or less, lack of prior formal education, exposure to trauma, violence, abuse, or neglect, time spent in refugee or resettlement camps, and changes in or multiple early languages.

Research Foundations of the C-LIM
Additional Issues in Evaluation of Test Score Patterns
Evaluation of test score validity, particularly in cases where results are possibly valid, includes considerations such as:
1. Is the Tiered graph consistent with the main Culture-Language graph or the other secondary (language-only/culture-only) graphs?
2.
Is there any variability in the scores that form the aggregate in a particular cell that may be masking low performance?
3. Is the pattern of scores consistent with a developmental explanation of the examinee's educational program and experiences?
4. Is the pattern of scores consistent with a developmental explanation of the examinee's linguistic/acculturative learning experiences?
Evaluation of results using all graphs, including secondary ones, identification of score variability in relation to CHC domains or task characteristics, and evaluation of educational, cultural, and linguistic developmental experiences assists in determining the most likely cause of score patterns and overall test score validity.

The Culture-Language Interpretive Matrix (C-LIM)
RANGE OF POSSIBLE OUTCOMES WHEN EVALUATING TEST SCORES WITHIN C-LIM
Condition A: Overall pattern generally appears to decline across all cells and all cell aggregate scores are within or above the shaded range; test scores are likely invalid, cultural/linguistic factors are

primary influences, but the examinee likely has average or higher ability, as the data do not support deficits, and further evaluation via testing is unnecessary.
Condition B: Overall pattern generally appears to decline across all cells, but at least one cell aggregate (or more) is below the shaded range; test scores are possibly valid, cultural/linguistic factors are contributory influences, and further evaluation, including in the native language, is necessary to establish true weaknesses in a given domain.
Condition C: Overall pattern does not appear to decline across all cells and all cell aggregate scores are within or above the average range; test scores are likely valid, cultural/linguistic factors are minimal influences, and further evaluation may be unnecessary if no weaknesses exist in any domain.
Condition D: Overall pattern does not appear to decline across all cells and at least one cell aggregate (or more) is below the average range; test scores are possibly valid, cultural/linguistic factors are minimal influences, and further evaluation, including in the native language, is necessary to establish true weaknesses in a given domain.

The Culture-Language Interpretive Matrix (C-LIM)
RANGE OF POSSIBLE OUTCOMES WHEN EVALUATING TEST SCORES WITHIN C-LIM

Condition | General, overall pattern of decline exists? | All scores within or above the expected range? | All scores within or above the average range? | Degree of influence of cultural and linguistic factors | Likelihood that test scores are valid indicators of ability?
A | Yes | Yes | No | Primary | Unlikely
B | Yes | No | No | Contributory | Possibly*
C | No | Yes | Yes | Minimal | Likely
D | No | No | No | Minimal | Possibly*

*Determination regarding the validity of test scores that are below the expected and average ranges requires additional data and information, particularly results from native language evaluation, qualitative evaluation and analysis, and data from a strong pre-referral process (e.g., progress monitoring data).

C-LIM Guidelines for Evaluating Test Scores
CONDITION A: General declining pattern, all scores within or above expected range. CULTURE/LANGUAGE INFLUENCE: PRIMARY; all test scores are UNLIKELY to be valid.

C-LIM Guidelines for Evaluating Test Scores

CONDITION B: Generally declining pattern, one or more scores below expected range. CULTURE/LANGUAGE INFLUENCE: CONTRIBUTORY; low test scores are POSSIBLY valid.

C-LIM Guidelines for Evaluating Test Scores
CONDITION C: No declining pattern, all scores within or above average range. CULTURE/LANGUAGE INFLUENCE: MINIMAL; all test scores are LIKELY to be valid.

C-LIM Guidelines for Evaluating Test Scores
CONDITION D: No declining pattern, one or more scores below average range. CULTURE/LANGUAGE INFLUENCE: MINIMAL; low test scores are POSSIBLY valid.

C-LIM Additional Interpretive Issues
KABC-II DATA FOR TRAN (ENGLISH)
CONDITION B: Generally declining pattern, one or more scores below expected range. CULTURE/LANGUAGE INFLUENCE: CONTRIBUTORY; low test scores are POSSIBLY valid.

C-LIM Additional Interpretive Issues
WJ IV COG DATA FOR HADJI (ENGLISH)
Expected rate of decline versus steeper rate of decline.
CONDITION B: Generally declining pattern, one or more scores below expected range. CULTURE/LANGUAGE INFLUENCE: CONTRIBUTORY; low test scores are POSSIBLY valid.

Comparison of Patterns of Performance Among English Speakers and English Learners with SLD, SLI, and ID
[Figure: mean C-LIM cell aggregates for WPPSI-III subtests arranged by degree of cultural loading and linguistic demand (LC-LL through HC-HL) for groups ES-NL, EL-NL, EL-ID, and EL-SL.]
Source: Tychanska, J., Ortiz, S. O., Flanagan, D. P., & Terjesen, M. (2009), unpublished data.

Translating Research into Practice
Evaluation Issues and Methods
Criteria for comparing approaches:
- Measures a wider range of school-related abilities
- Does not require the evaluator to be bilingual
- Adheres to the test's standardized protocol
- Substantial research base on bilingual performance
- Sufficient to identify or diagnose disability
- Accounts for variation in bilingual development
- Most likely to yield reliable and valid data and information
- Provides extensive data regarding development
- Norm sample representative of bilingual development
Approaches compared:
- Modified or Altered Assessment
- Reduced-language Assessment
- Dominant Monolingual Assessment in L1: native language only
- Dominant Monolingual Assessment in L2: English only
- Multilingual Assessment in L1 + L2
Multilingual Assessment combined with the C-LIM resolves all validity issues, and by applying research on ELL test performance, it can be used to define and establish a true peer reference group for disability-based evaluations.

Practical Considerations for Addressing Validity in Disability Evaluation Procedures with ELLs
Although the C-LIM provides a true peer comparison base, its use must be integrated within a best practices framework for assessment of English learners. Such a framework is not achieved by any of the previous methods alone (i.e., testing the limits, nonverbal testing, testing in L1), and requires consideration of

other factors that are often overlooked or ignored (e.g., alterations to the standard protocol render scores invalid, only a small fraction of evaluators are bilingual, suitable tests are often not available in languages other than Spanish, ELL students in the U.S. are bilingual, not monolingual, etc.). Moreover, research on ELL test performance exists only for tests administered in English, not in the native language. Therefore, in addition to the need to be systematic and research-based, a best practice framework for assessment of English learners must take into account these and other practical issues so as to be applicable and useful for all evaluators, not merely those with bilingual skills and training.

Practical Considerations for Addressing Validity in Disability Evaluation Procedures with ELLs
1. The usual purpose of testing is to identify deficits in ability (i.e., low scores).
2. Validity is more of a concern for low scores than average/higher scores because:
- Test performance in the average range is NOT likely a chance finding and strongly suggests average ability (i.e., no deficits in ability).
- Test performance that is below average MAY be a chance finding because of experiential or developmental differences and thus does not automatically confirm below average ability (i.e., possible deficits in ability).
3. Therefore, testing in one language only (English or native language) means that:
- It CAN be determined that a student DOES NOT have a disability (i.e., if all scores are average or higher, they are very likely to be valid).
- It CANNOT be determined whether the student has a disability (i.e., low scores must be validated as true indicators of deficit ability).
4. Testing in both languages (English and native language) is necessary to determine disability:
- Testing requires confirmation that deficits are not language-specific and exist in both languages (although low performance in both can result from other factors).
5.
All low test scores, whether in English or the native language, must be validated

- Low scores from testing in English can be validated via the research underlying the C-LIM.
- Low scores from testing in the native language cannot be validated with research.

Practical Considerations for Addressing Validity in Disability Evaluation Procedures with ELLs
Given the preceding considerations, the most practical and defensible general approach in evaluating ELLs would be:
- Test in English first; if all test scores indicate strengths (average or higher), a disability is not likely and thus no further testing is necessary.
- If some scores from testing in English indicate weaknesses, re-test those areas in the native language to support them as areas of true weakness.
- Use all other case data and information to serve as the context by which to evaluate the test scores and ensure ecological validity of conclusions.
When combined with the C-LIM, this approach provides an efficient, research-based, and IDEA-compliant process that makes best use of available resources for evaluation consistent with current standards, as it permits ANY evaluator to begin (and in some cases, complete) the testing without being bilingual or requiring outside assistance.

A Best Practice Framework for Evaluation and Disability Testing with ELLs
Step 1. Evaluate construct validity in all areas in English (exclusion of cultural/linguistic factors). Test in English first and use the C-LIM to evaluate scores. If all scores indicate strengths (average or higher), a disability is not likely and no further testing is necessary. If any scores suggest weaknesses, continue evaluation.
Step 2. Re-evaluate construct validity in areas of weakness in the native language (cross-linguistic evidence). If some scores from testing in English indicate weaknesses, re-test those areas in the native language to support them as areas of true weakness.
Step 3.
Cross-validate L1 and L2 test scores with contextual factors and data (ecological validity for disability). Use all other case data and information to serve as the context by which to evaluate the test scores and ensure ecological validity of conclusions.
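The stepwise decision flow above can be sketched in code. The function name and the 85-point cutoff for flagging a suspected weakness are illustrative assumptions for this sketch only; in practice, the C-LIM evaluation in Step 1 and clinical judgment, not a fixed cutoff, drive the decision.

```python
WEAKNESS_CUTOFF = 85  # hypothetical threshold for flagging a suspected weakness

def evaluate_english_learner(english_scores):
    """Step 1: test in English and flag areas needing native-language re-testing.

    english_scores: mapping of ability/achievement domain -> standard score.
    Returns (decision, domains_to_retest_in_native_language).
    """
    weaknesses = sorted(d for d, s in english_scores.items() if s < WEAKNESS_CUTOFF)
    if not weaknesses:
        # All scores average or higher: disability not likely; stop testing.
        return "no further testing necessary", []
    # Step 2: re-test only the suspected weaknesses in the native language.
    # Step 3 (not modeled here): cross-validate with contextual/ecological data.
    return "re-test areas in native language", weaknesses

decision, retest = evaluate_english_learner(
    {"Gc": 76, "Gwm": 79, "Gf": 82, "Gv": 95, "Gs": 94})
print(decision, retest)  # prints: re-test areas in native language ['Gc', 'Gf', 'Gwm']
```

Note the efficiency built into the flow: average-or-better scores end the process without any native-language testing, which is what allows an evaluator who is not bilingual to begin (and sometimes complete) the evaluation.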

A Guided Case Study
Example of Evaluation of an English Learner for Specific Learning Disability
Evaluation of Jose Maria
Tests Used: WISC-V, WIAT-III, and WJ IV
DOE: 6/22/2016  DOB: 10/4/2006  Grade: 4

SLD Identification with an English Learner: A Case Study
Step 1. Evaluate construct validity in all areas in English (exclusion of cultural/linguistic factors). Test in English first and use the C-LIM to evaluate scores. If all scores indicate strengths (average or higher), a disability is not likely and no further testing is necessary. If any scores suggest weaknesses, continue evaluation.
Step 2. Re-evaluate construct validity in areas of weakness in the native language (cross-linguistic evidence). If some scores from testing in English indicate weaknesses, re-test those areas in the native language to support them as areas of true weakness.
Step 3. Cross-validate L1 and L2 test scores with contextual factors and data (ecological validity for disability). Use all other case data and information to serve as the context by which to evaluate the test scores and ensure ecological validity of conclusions.

SLD Identification with an English Learner: A Case Study
WISC-V/WJ IV/WIAT-III XBA DATA FOR Jose Maria
DOE: 6/22/2016  DOB: 10/4/2006  Grade: 4

WECHSLER INTELLIGENCE SCALE FOR CHILDREN-V
Verbal Comprehension Index 76 (Similarities 5, Vocabulary 6)
Visual-Spatial Index 95 (Block Design 9, Visual Puzzles 9)
Fluid Reasoning Index 82 (Matrix Reasoning 7, Figure Weights 7)
Working Memory Index 79 (Digit Span 5, Picture Span 7)
Processing Speed Index 94 (Coding 9, Symbol Search 8)

WECHSLER INDIVIDUAL ACHIEVEMENT TEST-III
Basic Reading 94 (Word Reading 92, Pseudoword Decoding 98)
Reading Comprehension 76 (Reading Comprehension 76, Oral Reading Fluency 80)
Written Expression 92 (Spelling 100, Sentence Composition 86, Essay Composition 93)

WOODCOCK JOHNSON-IV TESTS OF COGNITIVE ABILITY
Auditory Processing 91; Phonological Processing 99
LT Storage/Retrieval 77; Story Recall 79

SLD Identification with an English Learner: A Case Study
The C-LIM is used to interpret the pattern of results with respect to whether scores were primarily influenced by cultural/linguistic factors or not.

SLD Identification with an English Learner: A Case Study
The Tiered graph shows minimal decline and below-expected results that are not fully explainable by cultural/linguistic factors alone; some other factor must be present.

SLD Identification with an English Learner: A Case Study
The C-L graph also shows a disrupted declining pattern and reinforces the conclusion that results are not primarily attributable to cultural and linguistic factors.

SLD Identification with an English Learner: A Case Study
The conclusion from evaluation of the score pattern in the C-LIM is consistent with Condition D, which indicates lack of a general, overall pattern of decline and at least one cell aggregate below the expected range. Therefore, results are only minimally affected by cultural and linguistic issues and are possibly valid; further validation is now required.
Final determination regarding the validity of test scores that are below the expected and average ranges requires additional data and information, particularly results from native language evaluation, qualitative evaluation and analysis, and data from a strong pre-referral process (e.g., progress monitoring data).

SLD Identification with an English Learner: A Case Study
Step 1. Evaluate construct validity in all areas in English (exclusion of cultural/linguistic factors)

Test in English first and use the C-LIM to evaluate scores. If all scores indicate strengths (average or higher), a disability is not likely and no further testing is necessary. If any scores suggest weaknesses, continue evaluation.
Step 2. Re-evaluate construct validity in areas of weakness in the native language (cross-linguistic evidence). If some scores from testing in English indicate weaknesses, re-test those areas in the native language to support them as areas of true weakness.
Step 3. Cross-validate L1 and L2 test scores with contextual factors and data (ecological validity for disability). Use all other case data and information to serve as the context by which to evaluate the test scores and ensure ecological validity of conclusions.

SLD Identification with an English Learner: A Case Study
WISC-V/WJ IV/WIAT-III XBA DATA FOR Jose Maria (scores as listed above)
Only areas of possible deficit need to be re-evaluated in the native language (e.g., via use of native-language tests, interpreters/translators, etc.). Scores that are average or better do not need to be re-evaluated.

SLD Identification with an English Learner: A Case Study
There are four possible areas of cognitive weakness that may suggest deficits related to the reported academic difficulties, as well as three areas of strength. However, because these tests are not designed for English learners, for the areas of suspected weakness it is necessary to generate additional information and data to confirm cross-linguistically that they are true deficits. Strengths do not support disability identification and therefore do not require any further validation.

SLD Identification with an English Learner: A Case Study
In addition, because Gc itself is language, it cannot be compared fairly to native English speaker norms to determine whether it is a strength or weakness, even when scores are deemed valid using the C-LIM. Thus, in this case, additional procedures must be employed to determine whether Gc is actually a true weakness or not and whether it does or does not require re-evaluation.

SLD Identification with an English Learner: A Case Study
Interpretive Problems with Gc Scores with English Learners
Because Gc is, by definition, comprised of cultural knowledge and language development,

the influence of these factors cannot be separated from tasks designed to measure them. Thus, unless exposure to English is a controlled variable in a tests norm sample and the sample includes many different languages, Gc scores for ELLs always remain at risk for inequitable interpretation even when the overall pattern of scores within the C-LIM is determined to be valid. For example, a Gc score of 76 would be viewed as deficient relative to a norm sample comprised primarily of native English speakers. Moreover, testing in the native language doesnt solve this problem because current native-language tests treat ELs as being all the same (they arent), as if being behind in English is only temporary (it isnt), as if the country they come from is important (its not), and as if five years of English learning makes them native English speakers (it doesnt). Therefore, practitioners must find and rely on a true peer comparison group such as that which is formed within the High Culture/High Language cell of the C-LIM to help ensure that ELLs are not unfairly regarded as having either deficient Gc ability or significantly lower overall cognitive abilityconditions that may simultaneously decrease identification of SLD and increase suspicion of ID and speech impairment. SLD Identification with an English Learner: A Case Study Determining if and when to re-test Gc via the C-LIM Re-evaluation of suspected areas of weakness is necessary to provide cross-linguistic confirmation of potential deficits in functioning. A disability cannot be identified in an English learner if the observed difficulties occur only in one language. Even then, deficits that are identified in both languages are not definitive evidence of dysfunction and evaluation of expectations for native language performance is as relevant for native language evaluation as it is for evaluation in English. 
Because of the nature of Gc, it should be treated slightly differently from other cognitive abilities when it comes to re-evaluation. The following guidelines from the best practice recommendations apply specifically to Gc:
* Review results from testing in English and identify domains of suspected weakness or difficulty:
  a. For Gc only, evaluate weakness according to the high/high cell in the C-LIM or in the context of other data and information
* For Gc only:
  a. If the high/high cell in the C-LIM is within/above the expected range, consider Gc a strength and assume it is at least average (re-testing is not necessary)
  b. If the high/high cell in the C-LIM is below the expected range, re-testing of Gc in the native language is recommended
* For Gc only, scores obtained in the native language should be interpreted only relative to the developmental and educational experiences of the examinee in the native language, and only as compared to others with similar developmental experiences in the native language. It is important that the actual, obtained Gc score, regardless of magnitude, be reported when required, albeit with appropriate nondiscriminatory assignment of meaning, and that it be used for purposes of instructional planning and educational intervention.
* If Gc is evaluated with the Ortiz PVAT, use the actual score obtained from the English Learner norms (NOT the English Speaker norms) to determine whether it is an area of weakness. If the score indicates a weakness, it should then be further re-evaluated in the native language.

SLD Identification with an English Learner: A Case Study
Gc performance on the C-LIM Summary Graph is well within the expected average score/range when compared to other English language learner peers; therefore, further testing of Gc is not necessary.

SLD Identification with an English Learner: A Case Study
Similarly, Gc performance on the main C-LIM graph is well within the expected average score/range when compared to other English language learner peers; therefore, further testing of Gc is not necessary.

Interpretive Problems with Gc Scores with English Learners
Although the C-LIM helped determine that Gc is NOT an area of weakness, further evaluation and interpretation are complicated by the low magnitude of the score (i.e., SS = 76). Other corrections are necessary to prevent discriminatory decisions, particularly in evaluation of SLD or SLI. However, use of the Ortiz PVAT provides a simpler and more direct solution to all of these problems.

         English   Native Lang.   Valid?   Interpretation?
- Gc     76        -              No       S
- Gf     82        ?              ?
- Glr    77        ?              ?
- Gsm    78        ?              ?
- Gv     98        -              Yes      S
- Ga     92        -              Yes      S
- Gs     94        -              Yes      S

These are the seven major CHC broad abilities typically measured for evaluation of SLD, particularly within a Processing Strengths and Weaknesses (PSW) approach. The parentheses contain the corresponding five WISC indexes that are equivalent to the CHC broad abilities. Interpretation? Since the aggregate score in the C-LIM for Tier 5 (i.e., the High/High cell where all Gc tests are classified) was within the expected range corresponding to the selected degree of difference deemed most appropriate, Gc should be considered a strength despite the fact that its magnitude is only 76 and that it isn't technically a valid measure of intrinsic language-related abilities.

SLD Identification with an English Learner: A Case Study
Resolving Problems with Gc Scores for ELs: The Ortiz PVAT
Clearly, the preceding procedures for addressing validity issues in the measurement of Gc and language/culture-related abilities are complicated, somewhat cumbersome, and not very efficient. They may also leave the practitioner in the unenviable position of having to defend a very low score (SS = 76) as being technically invalid but still considered an area of processing strength. This one issue, more than any other, highlights the shortcomings of today's tests: their failure to provide a true peer comparison group for English learners, which would alleviate all of the extra work and potential confusion. There simply is no substitute for fair and equitable interpretation based on comparison to peers with similar developmental experiences. That said, there is in fact an easier way to do all of this. In response to the many difficulties posed by these issues, a new test has been developed with dual norm samples, including one specifically for English learners, that yields valid Gc scores for English learners of any language background and level of English exposure: the Ortiz PVAT.

The Ortiz PVAT: A new direction in tests and testing
Clinical and Educational Applications of the Ortiz PVAT
- Diagnostic evaluation: provides true peer comparisons for evaluating language-related disabilities/disorders in both English speakers and English learners
- Intervention/treatment: provides data and specific recommendations for language-based intervention keyed directly to performance relative to peers
- Instructional guidance: provides data and specific teaching and instructional recommendations based on performance relative to grade-level expectations
- Progress monitoring: provides data for documenting progress

across short intervals to evaluate the success of instruction and intervention efforts
- Growth: provides data and a specific index capable of documenting actual growth in vocabulary/language acquisition across short and long intervals

The Ortiz PVAT: A new direction in tests and testing
The Ortiz PVAT is a computer-based assessment that measures vocabulary acquisition in children and youths aged 2:6 to 22 using English language words, irrespective of the native/heritage language. Perhaps its most unique feature is a dual-norming structure with distinct norms for English speakers and English learners. The EL norms are based on ELs from various language backgrounds and specifically control for amount of English exposure. (Copyright 2018 Multi-Health Systems Inc. All rights reserved.)

Performance of English Learners Based on Comparison to English Learner vs. English Speaker Norm Samples
Developmental language/exposure-based comparison provides validity and fairness for ELs. On the graph, groups scored against the English Speaker norms show mean scores that drop as English exposure decreases: Monolingual English (100%) 102.3, High Exposure (50-100%) 94.1, Medium Exposure (11-50%) 90.6, Low Exposure (0-10%) 86.8. These scores are valid only for determining instructional level and need but are invalid for diagnostic purposes. Scored against the English Learner norms, the same exposure groups all obtain means of approximately 100 (100.1, 100.0, and 100.2); only these scores are valid for diagnostic purposes, and they demonstrate average ability and development. (This graph is reproduced from the Technical Manual of the Ortiz PVAT and is Copyright 2017 Multi-Health Systems Inc. All rights reserved.)

Fairness Across All English Learners: No evidence of bias regarding native language
Assumption: English language acquisition is an invariant process, irrespective of the native language. (This table is reproduced from the Technical Manual of the Ortiz PVAT and is Copyright 2017 Multi-Health Systems Inc. All rights reserved.)

SLD Identification with an English Learner: A Case Study
WISC-V/WJ IV/WIAT-III XBA DATA FOR Jose Maria (DOE: 6/22/2016; DOB: 10/4/2006; Grade: 4)
(Score table repeated from the earlier slide, now with Ortiz PVAT (EL Norms) = 93 added.)
Although we are adding the Ortiz PVAT at this point in the evaluation, it would have been easiest to simply include it as a standard part of any battery, particularly because it can be administered to any individual to generate a valid Gc score; in the case of ELs, it will also address the ever-present Gc problem and provide that information in an interpretive summary report.

SLD Identification with an English Learner: A Case Study
Avoiding Interpretive Problems by Use of the Ortiz PVAT
Derivation of an Ortiz PVAT score using the English learner norms eliminates the Gc problem completely. The Ortiz PVAT score simply replaces any Gc/language-related/verbal ability score because it was derived precisely on true peers and is therefore inherently valid in terms of both meaning/classification and actual magnitude (e.g., 90-109 = average).

                     English   Spanish   Valid?   Interpretation?
- Gc                 76        -         No       ?
- Gf                 82        ?         ?
- Glr                77        ?         ?
- Gsm                78        ?         ?
- Gv                 98        -         Yes      S
- Ga                 92        -         Yes      S
- Gs                 94        -         Yes      S
- Gc (Ortiz PVAT)    93        -         Yes      S

Use of the Ortiz PVAT requires no native language confirmation, since the score is derived from norms that control for amount of exposure to English and is based on a true peer comparison group for both English speakers and English learners. Therefore, it is valid and may be interpreted directly as a strength or weakness without any further cross-linguistic validation. It also eliminates the potential confusion and difficulty of having to explain why a low score (e.g., 76) is a strength, not a weakness.

Nondiscriminatory Interpretation of Test Scores: A Case Study
Determining if and when to re-evaluate all other (non-Gc) abilities
Because cultural knowledge and language ability are not the primary focus in measurement of other abilities, the influence of cultural/linguistic factors can be determined via the C-LIM, and scores below the expected range of performance may well be deemed the result of factors other than cultural knowledge or language ability. Thus, there is no limitation requiring comparison of performance to a true ELL peer group as there is with Gc, and use of a test's norms and the attendant standard classification scheme is appropriate for determining areas of suspected weakness with tests administered in English for abilities other than Gc. However, to establish validity for a low score obtained from testing in English with an ELL, native language evaluation is required. The following guidelines from the best practice recommendations apply to all abilities, including Gc when it has been determined to be a weakness because it falls below the expected range of difference in the C-LIM:*
* Review results from testing in English and identify domains of suspected weakness or difficulty:
  a. For all abilities except Gc, evaluate weakness using standard classifications (e.g., SS < 90)
* Re-test all domains of suspected weakness, including Gc when it is not within the expected range of difference in the C-LIM,* using native language tests
* Administer native language tests or conduct re-testing using one of the following methods:
  a. Native language test administered in the native language (e.g., WJ III/Batería III or WISC-IV/WISC-IV Spanish)
  b. Native language test administered via assistance of a trained interpreter
  c. English language test translated and administered via assistance of a trained interpreter
* Administer tests in the manner necessary to ensure full comprehension, including use of any modifications and alterations needed to reduce barriers to performance, while documenting approach to tasks, errors in responding, and behavior during testing; analyze scores both quantitatively and qualitatively to confirm and validate areas as true weaknesses

*Or, if Gc was evaluated with the Ortiz PVAT, the actual score, when compared to the English Learner norms (NOT the English Speaker norms), indicates that it is likely an area of weakness.

SLD Identification with an English Learner: A Case Study
Procedures for Follow-up Evaluation in the Native Language
When providing cross-linguistic confirmation of areas of weakness found via scores derived from testing in English, it is helpful (but not actually necessary) to generate scores. Qualitative information and data (e.g., process or error analysis, dynamic assessment, task observations, etc.) are equally helpful and useful for confirming areas of weakness. It is also reasonable to use the exact same tests for the follow-up evaluation in the native language as were used initially in the English language evaluation because, in this case, practice effects are diagnostically helpful in discerning learning ability from learning disability. Evaluation in the native language can be accomplished in several different ways and will likely depend on the competency of the evaluator and the available resources. Completion of the task may include one or more of the following procedures:
1. Use of native language tests (if available) administered by a bilingual evaluator
2. Use of native language tests (if available) administered by a trained translator
In the absence of parallel or similar native language tests with which to evaluate the necessary domains, follow-up evaluation will need to resort to other procedures for task completion, including:
3. Use of English language tests translated directly by a bilingual evaluator
4. Use of English language tests administered via assistance of a trained translator
5. Use of informal tasks accompanied by careful observation, error analysis, and other probing, with the assistance of a translator for communication.
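For readers who find decision rules easier to check in executable form, the re-testing logic described in the guidelines above can be sketched as a short function. This is a hypothetical illustration, not part of the presentation: the function name, the dictionary of CHC scores, and the single boolean standing in for the full C-LIM evaluation are all assumptions made for the sketch.

```python
# Hypothetical sketch of the re-testing decision rules described above.
# Non-Gc domains with a standard score below 90 are flagged for
# native-language follow-up; Gc is flagged only when its aggregate falls
# below the expected range in the C-LIM (otherwise it is treated as a
# strength regardless of the score's magnitude).

def domains_to_retest(english_scores, gc_within_clim_range):
    """english_scores: dict mapping CHC domain -> standard score from
    testing in English; gc_within_clim_range: True if the C-LIM
    high/high (Tier 5) aggregate is within or above the expected range."""
    retest = []
    for domain, ss in english_scores.items():
        if domain == "Gc":
            if not gc_within_clim_range:
                retest.append(domain)
        elif ss < 90:  # standard classification cutoff for suspected weakness
            retest.append(domain)
    return retest

# Case-study scores from the slides (English-language testing):
english = {"Gc": 76, "Gf": 82, "Glr": 77, "Gsm": 78, "Gv": 98, "Ga": 92, "Gs": 94}
print(domains_to_retest(english, gc_within_clim_range=True))  # ['Gf', 'Glr', 'Gsm']
```

Run on the case-study scores with Gc deemed within the expected C-LIM range, the sketch flags Gf, Glr, and Gsm for native-language follow-up, which matches the domains actually re-tested in the case.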
SLD Identification with an English Learner: A Case Study
WISC-V/WJ IV/WIAT-III XBA DATA FOR Jose Maria (DOE: 6/22/2016; DOB: 10/4/2006; Grade: 4)
(English-language score table repeated from the earlier slide, with native-language follow-up results added:)

- WISC-IV Spanish (Gf subtests): 91 (Matrix Reasoning 8, Picture Concepts 9)
- WISC-IV Spanish WMI: 72 (Digit Span 5, Letter-Number Sequencing 4)
- WJ IV Auditory Processing: 91 (Phonological Processing 99, Nonword Repetition 84)
- WJ IV LT Storage/Retrieval: 77 (Story Recall 79, Visual-Auditory Learning 75)
- Batería III LT Retrieval: 79 (Visual-Auditory Learning 81)
- Ortiz PVAT (EL Norms): 93

Gf, Gsm, and Glr need to be re-tested in the native language to provide additional confirmation that they are true weaknesses. The same or similar tests can be used, and scores may be generated, but the main purpose is to observe performance in the domain qualitatively in order to provide cross-linguistic validation of suspected difficulties.

SLD Identification with an English Learner: A Case Study
(Score table repeated: results of native language testing for Gf, Gsm, and Glr.)

Nondiscriminatory Interpretation of Test Scores: A Case Study
Determining which scores are valid and interpretable
Average* or higher scores in testing are unlikely to be due to chance. Thus, when a score obtained from native language testing is found to be in the average range or higher, it effectively invalidates the original low score from testing in English, since deficits must exist in both languages. Conversely, if another low score in the same domain is obtained from the native language evaluation, it may serve to bolster the validity of the original score obtained in English. Based on these premises, the following guidelines from the best practice recommendations offer guidance regarding selection and use of the most appropriate and valid score for purposes of PSW analysis (or any other situation in which the validity of test scores is central or relevant):
* For all domains, including Gc, if a score obtained in the native language suggests a domain is a strength (SS > 90), it serves to invalidate/disconfirm the corresponding weakness score obtained in English; thus, report, use, and interpret the domain score obtained in the native language
* For all domains except Gc, if a score obtained in the native language also suggests weakness in the same domain (SS < 90), it serves to validate/confirm the corresponding weakness score obtained in English; thus, report, use, and interpret the original domain score obtained in English
* For Gc only, if a score obtained in the native language also suggests weakness in Gc (SS < 90), it may serve to

validate/confirm the corresponding weakness score obtained in English but only if low performance in Gc cannot be attributed to factors related to a lack or interruption of native language instruction and education, low family SES, or other lack of opportunity to learnthus, in the absence of such mitigating factors, report, use, and interpret the domain score obtained in English *Although average or higher (e.g., SS>90) is used as a recommended cutoff for supporting the validity of test scores, use of a lower standard (e.g., SS>85) may also represent a reasonable standard for practice since it is based on performance that can be categorized as being within normal limits. A Recommended Best Practice Approach for Using Tests with ELLs ADDRESSING VALIDITY AND INTERPRETATION OF SCORES GENERATED IN TWO LANGUAGES Original score when tested in English For ALL domains* For ALL domains (and when Gc is below expected range in C-LIM) For ALL domains (and when Gc is below expected range in C-LIM) SS > 90 SS < 90 SS <

90 Follow up score when tested in native language n/a Most appropriate and valid score for use in PSW analysis Original Score (in English) Rationale for Use as Strength or Weakness in PSW Analysis Strengthscores in or above the average range (or even WNL) are unlikely to occur by chance and very likely to be valid thus reevaluation in the native language is unnecessary SS > 90 SS < 90 Follow Up

Score (in native lang) Strengthbecause a deficit cannot exist in one language only, the original score from testing in English is invalidated and should be replaced by the follow up average score which is likely to be valid Weaknesslow scores in both languages suggest a true deficit but additional, convergent and consistent ecological evidence is required to substantiate scores as deficits StrengthGc can only be compared fairly to other ELLs, thus itsscores, position thestandard expected *Although this Gc tableOnly uses average or higher (e.g., SS>90) as a recommended cutoff for supporting the validity of test use within of a lower (e.g., For range in the C-LIM should SS>85)

alsoGc represent a reasonable standard for practice since it is based on performance that can be categorized as being within normal limits.be (andmay when is within considered to be average and the expected range in native language testing may not be C-LIM) necessary unless there is reason SS < 90 n/a SLD Identification with an English Learner: A Case Study Determining which scores are valid and interpretable Derivation of an Ortiz PVAT score using the English learner norms eliminates the Gc problem completely. The Ortiz PVAT score simply replaces any Gc/language-related/verbal ability score because it was derived precisely on EL true peers and therefore inherently valid in terms of both meaning/classification and actual magnitude (e.g., 90 - 109 = average). English Spanish Valid? Interpretation? - Gc 76 - 76 - No

- Gf                 (82)      91        Yes      S (use 91)
- Glr                77        (79)      Yes      W (use 77)
- Gsm                78        (72)      Yes      W (use 78)
- Gv                 98        -         Yes      S
- Ga                 92        -         Yes      S
- Gs                 94        -         Yes      S
- Gc (Ortiz PVAT)    93        -         Yes      S

(Scores in parentheses were disconfirmed or not used.) Additional native language investigation of the areas of weakness noted in scores derived from testing in English (with the exception of the score from the Ortiz PVAT) resulted in an average Gf score that invalidated the original Gf score, and two below-average scores that cross-linguistically confirmed Glr and Gsm as areas of weakness, as indicated by the test scores in English.

SLD Identification with an English Learner: A Case Study
Step 1. Evaluate construct validity in all areas in English (exclusion of cultural/linguistic factors): test in English first and use the C-LIM to evaluate scores. If all scores indicate strengths (average or higher), a disability is not likely and no further testing is necessary. If any scores suggest weaknesses, continue the evaluation.
Step 2. Re-evaluate construct validity in areas of weakness in the native language (cross-linguistic evidence): if some scores from testing in English indicate weaknesses, re-test those areas in the native language to support them as areas of true weakness.
Step 3. Cross-validate L1 and L2 test scores with contextual factors and data (ecological validity for disability): use all other case data and information as the context by which to evaluate the test scores and ensure the ecological validity of conclusions.

SLD Identification with an English Learner: A Case Study
The Importance of Converging Evidence in Establishing Validity
Validity is based on an accumulation of evidence. The evaluation approach described herein is designed to assist in generating test scores that may be interpreted as valid indicators of an individual's abilities. Embedded in the broader framework are two basic forms of evidence that bolster the validity of obtained test scores by using expectations of test performance grounded in research on individuals of comparable cultural and linguistic backgrounds and on the extent to which their development differs from the individuals on whom the tests were normed. Validity is thus inferred by:
1. Test scores from evaluation in English that have been subjected to systematic analysis of the influence of cultural and linguistic variables, where such factors have been found to be either minimal or contributory but not primary factors in test performance;
2. Test scores or qualitative data from evaluation of weak areas in the native language that either further confirm suspected areas of deficit as true or disconfirm them due to evidence of average or higher performance.
To these two forms of evidence, a third should be added to fully support conclusions and interpretation of the obtained test scores:
3. Ecological and contextual evidence regarding the consistency of the test scores with ecological data and information on developmental influences (e.g., L1 and L2 exposure, language of instruction, socio-economic status, parental education level, etc.) and the convergence of patterns of performance with other case data (e.g., progress monitoring data, pre-referral concerns, work samples, observations, school records, teacher/parent reports, grades, interviews, etc.).
Only when all three forms of evidence converge can there be sufficient confidence in the use and interpretation of test scores obtained in an evaluation of English learners.

SLD Identification with an English Learner: A Case Study
Corroborating test scores with additional, converging evidence

                     English   Spanish   Valid?   Interpretation?
- Gc                 76        -         No       -
- Gf                 (82)      91        Yes      S
- Glr                77        (79)      Yes      W
- Gsm                78        (72)      Yes      W
- Gv                 98        -         Yes      S
- Ga                 92        -         Yes      S
- Gs                 94        -         Yes      S
- Gc (Ortiz PVAT)    93        -         Yes      S

The areas of weakness identified in this case are in the domains of Glr and Gsm. Additional converging evidence that these are true weaknesses comes from both the native language evaluation results and corroborating information from pre- or post-referral interviews, record reviews, observations, work samples, etc., all of which are consistent in demonstrating that the individual has problems on tasks that require long-term or short-term memory skills (e.g., inconsistent learning, lack of expected fluency and automaticity, easily forgetting things, inability to follow multi-step directions, errors in procedural steps in math calculations, difficulty remembering what was just read, etc.).
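The cross-linguistic score-selection rules applied in the case above can likewise be sketched as a small function. This is a hypothetical illustration, not part of the presentation: it uses the document's literal SS > 90 / SS < 90 cutoffs and does not model the additional mitigating-factor conditions that apply to Gc.

```python
# Hypothetical sketch of the best-practice score-selection rules above:
# a native-language score above 90 invalidates a low English score (a
# deficit cannot exist in one language only), while a second low score
# confirms the English score as a true weakness.

def select_psw_score(english_ss, native_ss=None):
    """Return (score to use in PSW analysis, interpretation)."""
    if english_ss > 90:
        return english_ss, "S"   # strength; no native-language follow-up needed
    if native_ss is None:
        return english_ss, "?"   # weakness not yet cross-linguistically confirmed
    if native_ss > 90:
        return native_ss, "S"    # English score invalidated; use native score
    return english_ss, "W"       # weakness confirmed in both languages

# Case-study follow-up results from the slides:
print(select_psw_score(82, 91))  # (91, 'S')  Gf: original score invalidated
print(select_psw_score(77, 79))  # (77, 'W')  Glr: weakness confirmed
print(select_psw_score(78, 72))  # (78, 'W')  Gsm: weakness confirmed
```

Applied to the case data, the sketch reproduces the interpretations shown in the table: Gf becomes a strength via the native-language score of 91, while Glr and Gsm remain confirmed weaknesses.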

Sample Validity Statement for ELL Evaluations
The statement above is the one most appropriate for this case, where a) the evaluation focused on suspected SLD; and b) it was determined that the obtained test results were NOT influenced primarily by cultural and linguistic factors, albeit these remained contributory. Thus, the test results (except for Gc) could be considered valid estimates of the abilities that were measured. In addition, native language testing was conducted to further support cognitive test score validity. This statement (and three others) have been placed in the public domain and may be freely copied, modified, and distributed for non-profit purposes without the need to secure permission.

Culture-Language Interpretive Matrix: The Importance of the Context of Difference

Subtest                    Standard Score   95% Confidence Band   Description
Verbal Comprehension       64               56-72                 Very Low
Visual-Auditory Learning   88               76-100                Low Average
Spatial Relations          98               91-107                Average
Sound Blending             75               64-87                 Low
Concept Formation          70               62-78                 Low
Visual Matching            86               76-97                 Low Average
Numbers Reversed           80               67-93                 Low
Incomplete Words           78               65-91                 Low
Auditory Working Memory    85               76-94                 Low Average
Analysis-Synthesis         78               66-90                 Low
Auditory Attention         81               67-95                 Low
Decision Speed             72               63-81                 Low
Retrieval Fluency          82               69-95                 Low
General Information        69               60-78                 Very Low

Culture-Language Interpretive Matrix: The Importance of the Context of Difference

Evidence-based Assessment
Used in conjunction with other information relevant to appropriate bilingual, cross-cultural, nondiscriminatory assessment, including level of acculturation, language proficiency, socio-economic status, academic history, familial history, developmental data, work samples, curriculum-based data, intervention results, etc., the C-LIM represents the only current method for evaluating test score validity and decreasing the potential for biased and discriminatory interpretation. When integrated within any assessment, the C-LIM facilitates evidence-based assessment and, along with other assessment data, assists practitioners in answering the most basic question in EL assessment: Are the student's observed learning problems due primarily to cultural or linguistic differences, or to disorder?

Assessment and Related Resources
TESTS:
Ortiz Picture Vocabulary Acquisition Test (Ortiz PVAT) https://www.mhs.com/ortizpvat
BOOKS:
Rhodes, R., Ochoa, S. H. & Ortiz, S. O. (2005). Comprehensive Assessment of Culturally and Linguistically Diverse Students: A Practical Approach. New York: Guilford.
Flanagan, D. P., Ortiz, S. O. & Alfonso, V. C. (2013). Essentials of Cross-Battery Assessment, Third Edition. New York: Wiley & Sons, Inc.
Flanagan, D. P. & Ortiz, S. O. (2012). Essentials of Specific Learning Disability Identification. New York: Wiley & Sons, Inc.
Ortiz, S. O., Flanagan, D. P. & Alfonso, V. C. (2015). Cross-Battery Assessment Software System (X-BASS v1.4). New York: Wiley & Sons, Inc.
ONLINE:
Competency-based XBA Certification Program https://www.schoolneuropsych.com/xba/
CHC Cross-Battery Online http://www.crossbattery.com/
Free C-LIM Resources http://facpub.stjohns.edu/~ortizs/CLIM/index.html
For a free 30-day trial and two free uses: http://info.mhs.com/OrtizPVATfreetrial
Free webinar now available at: https://www.youtube.com/watch?v=EjUj0j_NIrQ
For additional information, visit: http://www.mhs.com/ortizpvat
