Glossary

Accreditation: The granting of recognition of a test or an examination, usually by an official body such as a government department, examinations board, etc.

Aggregate: To combine two or more related scores into one total score.

Alignment: The process of linking content and performance standards to assessment, instruction, and learning in classrooms. One typical alignment strategy is the step-by-step development of (a) content standards, (b) performance standards, (c) assessments, and (d) instruction for classroom learning.

Assessment grid: A set of assessment criteria presented in a tabular format.

Benchmark: A detailed, validated description of a specific level of student performance expected of students at particular ages, grades, or levels in their development. Benchmarks are often represented by samples of student work.

Bias: A test or item can be considered to be biased if one particular section of the candidate population is advantaged or disadvantaged by some feature of the test or item which is not relevant to what is being measured. Sources of bias may be connected with gender, age, culture, etc.

Borderline performance: A level of knowledge and skills that is just barely acceptable for entry into a performance level (e.g., B2-level).

Classical test theory (CTT): CTT refers to a body of statistical models for test data. The basic notion of CTT is that the observed score obtained when a person is administered a form of a test is the sum of a true-score component and an error component. See also Item Response Theory (IRT).
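This decomposition can be sketched with a minimal simulation (the function name and the error standard deviation below are illustrative assumptions, not part of any standard):

```python
import random

def observed_score(true_score, error_sd=2.0):
    # CTT: observed score = true-score component + error component.
    # The error is random with mean zero; error_sd is an assumed value.
    return true_score + random.gauss(0.0, error_sd)

# Because the error component has expectation zero, averaging many
# hypothetical administrations recovers the true score.
random.seed(0)
mean_observed = sum(observed_score(60.0) for _ in range(10_000)) / 10_000
```

Averaged over repeated administrations, `mean_observed` converges to the true score of 60.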

Compensatory strategy: A strategy that allows a high level of competence in one component of the assessment to compensate for a low level in the other components.

Conjunctive strategy: A strategy that requires attaining some predefined minimum level of competence for each one of the separate components to allow the final, summarized result to be judged as acceptable (sufficient).
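The contrast between the compensatory and conjunctive strategies can be sketched as follows (function names and thresholds are hypothetical):

```python
def compensatory_pass(component_scores, total_minimum):
    # Compensatory: only the aggregated total matters, so a strong
    # component can offset a weak one.
    return sum(component_scores) >= total_minimum

def conjunctive_pass(component_scores, component_minimums):
    # Conjunctive: every component must reach its own minimum.
    return all(s >= m for s, m in zip(component_scores, component_minimums))

# A candidate scoring 70 and 40 on two components passes under a
# compensatory total minimum of 100, but fails under a conjunctive
# minimum of 50 per component.
```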

Construct: A hypothesized ability or mental trait which cannot necessarily be directly observed or measured; for example, in language testing, listening ability.

Constructed response (CR): A form of written response to a test item that involves active production, rather than just choosing from a number of options.

Content standards: Broadly stated expectations of what students should know and be able to do in particular subjects and grade levels.

Content validity: A test is said to have content validity if the items or tasks of which it is made up constitute a representative sample of items or tasks for the area of knowledge or ability to be tested.

Cross-language standard setting: A method intended to verify that examinations in different languages are linked in a comparable way to the common standards.

Cross validation: The application of a scoring system derived in one sample to a different sample drawn from the same population.

Cut score (cut-off score): The minimum score a candidate has to achieve in order to be assigned to a given level or grade in a test or an examination.

Decision validity: The degree to which classification decisions will be identical in repeated testing with the same examinees.

Direct test: A test which measures the productive skills of speaking or writing, in which performance of the skill itself is directly measured.

Examinee-centred method: A standard setting method in which someone who knows examinees well provides a holistic assessment of the level of their language proficiency, for example a CEFR level.

External validation: Collecting evidence from independent sources which corroborate the results and conclusions of procedures used.

Familiarisation: Tasks to ensure that all those who will be involved in the process of relating an examination to the CEFR have an in-depth knowledge of it.

High stakes testing: A form of testing with important consequences for test takers.

Holistic judgment: A way of evaluating student work in which the score is based on an overall judgment of student performance rather than on specific separate criteria.

Illustrative samples (benchmarked samples): Examples of student performance that have been validated to represent a certain level of performance.

Indirect test: A test or task which attempts to measure the abilities underlying a language skill, rather than testing performance of the skill itself. An example is testing writing ability by requiring the candidate to mark structures used incorrectly in a text.

Internal validation: The process of establishing the accuracy and consistency of an assessment on the basis of the judgments made in the test.

Inter-rater reliability: The degree to which different raters agree in their assessment of candidates’ performance.
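A simple (if crude) index of inter-rater reliability is the proportion of candidates to whom two raters assign exactly the same score; a sketch (chance-corrected indices such as Cohen's kappa are preferable in practice):

```python
def exact_agreement(ratings_a, ratings_b):
    # Proportion of candidates given identical scores by two raters.
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)
```

For example, two raters agreeing on 3 of 4 candidates yield an agreement of 0.75.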

Intra-rater reliability: The degree to which the same rater judges the same performance similarly on different occasions.

Item Response Theory (IRT): A theoretical approach to relating student ability to test data; it focuses on the items, whereas classical test theory (CTT) focuses on the test scores.

Item difficulty: In classical test theory, the difficulty of an item is the proportion of candidates responding to it correctly. In IRT it is an estimate of the difficulty of an item, calculated independently of the population.
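The CTT facility value can be computed directly from scored responses (a sketch; the function name is illustrative):

```python
def item_facility(item_responses):
    # CTT item difficulty (facility): the proportion of candidates
    # answering correctly, with responses scored 1 (correct) or 0 (wrong).
    return sum(item_responses) / len(item_responses)
```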

Judge: Someone who assigns a score to a candidate’s performance in a test, using judgment to do so.

KR20: A measure of internal consistency developed by Kuder and Richardson and used to estimate test reliability.
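For dichotomously scored (0/1) items, KR-20 is k/(k-1) * (1 - sum(p*q)/var(total)), where p and q are each item's proportions correct and incorrect and var(total) is the variance of candidates' total scores; a sketch in plain Python:

```python
def kr20(responses):
    # responses: one list of 0/1 item scores per candidate,
    # all candidates having answered the same k items.
    n = len(responses)
    k = len(responses[0])
    # Facility p for each item; p * (1 - p) is the item variance.
    p = [sum(person[i] for person in responses) / n for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    # Population variance of the candidates' total scores
    # (sources differ on population vs sample variance; population
    # variance is assumed here).
    totals = [sum(person) for person in responses]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - sum_pq / variance)
```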

Logistic regression: A statistical technique that yields a formula for translating one or more pieces of information (e.g., a person’s test score) into the estimated probability of a specified event (e.g., a sample of the student’s work being judged as proficient).
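In the simplest case (one predictor), a fitted logistic regression turns a test score into a probability via the logistic function; a sketch with hypothetical, not fitted, coefficients:

```python
import math

def proficiency_probability(score, intercept=-6.0, slope=0.1):
    # Logistic model: P(proficient) = 1 / (1 + exp(-(a + b * score))).
    # The intercept and slope here are illustrative assumptions.
    return 1.0 / (1.0 + math.exp(-(intercept + slope * score)))
```

With these coefficients a score of 60 gives a probability of 0.5, and higher scores give higher probabilities.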

Low stakes testing: A form of testing with less important consequences for test takers.

The Manual: The document produced by the Council of Europe to provide guidance on linking tests and examinations to the CEFR.

Mastery: The indication that the student has met a set of criteria, defined in terms of well-defined domains of skills or knowledge.

Panel: A group of judges.

Panellist: A member of a group of judges.

Performance standards: Explicit definitions of what students must do to demonstrate proficiency at a specific level on the content standards.

Performance level descriptors (PLD): Descriptions of standards the students should have reached. The level descriptions in the CEFR are examples of PLDs.

Piloting: A preliminary study through which test developers try out tasks on a limited number of subjects in order to locate problems before launching a full-scale trial.

Pretesting: A stage in the development of test materials at which items are tried out with representative samples from the target population in order to determine their difficulty. Following statistical analysis, those items that are considered satisfactory can be used in live tests.

Procedural validation: Collecting evidence that proper procedures have been used during the different stages of standard setting.

Rater: A person who evaluates or judges student performance on an assessment against specific criteria.

Rating: The process of assigning a score to performance in a test through the exercise of judgement.

Response probability (RP): In standard setting, a mastery criterion: the probability of a correct response that is taken to define mastery of an item. It is often set at two-thirds (RP67), although some authors prefer to set it at 50% and others at 80%.

Specification: A stage in the linking process that deals with the content analysis of an examination or test in order to relate it to the CEFR from the point of view of coverage.

Test-centred methods: A set of methods in which judges estimate, for example, at what level a test taker can be expected to respond correctly to a set of items.

Test equating: The process of comparing the difficulty of two or more forms of a test, in order to establish their equivalence.

Test specifications: A description of the characteristics of an examination, including what is tested, how it is tested, details such as number and length of papers, item types used, etc.

Transparency: Implies openness, communication and accountability. It is an extension of the meaning used in the physical sense (cf. a transparent object can be seen through).
