Reliability and Validity of Play-Based Assessments of Motor and Cognitive Skills for Infants and Young Children: A Systematic Review

1 M.G. O'Grady, PT, MS, Rehabilitation and Movement Science Program, Virginia Commonwealth University, Richmond, Virginia.


2 S.C. Dusing, PT, PhD, PCS, Motor Development Laboratory, Department of Physical Therapy, Virginia Commonwealth University, and Department of Pediatrics, Children's Hospital of Richmond, Virginia Commonwealth University, PO Box 980224, Richmond, VA 23298-0224 (USA).

* Address all correspondence to Dr Dusing.

Both authors provided concept/idea/research design, writing, data collection and analysis, and project management. Mr O'Grady provided consultation (including review of the manuscript before submission).

Physical Therapy, Volume 95, Issue 1, 1 January 2015, Pages 25–38, https://doi.org/10.2522/ptj.20140111

01 January 2015 06 March 2014 14 August 2014 01 January 2015


Michael G. O'Grady, Stacey C. Dusing, Reliability and Validity of Play-Based Assessments of Motor and Cognitive Skills for Infants and Young Children: A Systematic Review, Physical Therapy, Volume 95, Issue 1, 1 January 2015, Pages 25–38, https://doi.org/10.2522/ptj.20140111

Background

Play is vital for development. Infants and children learn through play. Traditional standardized developmental tests measure whether a child performs individual skills within controlled environments. Play-based assessments can measure skill performance during natural, child-driven play.

Objective

The purpose of this study was to systematically review reliability, validity, and responsiveness of all play-based assessments that quantify motor and cognitive skills in children from birth to 36 months of age.

Data Sources

Studies were identified from a literature search using PubMed, ERIC, CINAHL, and PsycINFO databases and the reference lists of included papers.

Study Selection

Included studies investigated reliability, validity, or responsiveness of play-based assessments that measured motor and cognitive skills for children from birth to 36 months of age.

Data Extraction

Two reviewers independently screened 40 studies for eligibility and inclusion. The reviewers independently extracted reliability, validity, and responsiveness data. They examined measurement properties and methodological quality of the included studies.

Data Synthesis

Four current play-based assessment tools were identified in 8 included studies. Each play-based assessment tool measured motor and cognitive skills in a different way during play. Interrater reliability correlations ranged from .86 to .98 for motor development and from .23 to .90 for cognitive development. Test-retest reliability correlations ranged from .88 to .95 for motor development and from .45 to .91 for cognitive development. Structural validity correlations ranged from .62 to .90 for motor development and from .42 to .93 for cognitive development. One study assessed responsiveness to change in motor development.

Limitations

Most studies had small and poorly described samples. Lack of transparency in data management and statistical analysis was common.

Conclusions

Play-based assessments have potential to be reliable and valid tools to assess cognitive and motor skills, but higher-quality research is needed. Psychometric properties should be considered for each play-based assessment before it is used in clinical and research practice.

Play provides infants and young children with the ability to practice skills and support all domains of development: motor, cognitive, social-emotional, communication, and adaptive. 1–4 Play has been variably defined in the literature given different disciplines and reasons for assessing development through play. 5 In this systematic review, play is defined as a pleasurable, active, self-motivated developmental phenomenon 1, 6 by which infants and young children learn about the world through interactions with objects and people. 5, 7

Play fosters both motor and cognitive development. 1, 2, 7 Play is common to all infants, and it is a primary arena within which domain-specific and global aspects of development occur. 1, 8, 9 Early play helps to prepare infants and young children to learn in school. 10 Children learn through the repetition of behaviors during play within typical environments and routines. 11

Play is the basis for many developmental interventions used with children with disabilities. 12 Play, however, is often not a part of traditional standardized developmental tests used by pediatric physical therapists and other early intervention providers to determine the need for intervention or the efficacy of intervention. 13 Traditional standardized developmental assessments typically involve a child performing a specific task within a controlled environment that is outside the context of everyday routines. 13 Some assessments require the examiner to elicit behaviors by altering the context or moving the child. 14, 15 Behaviors assessed in this way are not authentic child-directed behaviors, and the child may not perform optimally. 16 Furthermore, traditional standardized developmental assessments are designed to determine whether a child can perform a specific skill, not whether the child performs the skills in his or her normal routine. 15

Play-based assessments are standardized measures designed to quantify changes in one or more of the 5 developmental domains during self-motivated, child-driven play. 14, 17, 18 Some literature suggests that play-based assessments may be an effective and efficient means of assessing a child's developmental level, 19 evaluating change over time, and evaluating the efficacy of intervention. 18, 20 Play-based assessments are often adjuncts to other assessment procedures, 21, 22 although some authors argue that they also can serve as a basis for discriminative decisions and planning. 15, 17, 18, 23 In this review, play-based assessment is differentiated from an assessment of play, which interprets the type of play in which a child is engaged relative to a hierarchical developmental theory of play. 18 Assessments of play are not discussed in this review.

Play-based assessments focus on child-directed activities. During play-based assessment, the child directs the interaction and experience, increasing the likelihood of observing behaviors that the child typically performs. 24 This assessment results in a rich description of a child's domain-specific strengths and weaknesses. 14 Using the arena of play provides the practitioner with not only the ability to assess current skills but also the added benefit of previewing emerging skills in a functional context. 2, 25 Play-based assessments add authenticity and contextual benefits to the assessment of motor and cognitive development because they measure objective behaviors during child-driven activities within a normal environment. This approach allows examination of cross-domain relationships by integrating findings. 24

Play-based assessments can be contrasted with traditional standardized assessments. First, play-based assessment takes place within a naturalistic environment and context, whereas traditional standardized developmental tests require specific responses to an examiner-provided stimulus. 14 Second, play-based assessments typically quantify if and how often a child performs specific types of skills during a naturalistic observation rather than just assessing if the child can perform the skill. 14, 21, 26 Third, these assessments are child-driven 14 rather than examiner-driven, giving the practitioner insight into the child's ability to explore and learn. 6 Fourth, play-based assessments can document limitations commonly seen in children with developmental delays such as decreased attention to toys, using fewer toys and less variety of active play skills, and being more passive during play. 6

Although the theoretical value of play-based assessments is clear, the reliability and validity of play-based assessments need to be considered before they are used in clinical practice or research. The first aim of this systematic review is to determine the interrater and test-retest reliability of play-based assessments of motor and cognitive skills for infants and children aged 0 to 36 months. The second aim is to identify the content and structural validity of play-based assessments of motor and cognitive skills for infants and children aged 0 to 36 months, as well as the responsiveness of these measures.

This article focuses on the assessment of infants and toddlers, from birth to 36 months of age, who, based on their age, could be eligible in the United States for early intervention services under Part C of the Individuals With Disabilities Education Improvement Act (IDEIA). 27 Play-based assessments allow for assessment in a variety of cultures and countries and at a variety of ages. As in many countries, the goal of providing intervention for young children in the United States is to support early development and improve readiness to learn in children with or at risk for developmental delays. Intervention programs with similar goals around the world may find play-based assessments an option for assessing the needs and progress of children if these tools are reliable and valid.

The results of this study provide information on the reliability, validity, and responsiveness of play-based assessments. This information may help to determine if play-based assessments can be used for research and clinical purposes. In addition, this information will help clinicians to determine which play-based assessments are best to supplement traditional standardized developmental tests that are currently used to evaluate the need for and efficacy of early developmental intervention services.

Method

Search criteria were developed to identify studies that met inclusion and exclusion criteria specified prior to the study. Studies were required to evaluate one or more of the following measurement properties of a play-based assessment of motor and cognitive skills: interrater reliability, test-retest reliability, structural validity, content validity, and responsiveness to change over time. The age range of participants had to fall fully or partially within 0 to 36 months. Participants could have a diagnosed disability or delay or could be developing typically. Studies were excluded if they did not include play-based assessment of motor and cognitive skills, did not include children from birth to 36 months of age, were not available in English, or were a review of previous research or theory without new data.

Data Sources and Searches

A literature search was performed using the PubMed interface from MEDLINE (late 1940s–May 2013), ERIC (1966–May 2013), CINAHL (1937–May 2013), and PsycINFO (1894–May 2013). Search terms were developed with the help of a research librarian using MeSH headings, key words, and phrases. Terms were purposefully broad to capture all publications that met the inclusion criteria for this systematic review. The full search strategy is described in the Appendix.

Study Selection

Consistent with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement, 28 results from the literature search were reviewed for duplicates prior to screening for inclusion. The title and abstract of all identified publications were screened using the inclusion and exclusion criteria. Any publication that could not be clearly excluded on the basis of its title and abstract was moved forward to full-text review. At that stage, 2 reviewers independently reviewed the full-text publication to determine eligibility for the systematic review. Any disagreement between the 2 reviewers about eligibility was resolved through discussion. The bibliographies of included papers also were reviewed by both authors to determine if additional studies warranted inclusion.

Data Extraction and Quality Assessment

The interrater and test-retest reliability, content and structural validity, and responsiveness data for each included study were extracted independently by each reviewer using data collection forms developed for this systematic review. Any discrepancy in the extracted reliability or validity data was discussed between reviewers, and a consensus was reached. No statistical analysis or meta-analysis was conducted given the limited number of included studies. A priori, a correlation ranging from .00 to .50 was considered weak, .50 to .75 was considered moderate, and .75 to 1.00 was considered to be strong. 29 The strength of the correlation presented as a measure of reliability or validity was used to categorize the degree of reliability and validity documented for each play-based assessment and for the group of play-based assessments. Therefore, the terms “weak,” “moderate,” and “strong” are used to describe the results for reliability and validity of each included paper.
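
The a priori cutoffs above can be written as a small helper. The following is a minimal sketch, not the authors' procedure: the function name `classify_correlation` is hypothetical, and because the stated ranges share the boundary values .50 and .75, the sketch assumes that a boundary value falls into the higher category.

```python
def classify_correlation(r: float) -> str:
    """Label a correlation coefficient as weak, moderate, or strong.

    Cutoffs follow the text: .00-.50 weak, .50-.75 moderate, .75-1.00 strong.
    ASSUMPTION: the boundary values .50 and .75 are assigned to the higher
    category, because the source ranges overlap at those points.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("expected a correlation between 0.00 and 1.00")
    if r < 0.50:
        return "weak"
    if r < 0.75:
        return "moderate"
    return "strong"


# Lowest and highest interrater correlations reported for cognitive development
for value in (0.23, 0.90):
    print(value, classify_correlation(value))
```

Under these assumptions, the lowest interrater correlation reported for cognitive development (.23) would be labeled weak and the highest (.90) would be labeled strong.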

The COnsensus-based Standards for the selection of health status Measurement INstruments (COSMIN) was used as a measure of methodological quality of the measurement properties. 30, 31 The 5 measurement properties assessed for this study were defined by the COSMIN. 32 Interrater reliability is a measure of whether different raters can score the same testing occasion and obtain the same score. 32 Test-retest reliability is the extent to which the scores for patients who have not changed are the same for repeated measurements over time. 32 Structural validity is the degree to which scores of a health-related instrument adequately reflect the same construct as a validated assessment. 32 Content validity is a judgment about whether the content of a test adequately reflects the construct to be measured. 32 Responsiveness is the ability of the measurement tool to measure change over time in the focal construct. 32

The COSMIN can be used to measure methodological quality with a 3-step process. First, the measurement properties assessed in the paper are identified. Second, reviewers score each measurement property. Each measurement property on the COSMIN has a rating box containing 5 to 18 individual items specific to that measurement property. Each item within the rating box is scored based on specific scoring criteria: from 1 to 4 possible answers representing excellent, good, fair, and poor quality for that item. An item is scored as excellent when there is adequate evidence provided for that item. When information is not provided but it is reasonable to assume information regarding that item, the item is rated as good. Fair indicates that methodological quality for the item is doubtful, whereas poor is scored when there is evidence that the methodological quality pertaining to a specific item is inadequate. For example, clear evidence of patient stability between test and retest in a study receives a score of excellent on that item in the rating box for the measurement property test-retest reliability. If it was unclear if patients were stable during the time between test and retest, however, that item is marked as fair. The third step to score each rating box of the COSMIN involves determining the overall rating of methodological quality of each measurement property. This overall rating is determined by the lowest score for all items in the rating box for that measurement property. 33 For example, if all responses in the test-retest rating box are excellent except for one judged to be fair, the quality of the test-retest reliability measurement of that paper is considered to be fair.
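
The third step, in which the lowest item score sets the overall rating, can be illustrated with a short sketch. The names `RATING_ORDER` and `overall_quality` are hypothetical and are not part of the COSMIN checklist; the sketch only assumes the 4 ordinal labels described above.

```python
# Ordinal labels from lowest to highest methodological quality
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def overall_quality(item_ratings: list[str]) -> str:
    """Return the overall rating for one measurement property.

    The overall rating is the lowest rating among all items in the
    rating box. An unknown label raises ValueError.
    """
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    return min(item_ratings, key=RATING_ORDER.index)


# Example from the text: every item excellent except one judged fair
test_retest_box = ["excellent", "excellent", "fair", "excellent"]
print(overall_quality(test_retest_box))  # prints "fair"
```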

Each author of this systematic review independently rated the methodological quality of each included study using the COSMIN. Any discrepancies in scoring between the raters that resulted in different measurement properties being scored, a change in the overall rating of methodological quality of any measurement property, or a discrepancy of 2 or more ordinal levels for any single item within a measurement property rating box were discussed, and a consensus was reached. The overall methodological quality for each measurement property included in a paper was recorded and is reported in this systematic review.
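
The 3 conditions that triggered discussion between raters can be expressed as a simple check. This is a hedged sketch only: the function name `needs_discussion` and the data layout (a dictionary that maps each measurement property to its list of item ratings, in the same item order for both raters) are assumptions made for illustration and do not describe the authors' actual data collection forms.

```python
LEVELS = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}


def needs_discussion(rater_a: dict[str, list[str]],
                     rater_b: dict[str, list[str]]) -> bool:
    """Return True if two raters' COSMIN scores require a consensus discussion."""
    # Condition 1: the raters scored different measurement properties
    if rater_a.keys() != rater_b.keys():
        return True
    for prop in rater_a:
        a = [LEVELS[label] for label in rater_a[prop]]
        b = [LEVELS[label] for label in rater_b[prop]]
        if len(a) != len(b):
            raise ValueError(f"item counts differ for {prop}")
        # Condition 2: the overall rating (lowest item score) differs
        if min(a) != min(b):
            return True
        # Condition 3: any single item differs by 2 or more ordinal levels
        if any(abs(x - y) >= 2 for x, y in zip(a, b)):
            return True
    return False


a = {"test-retest reliability": ["excellent", "poor"]}
b = {"test-retest reliability": ["fair", "poor"]}
print(needs_discussion(a, b))  # True: first item differs by 2 levels
```

A one-level difference on an item that leaves the overall rating unchanged would not be flagged by this check.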

Results

The titles and abstracts of 2,133 studies were screened for possible inclusion. Forty studies could not be excluded during screening and were reviewed in full text for eligibility. Eight of these studies matched the inclusion criteria and were included in the systematic review, whereas 32 studies were excluded (Figure).

Figure. PRISMA diagram.

Studies including 4 separate play-based assessments currently available for commercial use were included in this systematic review: Play in Early Childhood Evaluation System (PIECES); Transdisciplinary Play-Based Assessment, 2nd edition (TPBA-2); Assessment, Evaluation, and Programming System, 2nd edition (AEPS); and the Individual Growth and Development Indicators (IGDI). Related assessments or precursors to these play-based assessments also were identified in the literature. The Play-Based Assessment (PBA) 16 was the initial form of the PIECES. 23 The PBA also was the cognitive portion of the Transdisciplinary Play-Based Assessment (TPBA), 14 which was never tested for reliability or validity with a young population. The TPBA is the previous version of the TPBA-2. 19 Psychometric properties of the Evaluation and Programming System for Infants and Young Children (EPS-I) 34 were presented by Bailey and Bricker 35 and Bricker et al. 36 The EPS-I is the predecessor of the AEPS. 37 Part of the AEPS was used for the experimental Assessment, Evaluation, and Programming System for Eligibility (AEPS:E) as reviewed herein. 20 Two other play-based assessments, a general outcome measure of growth in movement for infants and toddlers 21 and the Early Problem Solving Indicator (EPSI), 26 met the inclusion criteria. These last 2 assessments are predecessors to the movement and cognitive sections of the IGDI 38: the Early Movement Indicator (EMI-IGDI) and the Early Problem Solving Indicator (EPSI-IGDI), respectively (Table 1).

Table 1. Description of Play-Based Assessments Included in This Systematic Review a

| Play-Based Assessment | Motor/Cognitive | No. of Items | Item Types | Places and Items for Assessment | Purpose of the Assessment |
| --- | --- | --- | --- | --- | --- |
| PBA 16 (prototype) | Cognitive | Not available | Cognitive domain of TPBA (similar in structure to TPBA-2) | Neonatal intensive care unit follow-up clinic at a large hospital | Assessing cognition to determine eligibility for early intervention services |
| PIECES 23,47 | Cognitive | 13 core play behaviors, 86 total items; coded via observation | Cognitive items grouped into subdomains along a developmental continuum consisting of exploratory and pretend play behaviors | Can be conducted any place in which the child feels comfortable; behaviors during spontaneous unstructured play without adult guidance for a minimum of 30 min; toys arranged in the testing room according to general themes such as kitchen, blocks, etc | Determine highest level of play behavior so that educators can develop interventions to facilitate higher levels of play |
| TPBA-2 19,48 | Motor and cognitive | 118 items across all domains and subcategories | 4 developmental domains: cognitive, motor, communication, and social-emotional; play skills listed in a developmental sequence to attain age equivalence scores for age-equivalent comparisons with norm-referenced measures | Informal play setting with manipulatives, representational toys, art materials, construction and play objects, and gross motor equipment; parent and professionals from 3 or more different backgrounds, usually speech-language pathologist, occupational therapist, physical therapist, teacher, and psychologist | Assess developmental skills and process, as well as interaction patterns and learning styles |
| EPS-I 35,36 (prototype) | Motor and cognitive | 164 items, criterion-referenced and curriculum-based | 6 domains (including cognitive and 2 motor domains: gross and fine motor); each item scored as pass, inconsistent, or fail; functional skills | Classroom observation during routine activities, including play, activity groups, and snack | Develop systematic methods to plan and evaluate early intervention practices in order to monitor and demonstrate the efficacy of intervention |
| AEPS:E 20,49 (experimental, part of currently available AEPS) | Motor and cognitive | AEPS contains 249 items across 6 developmental areas | Curriculum-based; 5 activities using AEPS items with scripted procedures and standard materials (with some flexibility to accommodate individual child routines); each item scored as does not pass, inconsistent performance, or passes consistently | Home, community-based setting with familiar activities and materials and people; parent and caregiver involvement | Authentic assessment of observed behaviors/skills that links assessment outcomes to goal development and planning; AEPS is appropriate for a broad range of needs and diagnoses in children aged 1 mo to 3 y |
| EMI-IGDI 21,38 | Motor | 5 key skill elements: position transition, grounded locomotion, vertical locomotion, throw/roll, and catch/trap | Key skill elements represent postural control, locomotion, and object control; each skill is coded for frequency; prespecified toys presented for exactly 6 min | Typical environment; administered by any early intervention professional trained with IGDIs | Monitoring individual growth and making intervention decisions |
| EPSI-IGDI 26,38 | Cognitive | 4 key skill elements: look, explore, function, and solution | Key skill elements represent visual, object exploration, and problem solving; each skill is coded for frequency; 3 prespecified toys presented, each for exactly 2 min | Typical environment; administered by any early intervention professional trained with IGDIs | Monitoring individual growth and making intervention decisions |

a Some of the tests have 2 references, as 2 papers were published documenting the measurement properties of interest. PBA=Play-Based Assessment; PIECES=Play in Early Childhood Evaluation System; TPBA=Transdisciplinary Play-Based Assessment; TPBA-2=Transdisciplinary Play-Based Assessment, 2nd edition; EPS-I=Evaluation and Programming System for Infants and Young Children; AEPS=Assessment, Evaluation, and Programming System, 2nd edition; AEPS:E=Assessment, Evaluation, and Programming System for Eligibility; EMI-IGDI=Early Movement Indicator–Individual Growth and Development Indicators, a general outcome measure of movement; EPSI-IGDI=Early Problem Solving Indicator–Individual Growth and Development Indicators. The EMI-IGDI and EPSI-IGDI are each independent parts of the larger assessment tool, the Individual Growth and Development Indicators (IGDI).

Description of Play-Based Assessments Included in This Systematic Review a

Play-Based Assessments . Motor/Cognitive . No. of Items . Item Types . Places and Items for Assessment . Purpose of the Assessment .
PBA 16 prototypeCognitiveNot availableCognitive domain of TPBA (similar in structure to TPBA-2)Neonatal intensive care unit follow-up clinic at a large hospitalAssessing cognition to determine eligibility for early intervention services
PIECES 23,47 Cognitive13 core play behaviors, 86 total items; coded via observationCognitive items grouped into subdomains along a developmental continuum consisting of exploratory and pretend play behaviorsCan be conducted any place in which the child feels comfortable; behaviors during spontaneous unstructured play without adult guidance for a minimum of 30 min; toys arranged in the testing room according to general themes such as kitchen, blocks, etcDetermine highest level of play behavior so that educators can develop interventions to facilitate higher levels of play
TPBA-2 19,48 Motor and cognitive118 items across all domains and subcategories4 developmental domains: cognitive, motor, communication, and social-emotional; play skills listed in a developmental sequence to attain age equivalence scores for age-equivalent comparisons with norm-referenced measuresInformal play setting with manipulatives, representational toys, art materials, construction and play objects, and gross motor equipment
Parent and professionals from 3 or more different backgrounds–usually speech-language pathologist, occupational therapist, physical therapist, teacher, and psychologist
Assess developmental skills and process, as well as interaction patterns and learning styles
EPS-I 35,36 prototypeMotor and cognitive164 items, criterion-referenced and curriculum-based6 domains (including cognitive and 2 motor domains: gross and fine motor); each item scored as pass, inconsistent, or fail; functional skillsClassroom observation during routine activities, including play, activity groups, and snackDevelop systematic methods to plan and evaluate early intervention practices in order to monitor and demonstrate the efficacy of intervention
AEPS:E, 20,49 experimental, part of currently available AEPSMotor and cognitiveAEPS contains 249 items across 6 developmental areasCurriculum-based; 5 activities using AEPS items with scripted procedures and standard materials (with some flexibility to accommodate individual child routines); each item scored as does not pass, inconsistent performance, or passes consistentlyHome, community- based setting with familiar activities and materials and people; parent and caregiver involvementAuthentic assessment of observed behaviors/skills that links assessment outcomes to goal development and planning; AEPS is appropriate for a broad range of needs and diagnoses in children aged 1 mo to 3 y
EMI-IGDI 21,38 Motor5 key skill elements: position transition, grounded locomotion, vertical locomotion, throw/roll, and catch/ trapKey skill elements represent postural control, locomotion, and object control; each skill is coded for frequency; prespecified toys presented for exactly 6 minTypical environment; administered by any early intervention professional trained with IGDIsMonitoring individual growth and making intervention decisions
EPSI-IGDI 26,38 Cognitive4 key skill elements: look, explore, function, and solutionKey skill elements represent visual, object exploration, and problem solving; each skill is coded for frequency; 3 prespecified toys presented, each for exactly 2 minTypical environment; administered by any early intervention professional trained with IGDIsMonitoring individual growth and making intervention decisions
| Play-Based Assessments | Motor/Cognitive | No. of Items | Item Types | Places and Items for Assessment | Purpose of the Assessment |
| --- | --- | --- | --- | --- | --- |
| PBA 16 prototype | Cognitive | Not available | Cognitive domain of TPBA (similar in structure to TPBA-2) | Neonatal intensive care unit follow-up clinic at a large hospital | Assessing cognition to determine eligibility for early intervention services |
| PIECES 23,47 | Cognitive | 13 core play behaviors, 86 total items; coded via observation | Cognitive items grouped into subdomains along a developmental continuum consisting of exploratory and pretend play behaviors | Can be conducted any place in which the child feels comfortable; behaviors during spontaneous unstructured play without adult guidance for a minimum of 30 min; toys arranged in the testing room according to general themes such as kitchen, blocks, etc | Determine highest level of play behavior so that educators can develop interventions to facilitate higher levels of play |
| TPBA-2 19,48 | Motor and cognitive | 118 items across all domains and subcategories | 4 developmental domains: cognitive, motor, communication, and social-emotional; play skills listed in a developmental sequence to attain age equivalence scores for age-equivalent comparisons with norm-referenced measures | Informal play setting with manipulatives, representational toys, art materials, construction and play objects, and gross motor equipment; parent and professionals from 3 or more different backgrounds, usually speech-language pathologist, occupational therapist, physical therapist, teacher, and psychologist | Assess developmental skills and process, as well as interaction patterns and learning styles |
| EPS-I 35,36 prototype | Motor and cognitive | 164 items, criterion-referenced and curriculum-based | 6 domains (including cognitive and 2 motor domains: gross and fine motor); each item scored as pass, inconsistent, or fail; functional skills | Classroom observation during routine activities, including play, activity groups, and snack | Develop systematic methods to plan and evaluate early intervention practices in order to monitor and demonstrate the efficacy of intervention |
| AEPS:E, 20,49 experimental, part of currently available AEPS | Motor and cognitive | AEPS contains 249 items across 6 developmental areas | Curriculum-based; 5 activities using AEPS items with scripted procedures and standard materials (with some flexibility to accommodate individual child routines); each item scored as does not pass, inconsistent performance, or passes consistently | Home, community-based setting with familiar activities and materials and people; parent and caregiver involvement | Authentic assessment of observed behaviors/skills that links assessment outcomes to goal development and planning; AEPS is appropriate for a broad range of needs and diagnoses in children aged 1 mo to 3 y |
| EMI-IGDI 21,38 | Motor | 5 key skill elements: position transition, grounded locomotion, vertical locomotion, throw/roll, and catch/trap | Key skill elements represent postural control, locomotion, and object control; each skill is coded for frequency; prespecified toys presented for exactly 6 min | Typical environment; administered by any early intervention professional trained with IGDIs | Monitoring individual growth and making intervention decisions |
| EPSI-IGDI 26,38 | Cognitive | 4 key skill elements: look, explore, function, and solution | Key skill elements represent visual, object exploration, and problem solving; each skill is coded for frequency; 3 prespecified toys presented, each for exactly 2 min | Typical environment; administered by any early intervention professional trained with IGDIs | Monitoring individual growth and making intervention decisions |

Some of the tests have 2 references, as 2 papers were published documenting the measurement properties of interest. PBA=Play-Based Assessment; PIECES=Play in Early Childhood Evaluation System; TPBA=Transdisciplinary Play-Based Assessment; TPBA-2=Transdisciplinary Play-Based Assessment, 2nd edition; EPS-I=Evaluation and Programming System for Infants and Young Children; AEPS=Assessment, Evaluation, and Programming System, 2nd edition; AEPS:E=Assessment, Evaluation, and Programming System for Eligibility; EMI-IGDI=Early Movement Indicator–Individual Growth and Development Indicators, a general outcome measure of movement; EPSI-IGDI=Early Problem Solving Indicator–Individual Growth and Development Indicators. The EMI-IGDI and EPSI-IGDI are each independent parts of the larger assessment tool, the Individual Growth and Development Indicators (IGDI).

Interrater Reliability

Interrater reliability was measured in 5 studies. 20, 21, 26, 35, 36 The Pearson correlation coefficient for interrater reliability of motor assessments ranged from .86 to .98 ( Tab. 2). 20, 21, 35, 36 Interrater reliability of cognitive assessments ranged from .23 to .90 ( Tab. 3). 20, 26, 35, 36 One study of cognition had interrater reliability coefficients for individual skills but not for the aggregate of cognitive behaviors displayed. 26

Table 2. Play-Based Assessments and Studies With Motor Psychometric Properties ᵃ

| Play-Based Assessments | Study | Sample Characteristics: Sample Size, Typical/Atypical Development, Age (mo) at Beginning of Study: X̅ (SD) [Range] | Interrater Reliability | Test-Retest Reliability | Structural Validity |
| --- | --- | --- | --- | --- | --- |
| TPBA-2 | Linas 19 | N=19; 12 typical, 7 atypical development; Age: 23.05 (10.36) [0–36] | NA | NA | BSID-3 motor, r=.825 |
| EPS-I | Bailey and Bricker 35 | N=32; 10 typical, 22 atypical development; Age: typical: 29.7 (7.5) [20–39]; atypical: 30.7 (4.4) [24–40] | r=.95 (total group); r=.95 (atypical development) | r=.93 (total group); r=.94 (atypical development) | Gesell scale gross motor test (developmental quotient): r=.79 (total group); r=.89 (atypical development) |
| EPS-I | Bricker et al 36 | N=335; 90 typical, 245 atypical development; Age: 2–72 (majority less than 48) | r=.96 | r=.95 | BSID motor age, r=.88 |
| AEPS:E | Macy et al 20 | N=68; 35 typical, 33 atypical development (receiving early intervention services); Age: 18–36 | Gross motor, r=.86 | NA | Gesell scale gross motor test, r=.62; BDI, r=.65 |
| EMI-IGDI | Greenwood et al 21 | N=29; 24 typical, 5 atypical development; Age: 15.3 (9.6) [1–34] | r=.98 | r=.88 | PDMS-2 locomotor, r=.90, .86; Stationary, r=.80, .77; CAMS-GM, r=.85, .87 |
| Range | | N=19–335 typical and atypical development | r=.86–.98 | r=.88–.95 | r=.62 (Gesell scale)–.90 (PDMS-2) |

ᵃ TPBA-2=Transdisciplinary Play-Based Assessment, 2nd edition; EPS-I=Evaluation and Programming System for Infants and Young Children; AEPS:E=Assessment, Evaluation, and Programming System for Eligibility; EMI-IGDI=Early Movement Indicator–Individual Growth and Development Indicators, a general outcome measure of movement; Gesell scale=Revised Gesell and Amatruda Developmental and Neurologic Examination; BSID=Bayley Scales of Infant Development; BSID-3=Bayley Scales of Infant and Toddler Development, 3rd edition; BDI=Battelle Developmental Inventory; PDMS-2=Peabody Developmental Motor Scales-2; CAMS-GM=Caregiver Assessment of Movement Skills–Gross Motor; NA=not applicable; r=Pearson product moment correlation coefficient.

Table 3. Play-Based Assessments and Studies With Cognitive Psychometric Properties ᵃ

| Play-Based Assessments | Study | Sample Characteristics: Sample Size, Typical/Atypical Development, Age (mo) at Beginning of Study: X̅ (SD) [Range] | Interrater Reliability | Test-Retest Reliability | Structural Validity |
| --- | --- | --- | --- | --- | --- |
| PBA | Kelly-Vance et al 16 | N=38; 31 typical, 7 atypical development; Age: 24 mo 15 d [23 mo 10 d–27 mo 26 d] | NA | NA | BSID-2 MDI, r=.746 |
| PIECES | Kelly-Vance and Ryalls 23 | N=32; 25 typical, 7 atypical development; Age: typical: 32.44 [19–46]; atypical: 37.57 [22–52] | NA | Typically developing, r=.48; Atypically developing, r=.58 | NA |
| TPBA-2 | Linas 19 | N=19; 12 typical, 7 atypical development; Age: 23.05 (10.36) [0–36] | NA | NA | BSID-3 cognitive, r=.91 |
| EPS-I | Bailey and Bricker 35 | N=32; 10 typical, 22 atypical development; Age: typical: 29.7 (7.5) [20–39]; atypical: 30.7 (4.4) [24–40] | r=.23 (total group); r=.32 (atypical development) | r=.46 (total group); r=.45 (atypical development) | NA |
| EPS-I | Bricker et al 36 | N=335; 90 typical, 245 atypical development; Age: 2–72 (majority less than 48) | r=.90 | r=.91 | BSID mental age, r=.93 |
| AEPS:E | Macy et al 20 | N=68; 35 typical, 33 atypical development (receiving early intervention services); Age: 18–36 | r=.88 | NA | BDI, r=.65 |
| EPSI-IGDI | Greenwood et al 26 | N=28; 23 typical, 5 atypical development; Age: 31.4 (8.0) [14.6–46.4] | r=.70–.99 (4 individual skills) | r=.88 | BSID-2 MDI, r=.42 |
| Range | | N=19–68 typical and atypical development | r=.23–.90 (overall interrater reliability cannot be determined for Greenwood et al 26) | r=.45–.91 | r=.42 (BSID-2 MDI)–.93 (BSID mental age) |

ᵃ PBA=Play-Based Assessment; PIECES=Play in Early Childhood Evaluation System; TPBA-2=Transdisciplinary Play-Based Assessment, 2nd edition; EPS-I=Evaluation and Programming System for Infants and Young Children; AEPS:E=Assessment, Evaluation, and Programming System for Eligibility; EMI-IGDI=Early Movement Indicator–Individual Growth and Development Indicators, a general outcome measure of movement; EPSI-IGDI=Early Problem Solving Indicator–Individual Growth and Development Indicators (the EMI-IGDI and EPSI-IGDI are each independent parts of the larger assessment tool, the Individual Growth and Development Indicators [IGDI]); NA=not applicable; BSID=Bayley Scales of Infant Development; BSID-2=Bayley Scales of Infant Development, 2nd edition; BSID-3=Bayley Scales of Infant and Toddler Development, 3rd edition; MDI=Mental Development Index; BDI=Battelle Developmental Inventory; r=Pearson product moment correlation coefficient.

The methodological quality in each of these studies was rated fair except for the study by Bailey and Bricker, 35 which was poor ( Tab. 4). Main reasons for fair ratings included use of a Pearson correlation coefficient rather than an intraclass correlation coefficient (ICC) 20, 21, 26, 36 and missing items from the sample. 26, 36 The study by Bailey and Bricker 35 was rated poor based on the small sample size and major flaws in the study design, including that items not observed by one or both observers were omitted from analysis. It is possible that one observer missed several test items that actually occurred. This omission would have reduced the variability of the sample and artificially inflated the correlation coefficient.
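The preference for an ICC over a Pearson correlation coefficient can be illustrated with a minimal sketch. The scores below are hypothetical and are not taken from any included study; the function names `pearson_r` and `icc_2_1` are illustrative. The sketch assumes a two-way random-effects, absolute-agreement, single-rater ICC (often written ICC[2,1]) and a second rater who scores every child 5 points higher than the first. Pearson r treats this systematic disagreement as perfect reliability, whereas the ICC does not.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson product moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def icc_2_1(ratings):
    """Two-way random-effects, absolute-agreement, single-rater ICC.

    ratings: one row per child, one column per rater.
    """
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between children
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical scores: rater B is consistently 5 points above rater A.
rater_a = [10, 12, 15, 18, 20, 23]
rater_b = [score + 5 for score in rater_a]

r = pearson_r(rater_a, rater_b)
icc = icc_2_1(list(zip(rater_a, rater_b)))
print(f"Pearson r = {r:.2f}, ICC(2,1) = {icc:.2f}")
```

In this constructed case the Pearson coefficient is essentially 1 while the ICC is clearly lower, because the ICC counts the constant offset between raters as disagreement.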

Table 4. Methodological Quality of Measurement Properties Using the COSMIN for All Included Studies ᵃ

| Study | Play-Based Assessment | Interrater Reliability | Test-Retest Reliability | Structural Validity | Content Validity | Responsiveness |
| --- | --- | --- | --- | --- | --- | --- |
| Greenwood et al 21 | EMI-IGDI | Fair | Poor | Fair | Good | Poor |
| Greenwood et al 26 | EPSI-IGDI | Fair | Fair | Fair | | |
| Macy et al 20 | AEPS:E | Fair | | Poor | | |
| Bricker et al 36 | EPS-I | Fair | Fair | Fair | | |
| Bailey and Bricker 35 | EPS-I | Poor | Poor | Poor | | |
| Kelly-Vance et al 16 | PBA | | | Fair | | |
| Kelly-Vance and Ryalls 23 | PIECES | | Fair | | | |
| Linas 19 | TPBA-2 | | | Poor | | |

ᵃ COSMIN=Consensus-based Standards for the selection of health status Measurement Instruments; PBA=Play-Based Assessment; PIECES=Play in Early Childhood Evaluation System; TPBA-2=Transdisciplinary Play-Based Assessment, 2nd edition; EPS-I=Evaluation and Programming System for Infants and Young Children; AEPS:E=Assessment, Evaluation, and Programming System for Eligibility; EMI-IGDI=Early Movement Indicator–Individual Growth and Development Indicators, a general outcome measure of movement; EPSI-IGDI=Early Problem Solving Indicator–Individual Growth and Development Indicators. The EMI-IGDI and EPSI-IGDI are each independent parts of the larger assessment tool, the Individual Growth and Development Indicators (IGDI).

Test-Retest Reliability

Test-retest reliability was assessed in 5 studies. 21, 23, 26, 35, 36 Test-retest reliability of motor assessments had a range of Pearson correlation coefficients of .88 to .95 ( Tab. 2), 21, 35, 36 and cognitive test-retest reliability had a range of Pearson correlation coefficients of .45 to .91 ( Tab. 3). 23, 26, 35, 36

Three of the studies 23, 26, 36 had fair methodological quality ratings, whereas 2 studies 21, 35 were rated poor (Tab. 4). Reasons for low ratings included use of a Pearson correlation coefficient rather than an ICC, 21, 26, 36 small sample size, 23 and important flaws in the study design, 35 including different observers for the first and second test observations, different situations between observations, and omission from the test of items not scored by either observer.

Structural Validity

Structural validity was assessed in 7 studies by comparing the scores on a play-based assessment with scores on traditional standardized developmental tests. 16, 19–21, 26, 35, 36 Three of the studies had both motor and cognitive components (Tabs. 2 and 3). 19, 20, 36 Two studies assessed solely motor skills during play (Tab. 2), 21, 35 and 2 studies were solely cognitive assessments (Tab. 3). 16, 26 The Pearson correlations between the play-based assessments of motor skills and validated traditional standardized developmental tests of motor skills ranged from .62 to .90. Correlations between the play-based assessments of cognitive skills and validated traditional standardized developmental tests ranged from .42 to .93. Several different validated traditional standardized developmental tests of motor and cognition were used as comparisons (Tabs. 2 and 3).

Methodological quality for structural validity was fair to poor ( Tab. 4). Four studies earned fair ratings. All of these studies had missing items, and the methods used for handling missing items were unclear. 16, 21, 26, 36 Two of these studies also had methodological flaws. 16, 36 In one of these studies, 16 the age equivalents on the play-based assessment were converted to standard scores for comparison with the standard scores on the Bayley Scales of Infant Development, second edition (BSID-2). 39 This comparison has not been validated as statistically sound. The other study 36 was validated with 2 traditional standardized developmental tests: the Revised Gesell and Amatruda Developmental and Neurologic Examination (Gesell scale) and the Bayley Scales of Infant Development (BSID). 40, 41 The disability levels and ages of each of the samples were unclear. The sample compared with the BSID also was small. The other 3 studies that assessed structural validity were marked as poor due to a small sample size. 19, 20, 35

Content Validity

Content validity of motor skills was assessed in one study using the EMI-IGDI. 21 The frequency of movements increased significantly among the 3 age cohorts (3–12 months, 13–24 months, and 25–36 months) during the 45-week study. Methodological quality was rated good in this study, based on a moderate sample size (Tab. 4). 21

Responsiveness

One study 21 measured responsiveness of a play-based assessment indirectly. This study assessed only motor skills in a sample of mostly children with typical development (3–36 months of age) on a play-based assessment (ie, the EMI-IGDI) and the Peabody Developmental Motor Scale–2 (PDMS-2). The PDMS-2 locomotor and stationary subtests were responsive to change, with a statistically significant increase in raw score between 2 time points that were 45 weeks apart. A similar comparison was not made for the data from the EMI-IGDI. The movement rate on the EMI-IGDI, however, was correlated with the PDMS-2 locomotion scale at each of the 2 time points. The Pearson correlation coefficient was .77 at time 1 and .90 at a time point 45 weeks later. Methodological quality was poor due to the small sample size ( Tab. 4).

Discussion

The results of this systematic review indicate that Pearson r values for interrater and test-retest reliability of play-based assessments ranged from .23 to .98 and that Pearson r values for structural validity of play-based assessments ranged from .42 to .93. As a group, both reliability and validity of play-based assessments are inconsistent. The methodological quality of measurement properties among the studies contained in this systematic review is generally poor to fair, 33 with only one study having a good quality rating. With only 1 or 2 studies of reliability or validity on each play-based assessment tool and the poor to fair methodological quality of the studies, it was difficult to draw conclusions about any individual assessment or the group of play-based assessments as a whole. Therefore, reliability and validity for each play-based assessment need to be considered carefully before research or clinical application.

Interrater reliability of play-based assessments for both motor and cognitive skills was generally strong. Only one study, by Bailey and Bricker, 35 had a weak interrater reliability correlation (r=.23 for cognitive skills in the total group); the majority of the studies had interrater reliability correlations of r≥.86. These interrater reliability findings indicate that the definitions of terms and scoring used in the assessments are clear to raters. Of the tests reviewed for both motor and cognitive skills, interrater reliability findings are highest for the AEPS:E 20 and its predecessor, the EPS-I. 35, 36 The assessment with the best interrater reliability for motor skills is the EMI-IGDI. 21 Traditional standardized developmental tests of motor and cognitive development, such as the PDMS-2 42 and the BSID-2, 39 have similar strong interrater reliability. 43

Test-retest reliability scores varied by construct measured. Test-retest reliability was strong for all 3 studies of motor development. 21, 35, 36 Two of the studies used the same test (EPS-I) with different age groups. 35, 36 The EPS-I measured motor tasks using a criterion-referenced, curriculum-based assessment with a modified developmental checklist. The fact that environments, objects, and checklist questions were controlled may have improved reliability. The other play-based assessment of motor skills, the EMI-IGDI, measured skills in a longitudinal fashion with 3 to 8 measurements per child, separated by at least 3 weeks. 21 The researchers used a split-half reliability method in which they averaged the odd trials and the even trials before comparing the average of the odd and even trials. Although this is an acceptable way of measuring test-retest reliability, it reduced variability, which otherwise may have resulted in lower test-retest reliability. The age of the children did not affect the reliability of either the EPS-I or the EMI-IGDI. The Bayley Scales of Infant and Toddler Development, 3rd edition (BSID-3) and the PDMS-2 have similar strong test-retest reliability for motor skills. 44, 45
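The odd-even split-half procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code of the EMI-IGDI study: the movement rates are hypothetical, the function names are illustrative, and each child is assumed to contribute a short series of sessions in chronological order.

```python
from statistics import mean


def pearson_r(x, y):
    """Pearson product moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(sessions_by_child):
    """Correlate each child's mean of odd-numbered sessions with the
    mean of even-numbered sessions (sessions listed in order)."""
    odd_means, even_means = [], []
    for sessions in sessions_by_child:
        odd_means.append(mean(sessions[0::2]))   # 1st, 3rd, 5th, ... session
        even_means.append(mean(sessions[1::2]))  # 2nd, 4th, 6th, ... session
    return pearson_r(odd_means, even_means)


# Hypothetical movement rates, one list per child, 3 to 5 sessions each.
children = [
    [2.0, 2.5, 2.2, 2.8],
    [5.1, 4.6, 5.5, 5.0],
    [8.0, 9.1, 8.4, 8.8, 9.0],
    [3.3, 3.0, 3.9],
    [6.5, 7.2, 6.9, 7.0],
]

r_split_half = split_half_reliability(children)
print(f"Split-half r = {r_split_half:.2f}")
```

Because each half is an average over several sessions, session-to-session fluctuation within a child is smoothed out before the correlation is computed, which is why this method tends to yield higher coefficients than correlating two single sessions.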

Test-retest reliability was lower for cognition than for motor skills. This finding was evident with 2 tests, the PIECES and the EPS-I for children with and without disabilities. 23, 35, 36 The reliability of the PIECES was assessed using a strict test-retest method (1–3 weeks between assessments) of the children's most advanced cognitive skill level used during play. 23 Test-retest reliability correlations of the PIECES were similar to those of the initial study with the EPS-I. 35 The test-retest reliability of the EPS-I, however, varied substantially between the 2 studies that evaluated this measure. 35, 36 Two different methods were used to assess test-retest reliability, which might account for this discrepancy. Bailey and Bricker 35 used different observers and different situations during test-retest measurements, which expanded the opportunity for error. In the study during which test-retest reliability was stronger, several items were not tested due to issues of privacy during videotaping (self-care, dressing), and several items from the gross motor scale were omitted due to constraints of videotaping. 36 This approach creates a smaller pool of scored behaviors, which may not represent the reliability of the test as a whole. In light of these contrasting findings, we suggest that the complete EPS-I test-retest reliability cannot be identified. The EPSI-IGDI 26 had strong test-retest reliability using the split-half reliability method. Given the moderate or unclear results of the other play-based assessments of cognition, the EPSI-IGDI has the best test-retest reliability, although it must be noted that single-session test-retest studies of the EPSI-IGDI (no split-half reliability) may show lower reliability due to changes in play behaviors that commonly occur from one session to the next.

Structural validity of play-based assessments that measure motor skills ranged from moderate to strong compared with traditional standardized developmental tests. The lowest structural validity was found for the AEPS:E 20 compared with the Gesell scale's gross motor portion. 40 All other play-based assessments had strong structural validity (correlations greater than .76). The EPS-I, an earlier version of the AEPS:E, had a strong correlation with the Gesell scale's gross motor test. 35 The EPS-I items are arranged in a hierarchical developmental progression, similar to the neuromaturational construct that the Gesell scale tests. The AEPS:E 20 uses different standardization procedures, activities, and materials appropriate for toddlers, which are not as strongly aligned with the hierarchical model of development. Although the lack of hierarchical examination in the AEPS:E is more consistent with current theoretical approaches, it reduces the relationship between the AEPS:E and the Gesell scale. Comparison with a different traditional standardized developmental test might have increased the structural validity. The TPBA-2 motor section had a strong correlation compared with the BSID-3, 19 and the EMI-IGDI likewise had a strong correlation compared with the PDMS-2. 21 The BSID-3 and PDMS-2 assess developmental constructs similar to play-based measures, whereas the Gesell does not. There is no published study regarding the structural validity of the AEPS:E motor portion compared with a more current standardized test of motor development than the Gesell scale. Therefore, we suggest that the TPBA-2 and EMI-IGDI are the best play-based assessments of motor skills to assess a construct similar to traditional standardized tests of motor development.

Structural validity of play-based assessments of cognitive skills was generally lower than that of play-based assessments of motor skills. Structural validity of the AEPS:E was moderate compared with the Battelle Developmental Inventory. 20 It is interesting to note that the correlation between the AEPS:E and the Battelle Developmental Inventory was lower in children older than 24 months. We hypothesize that this finding may have been due to the fact that the older children did not display the full range of their cognitive skills during a play session with a limited set of toys and space in contrast to a traditional standardized developmental assessment, which tests the child on specific skills of all difficulty levels. Structural validity was weak to moderate for the EPSI-IGDI 26 and the PBA 16 compared with the BSID-2. Child-driven free play tends to decrease validity because the child may or may not show his or her full repertoire of cognitive skills during a given test session. Validity was strong when more structure was part of the play-based assessment such as in the EPS-I. 36

The methodological quality of measurement properties as defined by COSMIN 32 in each of the studies in this systematic review was poor to fair, with the exception of content validity in one study that was rated as good. One measurement property for 3 studies was downgraded due to small sample size. 19, 23, 35 Other methodological issues for these studies included no evidence of patient stability on test-retest, 23 methodological flaws in study design, 19 and unclear handling of missing data. 35 Use of a Pearson correlation coefficient instead of an ICC for reliability reduced the methodological quality rating for all included studies. Only one study, however, had a reliability score that would have been upgraded had the ICC been used. 20 Although no one methodological issue affected all the studies, each included study had some methodological problems. Future studies that adhere to rigorous methods will provide more detailed information about using play-based assessments for research and clinical measurement.
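To illustrate why the choice between a Pearson correlation coefficient and an ICC matters for test-retest reliability, the following is a minimal sketch in Python. The scores are invented for illustration and do not come from any study in this review, and the ICC form used here (a two-way random-effects, absolute-agreement, single-measure ICC) is an assumption, because the appropriate form depends on the study design. When every child scores exactly 4 points higher on retest, the Pearson coefficient reports perfect association, whereas the ICC is penalized for the systematic shift.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: linear association only, blind to systematic shifts."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def icc_absolute_agreement(x, y):
    """Two-way random-effects, absolute-agreement, single-measure ICC
    for two sessions (assumed form; other ICC forms exist)."""
    n, k = len(x), 2  # n children, k = 2 sessions
    rows = list(zip(x, y))
    grand = (sum(x) + sum(y)) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(x) / n, sum(y) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between children
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_err = ss_total - ss_rows - ss_cols                    # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical scores: each child scores exactly 4 points higher on retest.
test_scores = [10, 12, 14, 16, 18]
retest_scores = [14, 16, 18, 20, 22]

r = pearson(test_scores, retest_scores)
icc = icc_absolute_agreement(test_scores, retest_scores)
print(f"Pearson r = {r:.2f}, ICC = {icc:.2f}")  # Pearson r = 1.00, ICC = 0.56
```

In this contrived case the two coefficients diverge sharply. With real data that contain little systematic change between sessions, they are often close.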

Play-based assessment tools are designed to assess a child's ability to use motor and cognitive skills during self-motivated play within contextually relevant environments. Although play-based assessment allows the child to select the activities, this systematic review demonstrates that the overall reliability of these measures is similar to that of more traditional standardized developmental tests. The slightly lower test-retest reliability is likely the result of the varied responses children display when given toys in the naturalistic environment during play without being specifically prompted to react to the toy in a certain manner, as they would be during a traditional standardized developmental test. The validity results of this review suggest that, as a group, play-based assessments measure a construct that is similar, but not identical, to the construct measured by traditional standardized developmental tests. As with the reliability findings, we suggest that the slightly lower validity of play-based assessments may be acceptable because of the naturalistic context of activities during the assessment. Studies of individual play-based assessments, however, indicate varied structural validity correlations. Because of this varied validity and the poor to fair methodological quality of the included studies, individual tests need additional research to document reliability and validity in high-quality studies. At present, results using play-based assessments should be interpreted with caution.

Limitations

This systematic review has several limitations. The broad nature of the search terms returned a very large number of records, each of which required screening by title and abstract. Although only a single author reviewed each title and abstract, the criteria to eliminate a study based on title and abstract were designed to retain any study that might meet the inclusion criteria. Two reviewers completed all other eligibility determination and data extraction. Nevertheless, these search and screening procedures could have missed other potentially relevant studies of the reliability or validity of play-based assessment tools. Also, COSMIN was originally developed for assessing the quality of health-related patient-reported outcome measures in order to assess complex subjective health changes over time. 29 Although play-based assessments do not fit the type of studies usually assessed using the COSMIN, the majority of measurement quality assessment tables could be completed without difficulty.

In conclusion, a standardized assessment of skills used during play is critical to determining the need for and efficacy of developmental intervention, yet therapists are challenged to find tools to meet this objective. 46 Although the challenge continues, the results of this systematic review demonstrate that play-based assessments have the potential to be reliable and valid tools. Researchers must continue to assess reliability and validity of specific play-based assessment tools and reassess psychometrics as adaptations are made to the tools. Before play-based assessments can be used as evaluative measures, responsiveness to change must be evaluated. Changes in skills in response to therapeutic intervention, as measured on play-based assessments, would provide not only evidence that therapy can teach a child a new skill but also evidence that the child can spontaneously use the new skills in daily activity. Determining adequate responsiveness of play-based assessments would give early developmental intervention therapists an opportunity to use play not just as a process of intervention but also as a reliable and valid method of assessing development. A primary therapeutic goal in all cultures is to enhance a child's use of functional skills for participation in age-appropriate activities such as play. Play-based assessments improve the ability of clinicians and researchers to measure the impact of therapeutic interventions during these age-appropriate activities for children.

The authors acknowledge the contributions of Virginia Commonwealth University Health Sciences librarian Jennifer McDaniel for her assistance with defining search terms.

The data were presented as a poster at the Virginia Commonwealth University Graduate Student Symposium; April 22, 2014; Richmond, Virginia. The data also have been submitted as an abstract for the American Physical Therapy Association's Combined Sections Meeting; February 4–7, 2015; Indianapolis, Indiana.