Good Behavior Game

A classroom behavior management game providing a strategy to help elementary teachers reduce aggressive, disruptive behavior and other behavioral problems in children, particularly highly aggressive children, while creating a positive and effective learning environment.

Program Outcomes

Antisocial-aggressive Behavior
Internalizing
Mental Health - Other
Suicide/Suicidal Thoughts
Tobacco

Program Type

School - Individual Strategies

Program Setting

School

Continuum of Intervention

Universal Prevention

Age

Late Childhood (5-11) - K/Elementary

Gender

Both

Race/Ethnicity

Endorsements

Blueprints: Promising
Crime Solutions: Effective
OJJDP Model Programs: Effective
SAMHSA: 3.1-3.2

Program Information Contact

Megan Sambolt
American Institutes for Research
Phone: (202) 403-5223
Email: msambolt@air.org
Website: www.air.org/resource/spotlight/good-behavior-game

Program Developer/Owner

Sheppard G. Kellam, M.D., Retired
Johns Hopkins Bloomberg School of Public Health

Brief Description of the Program

The Good Behavior Game (GBG) is a classroom-based behavior management strategy for elementary school that teachers use along with a school's standard instructional curricula. GBG uses a classroom-wide game format with teams and rewards to socialize children to the role of student and reduce aggressive, disruptive classroom behavior, which is a risk factor for adolescent and adult illicit drug abuse, alcohol abuse, cigarette smoking, antisocial personality disorder (ASPD), and violent and criminal behavior.

In GBG classrooms, the teacher assigns all children to teams, balanced with regard to gender; aggressive, disruptive behavior; and shy, socially isolated behavior. Basic classroom rules of student behavior are posted and reviewed. When GBG is played, each team is rewarded if team members commit a total of four or fewer infractions of the classroom rules during game periods. During the first weeks of the intervention, GBG is played three times a week for 10 minutes each time during periods of the day when the classroom environment is less structured and the students are working independently of the teacher. Game periods are increased in length and frequency at regular intervals; by mid-year the game may be played every day. Initially, the teacher announces the start of a game period and gives rewards at the conclusion of the game. Later, the teacher defers rewards until the end of the school day or week. Over time, GBG is played at different times of the day, during different activities, and in different locations, so the game evolves from being highly predictable in timing and occurrence with immediate reinforcement to being unpredictable, with delayed reinforcement so that children learn that good behavior is expected at all times and in all places.

Full Description

In GBG classrooms, the teacher assigns all children to teams, usually three, balanced with regard to gender; aggressive, disruptive behavior; and shy, socially isolated behavior. The teacher assigns a team leader, usually a shy child, to organize activities and pass out rewards. Next, the teacher explains the rules of the game, describing what behaviors will not be allowed during the period in which the GBG is played (which are usually verbal disruption, physical disruption, out-of-seat without permission, and noncompliance), and the rules are posted in the classroom.

During the game, the teacher notes problem behaviors by placing a checkmark next to a team's name whenever one of its members displays a prohibited behavior. The teacher neutrally states the behavior that was displayed, identifies the child who displayed it, and praises the other teams for behaving well. A team wins the game if it has no more than four checkmarks at the end of the game period, and more than one team can win. Initially, winning team members receive tangible rewards (stickers, erasers) and activities (extra recess, class privileges). In addition, any team that wins a game during the week receives a special reward on Friday (such as a party or an outdoor activity). Non-winners engage in quiet seat-work during this time and receive no special attention from the teacher.
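The scoring rule can be sketched in a few lines of Python. This is a minimal illustration and not part of the GBG materials: the team names, the `MAX_CHECKMARKS` constant, and both function names are hypothetical. The only rules encoded are the ones stated in this description: a team wins a game period if it has four or fewer checkmarks, and any team that wins at least one game during the week earns the Friday reward.

```python
from collections import Counter

# Threshold taken from the description: a team wins if its
# checkmarks do not exceed four at the end of the game period.
MAX_CHECKMARKS = 4


def game_winners(teams, infractions):
    """Return the teams that win one game period.

    teams: list of team names (hypothetical labels).
    infractions: list of team names, one entry per checkmark recorded.
    """
    checkmarks = Counter(infractions)
    # More than one team can win; a team with no checkmarks counts as zero.
    return [team for team in teams if checkmarks[team] <= MAX_CHECKMARKS]


def friday_reward_teams(weekly_games):
    """Return the teams that won at least one game during the week."""
    winners = set()
    for teams, infractions in weekly_games:
        winners.update(game_winners(teams, infractions))
    return sorted(winners)


teams = ["Red", "Blue", "Green"]
monday = (teams, ["Red"] * 5 + ["Blue"] * 2)      # Red exceeds four checkmarks
wednesday = (teams, ["Red"] * 6 + ["Green"] * 4)  # Green has exactly four and still wins

print(game_winners(*monday))
print(game_winners(*wednesday))
print(friday_reward_teams([monday, wednesday]))
```

In this example Red loses both games, so only Blue and Green qualify for the Friday reward.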

During the first weeks of the intervention, GBG is played three times a week for 10 minutes each time during periods of the day when the classroom environment is less structured and the students are working independently of the teacher. Game periods are increased in length and frequency at regular intervals; by mid-year the game is played every day. Initially, the teacher announces the start of a game period and gives rewards at the conclusion of the game. Later, the teacher defers rewards until the end of the school day or week. Over time, GBG is played at different times of the day, during different activities, and in different locations, so the game evolves from being highly predictable in timing and occurrence with immediate reinforcement to being unpredictable with delayed reinforcement so that children learn that good behavior is expected at all times and in all places.

Outcomes

Primary Evidence Base for Certification

Study 1

The eight reports found the following:

At posttest (end of first grade), the GBG had a significant impact on aggressive and shy behavior for both males and females as rated by teachers (Dolan et al., 1993).
Peer nominations of aggressive behavior by their classmates were significantly reduced at posttest for males (Dolan et al., 1993).
At posttest, for peer and teacher ratings, the more severely aggressive children responded the most to GBG (Dolan et al., 1993).
At the six-year follow-up, there were no main program effects for the total population of males or females, but GBG had an increasing effect on aggression among males at or above the median on aggression in first grade (Kellam et al., 1994).
At 14-year follow-up, GBG, compared to internal controls and all controls, had significant impact on lifetime alcohol abuse/dependence and antisocial behavior among all youth, and on smoking and lifetime illicit drug use among males (Kellam et al., 2008).
In young adulthood, Cohort 2 GBG males were less likely to have used any services or mental and medical health services (Poduska et al., 2008).
GBG students in Cohort 1 at ages 19-21 were significantly less likely to have experienced suicide ideation as compared to those in the control group, and mixed effects, depending upon the model used, were found for suicide attempts (Wilcox et al., 2008).
Cohort 1 and 2 GBG males in the high aggression trajectory showed significantly lower slopes of aggressive, disruptive behavior, sustained through 7th grade; for females in the high aggression trajectory, the reduction was sustained through grade 4 only (Petras et al., 2008).
Cohort 1 GBG males in the high aggression trajectory showed lower rates of antisocial personality disorder and violent and criminal behavior by young adulthood compared to controls (Petras et al., 2008).
Cohort 1 GBG males in the high aggression trajectory, as compared to controls, had a higher prevalence of lifetime condom use, later initiation of vaginal sex, and lower prevalence of a lifetime high risk sexual behavior composite score by ages 19-21 (Kellam et al., 2014).

Brief Evaluation Methodology

Primary Evidence Base for Certification

Of the 10 studies Blueprints has reviewed, one (Study 1) meets Blueprints evidentiary standards (specificity, evaluation quality, impact, dissemination readiness). The study was done by the developer.

Study 1

Dolan et al. (1993), Kellam and Rebok (1992), and six other articles reported on a randomized controlled trial that examined 19 schools in five different urban areas of Baltimore City, Maryland. The schools were assigned to one of three conditions: the Good Behavior Game, Mastery Learning, or an external no-intervention control group. Within each intervention school, first-grade classrooms and their teachers were randomly assigned either to deliver the intervention or to serve as an internal control class (receiving no intervention). The full trial involved 42 classrooms in 19 schools. A cohort of 1,084 first-grade children was assessed at baseline from among 1,197 children available for participation. The short-term impact of the GBG on aggression and shyness was assessed at posttest, then annually through middle school, and the long-term impact of the GBG was examined in a 14-year follow-up study. While the first cohort of students continued to receive the intervention in Grade 2, a new cohort of students entered the first grade and was also assigned to conditions for evaluation purposes. Cohort 2 consisted of 1,117 first-grade children.

Blueprints Certified Studies

Study 1

Dolan, L. J., Kellam, S. G., Brown, C. H., Werthamer-Larsson, L., Rebok, G. W., Mayer, L. S., . . . Wheeler, L. (1993). The short-term impact of two classroom-based preventive interventions on aggressive and shy behaviors and poor achievement. Journal of Applied Developmental Psychology, 14, 317-345.

Kellam, S. G., Brown, C. H., Poduska, J., Ialongo, N., Wang, W., Toyinbo, P., . . . Wilcox, H. (2008). Effects of a universal classroom behavior management program in first and second grades on young adult behavioral, psychiatric, and social outcomes. Drug and Alcohol Dependence, 95(Suppl 1), S5-S28.

Poduska, J. M., Kellam, S. G., Wang, W., Brown, C. H., Ialongo, N. S., & Toyinbo, P. (2008). Impact of the Good Behavior Game, a universal classroom-based behavior intervention, on young adult service use for problems with emotions, behavior, or drugs or alcohol. Drug and Alcohol Dependence, 95(Suppl 1), S29-S44.

Wilcox, H. C., Kellam, S. G., Brown, C. H., Poduska, J. M., Ialongo, N. S., Wang, W., & Anthony, J. C. (2008). The impact of two universal randomized first- and second-grade classroom interventions on young adult suicide ideation and attempts. Drug and Alcohol Dependence, 95(Suppl 1), S60-S73.

Risk and Protective Factors

Risk Factors

Individual: Antisocial/aggressive behavior*, Early initiation of antisocial behavior

Protective Factors

Individual: Clear standards for behavior

Peer: Interaction with prosocial peers

School: Opportunities for prosocial involvement in education, Rewards for prosocial involvement in school

* Risk/Protective Factor was significantly impacted by the program

Subgroup Analysis Details

Gender Specific Findings

Male
Female

Subgroup Analysis Details

Subgroup differences in program effects by race, ethnicity, or gender (coded in binary terms as male/female) or program effects for a sample of a specific racial, ethnic, or gender group:

Study 1 (Kellam et al., 2008; Wilcox et al., 2008) tested for subgroup differences in program effects by gender and found either stronger benefits for males or equal benefits for males and females. The study also tested for within-subgroup program effects by gender and found significant benefits for both males and females (Kellam et al., 1994, 2008, 2014; Petras et al., 2008; Poduska et al., 2008).

Sample demographics including race, ethnicity, and gender for Blueprints-certified studies:

The Study 1 sample was 49% male, 65% African American, 31% Caucasian, and 4% other ethnic groups.

Training and Technical Assistance

There are two strands of professional development for GBG: one for teachers and one for local coaches and trainers.

Professional Development for Teachers
Teacher training and support focuses on understanding the practices and procedures for effectively implementing GBG and integrating GBG core elements into daily classroom life at high levels of quality over time. Teachers are trained in GBG practices through several activities.

Initial Two-Day Group-Based GBG Training: Teachers participate in two days of initial training on GBG. The training includes formal sessions on GBG core elements, demonstrations of strategies and procedures, guided practice in delivering GBG, and guidance on generalizing the practices.
Booster Session: The initial training is supported by a six-hour group-based booster session that is held in the middle of the school year.
Support of the Coach: In addition to group-based trainings, coaching is provided directly to teachers in their classrooms. GBG coaches provide ongoing support throughout the school year. In the first semester, the coach visits each teacher in the classroom every other week for 90 minutes. In the second semester, the amount of coaching is individualized based on the needs of each teacher. GBG coaches work with teachers to set up optimal conditions for playing GBG, determine appropriate times to play the game, establish GBG teams, and choose appropriate rewards and incentives for students. GBG coaches spend their time working directly with teachers in classrooms by observing, planning, modeling and mentoring, and providing feedback.

Professional Development for GBG Coaches
Local GBG coaches are trained in situ over one year as they work with teachers.

Group-Based Training: GBG coaches participate in the two-day initial GBG training and the booster session as described above.
Initial Coach Training: There is an additional day for coaches which focuses on (1) using data and records to support effective GBG implementation and inform professional development; (2) designing individual professional development plans and providing ongoing support to teachers; (3) working with adult learners; and (4) providing group professional development to teams of teachers, other school-level staff, and principals.
Implementation Audits: As part of the certification process for a GBG coach, AIR trainers (American Institutes for Research, the program purveyors) conduct at least two implementation audits over the course of the school year. During each audit, an AIR trainer observes the GBG coach working with teachers.
Regular Support: Until certified, GBG coaches participate in supervision and support with AIR via e-mail and telephone. Thereafter, AIR provides support as needed. GBG coaches must work with at least five teachers during the course of a school year to become certified.

Training Certification Process

GBG Trainers
GBG coaches can become GBG trainers through additional training. Once certified as a GBG coach, an individual can participate in an additional year of training to become a GBG trainer. GBG trainers can lead all activities to train teachers. AIR conducts group-based trainings in a school district until a local GBG trainer is certified to conduct these trainings.

Co-led Group-Based Trainings: The GBG trainer co-leads the group-based trainings for teachers alongside an AIR trainer.
Implementation Audits: AIR conducts two on-site audits to observe the GBG trainer.
Regular Support: Until certified, GBG trainers participate in regular supervision and support with AIR via telephone and e-mail. Thereafter, AIR provides support as needed.

Benefits and Costs

Program Benefits (per individual): $11,983
Program Costs (per individual): $186
Net Present Value (Benefits minus Costs, per individual): $11,797
Measured Risk (odds of a positive Net Present Value): 76%

Source: Washington State Institute for Public Policy
All benefit-cost ratios are the most recent estimates published by The Washington State Institute for Public Policy for Blueprints programs implemented in Washington State. These ratios are based on a) meta-analysis estimates of effect size and b) monetized benefits and calculated costs for programs as delivered in the State of Washington. Caution is recommended in applying these estimates of the benefit-cost ratio to any other state or local area. They are provided as an illustration of the benefit-cost ratio found in one specific state. When feasible, local costs and monetized benefits should be used to calculate expected local benefit-cost ratios. The formula for this calculation can be found on the WSIPP website.
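The arithmetic behind the figures above can be checked with a short Python sketch. It is an illustration only: the net present value is the stated benefits minus costs, while the ratio shown is a plain division of the two per-individual figures, which is an assumption here and not necessarily the WSIPP formula. The variable names are hypothetical, and local figures can be substituted for the Washington State values.

```python
# Per-individual figures as reported above (Washington State estimates).
benefits = 11_983
costs = 186

# Net present value as defined above: benefits minus costs.
net_present_value = benefits - costs

# Simple ratio of the two figures (an assumption, not the WSIPP method).
benefit_cost_ratio = benefits / costs

print(f"Net present value: ${net_present_value:,}")
print(f"Benefit-cost ratio: {benefit_cost_ratio:.2f}")
```

The result reproduces the reported net present value of $11,797 per individual.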

Start-Up Costs

Initial Training and Technical Assistance

The Good Behavior Game purveyor, American Institutes for Research (AIR), aims to develop local capacity to deliver the program. The program is implemented by teachers in the classroom. There are two strands of professional development: one for teachers and one for local GBG coaches and trainers. Teachers receive one year of GBG training that consists of group-based sessions enhanced by the support of a coach in the classroom. Local coaches are trained in situ over one year as they work with teachers. Teachers receive 3 days of on-site training: a 2-day Initial GBG Training and a 1-day Booster Session. Training for program coaches includes attending the sessions above plus a one-day Initial Coach Training. Coaches also receive at least two on-site Implementation Audits and up to 100 hours of technical assistance by phone and email in their first year. For the group-based trainings, the ratio is one trainer for up to 16 trainees. Costs for training program coaches are the same for 1-5 coaches. One full-time coach can support up to 16 teachers. Trainer on-site: $3,000/day plus travel expenses; Trainer off-site: $200/hour for email/phone support and preparation for trainings.

Curriculum and Materials

An initial set of teacher training and classroom materials costs $600 per teacher/class, student incentives are budgeted at $100 per class per year, and coach training materials cost $200 per set.

Licensing

None at this time.

Other Start-Up Costs

Readiness trainings, which combine on-site sessions with e-mail and phone contact for a total of 1-5 days, cost $3,000 per day plus travel for on-site work and $200 per hour for off-site work.

Intervention Implementation Costs

Ongoing Curriculum and Materials

Annual cost of $200 per classroom for replacement supplies and $100 for student incentives.

Staffing

Qualifications: Teachers are the primary implementers of the program and it is incorporated into the regular school day.

Ratios: The teacher-to-student ratio reflects the regular ratios in the school. Coaches may support teachers in multiple schools. A full-time coach can support up to 16 teachers.

Time to Deliver Intervention: The Good Behavior Game is implemented throughout the school day as a behavior management strategy. It does not require dedicated class time.

Other Implementation Costs

Mileage if coaches must travel between schools.

Implementation Support and Fidelity Monitoring Costs

Ongoing Training and Technical Assistance

One-day on-site booster sessions are available for up to 20 participants. Ongoing support to teachers can be provided by a trained local coach. Once certified as a GBG coach, an individual can participate in additional training to become a GBG trainer. Trainers can lead all activities to train teachers, including the group-based trainings described above.

Fidelity Monitoring and Evaluation

A web-based fidelity monitoring system is under development.

Ongoing License Fees

None.

Other Implementation Support and Fidelity Monitoring Costs

No information is available.

Other Cost Considerations

None.

Year One Cost Example

If a school district were to implement Good Behavior Game in 3 schools with 10 teachers per school, the following year one costs can be anticipated, excluding travel costs.

Initial GBG training: two days	$6,000.00
Booster session: one day	$3,000.00
Coach Training: one day	$3,000.00
Implementation Audits: two, one day each	$6,000.00
TA and prep support: 150 hours	$30,000.00
Materials and student incentives ($700 per class)	$10,500.00
Salary and fringe (one full-time coach)	$100,000.00
Total One Year Cost	$158,500.00

With 30 participating teachers having classes averaging 30 students, 900 students would receive the intervention at a cost of $176 per student for year one. With 30 participating teachers having classes averaging 30 students, over five years, with 4,050 students receiving the intervention, the cost is $81 per student.
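The year-one total and per-student figure can be reproduced with a short Python sketch. It is illustrative only: the constant and variable names are hypothetical, the day and hour rates are the ones stated in the start-up cost section, and the materials and salary amounts are taken as listed in the table rather than recomputed.

```python
# Rates stated in the start-up cost section (travel excluded).
ON_SITE_DAY_RATE = 3_000   # trainer on-site, dollars per day
OFF_SITE_HOUR_RATE = 200   # trainer off-site, dollars per hour

line_items = {
    "Initial GBG training (2 days)": 2 * ON_SITE_DAY_RATE,
    "Booster session (1 day)": 1 * ON_SITE_DAY_RATE,
    "Coach training (1 day)": 1 * ON_SITE_DAY_RATE,
    "Implementation audits (2 x 1 day)": 2 * ON_SITE_DAY_RATE,
    "TA and prep support (150 hours)": 150 * OFF_SITE_HOUR_RATE,
    # The next two amounts are copied from the table as listed.
    "Materials and student incentives": 10_500,
    "Salary and fringe (one full-time coach)": 100_000,
}

teachers = 3 * 10          # 3 schools with 10 teachers each
students = teachers * 30   # classes averaging 30 students

total = sum(line_items.values())
cost_per_student = total / students

for name, amount in line_items.items():
    print(f"{name}: ${amount:,.2f}")
print(f"Total one year cost: ${total:,.2f}")
print(f"Year-one cost per student: ${cost_per_student:.0f}")
```

Swapping in local rates, staffing costs, and class sizes gives a district-specific estimate under the same structure.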

Funding Overview

With teachers implementing Good Behavior Game during regular class time, the main categories of cost are initial training, materials and coach salaries. Grants, whether federal or from the foundation community, are a good option for initial costs.

Allocating State or Local General Funds

State education funds as well as local school district funding should be considered for both training and coach salaries.

Maximizing Federal Funds

Formula Funds: Title I funds can potentially support curricula purchase, training and teacher salaries in schools that are operating schoolwide Title I programs. While Good Behavior Game is integrated into the curriculum, it must be shown to contribute to overall academic achievement. Title II funds have been used by districts to provide GBG training and support to teachers.

Discretionary Grants: Federal discretionary grants from the Department of Education have been used to fund the initial training of teachers.

Foundation Grants and Public-Private Partnerships

Foundations can be a good source of funds for initial training, curricula, and coaches.

Generating New Revenue

Developing a community base which includes broad representation from multiple sectors including business, civic associations, parents, health, the pastoral community, and education can provide a strong foundation for ensuring sustainability of funding and support.

Data Sources

All information comes from the responses to a questionnaire submitted by the purveyor, American Institutes for Research, to the Annie E. Casey Foundation.

Program Developer/Owner

Sheppard G. Kellam, M.D., Retired
Johns Hopkins Bloomberg School of Public Health
615 North Wolfe Street
Baltimore, MD 21205
(410) 614-0680
kellam@jhsph.edu
www.air.org/focus-area/education/?id=127

Program Specifics

Program Type

School - Individual Strategies

Program Setting

School

Continuum of Intervention

Universal Prevention

Program Goals

Population Demographics

The Good Behavior Game has been implemented in first-grade classrooms serving low- to middle-SES students in the U.S.

Target Population

Age

Late Childhood (5-11) - K/Elementary

Gender

Both

Gender Specific Findings

Male
Female

Race/Ethnicity

Other Risk and Protective Factors

Risk: early aggressive, disruptive behavior

Risk/Protective Factor Domain

Individual
School

Description of the Program

Theoretical Rationale

The Good Behavior Game (GBG) was first described by Barrish, Saunders, and Wolf (1969) and subsequently studied in many short-term non-randomized observational studies with encouraging reports. It was chosen in the Baltimore developmental epidemiological multi-level randomized field trials because it specifically targeted early and persistent aggressive, disruptive classroom behaviors, which as early as first grade have been repeatedly found to be a shared risk factor for later drug abuse, antisocial, and alcohol abuse disorders, as well as regular smoking and risky sex behaviors. GBG is directed at this early antecedent in the first grade as exhibited in behaviors such as fighting, destroying others' property, jumping out of seat, or talking without permission, i.e., breaking classroom rules. Researchers hypothesized that GBG would reduce these early antecedent risk behaviors and thereby lower the risk for later antisocial and drug and alcohol abuse disorders, regular tobacco use and other risky behaviors and outcomes in adolescence and young adulthood. As GBG is designed to be delivered to all children, the intervention is expected to change the developmental trajectories for youth who initially exhibited aggressive behaviors in first grade, with lesser effects on students who exhibited little or no aggressive behavior.

Life course/social field theory has guided the Baltimore prevention trials. Central to the theory is the concept that at each stage of life, individuals are involved in social fields and are faced with social task demands specific to that social field. Task demands of the first grade classroom include sitting still, paying attention, and obeying classroom rules. Aggressive, disruptive behavior is a maladaptive response to the classroom and teacher. Successful performance at an earlier stage of development is hypothesized to increase the likelihood of success at a later stage in the life course. Further, the theory posits that psychological well-being may be reciprocally and positively related to how successfully an individual meets social task demands.

Theoretical Orientation

Behavioral

Outcomes (Brief, over all studies)

Primary Evidence Base for Certification

Study 1

Dolan et al. (1993), Kellam and Rebok (1992), and six other articles reported that the program had a significant impact on aggressive and shy behavior for both males and females as rated by teachers at posttest at the end of first grade. Peer nominations of aggressive behavior by their classmates were also significantly reduced for males. Off-task behavior as obtained through independent observers in the classroom was reduced by the end of first grade, particularly for children doing seat work. These results were mixed, with some significant in comparison to the internal control group and some significant only for the external control group. By both peer and teacher ratings, the more severely aggressive children responded the most to the Good Behavior Game.

At the six-year follow-up, there were increasing effects of GBG for males who were aggressive at the beginning of first grade. GBG had no effect on aggressive behavior for females or for low-aggression males; these children had much lower rates of aggressive, disruptive behavior at the beginning of first grade.

At 14-year follow-up, GBG was found to have a significant impact on drug abuse/dependence and smoking for males. The program also significantly reduced alcohol abuse/dependence and antisocial personality disorder for males and females combined. These results generally demonstrated the highest impact among those with the highest aggressive behavior in the first grade (as rated by teachers). There were no significant effects on high school graduation, anxiety disorder or depression.

Poduska et al. (2008) found in an examination of young adult service use for problems with emotions, behavior, drugs or alcohol that males were less likely to have used any services and mental or medical provider services (in Cohort 2 only), and social service usage was lower among the combined sample of GBG males and females in Cohort 1. There were no effects on school-based, drug treatment, and juvenile justice services.

Petras et al. (2008) found that Cohort 1 males in the persistent high trajectory (for aggressive, disruptive behavior) showed significantly lower slopes of aggressive, disruptive behavior sustained through seventh grade and lower rates of antisocial personality disorder and violence and criminal behavior by young adulthood compared to controls. Persistent high females showed a significantly lower slope of aggressive, disruptive behavior through the fourth grade only. In Cohort 2, persistent high trajectory boys demonstrated a significant reduction in aggressive, disruptive behavior through seventh grade, but not into young adulthood. There were no significant program effects for girls in Cohort 2.

Wilcox et al. (2008) found that students in Cohort 1, but not Cohort 2, were significantly less likely to have experienced suicide ideation at ages 19-21, as compared to those in the control group. Also for Cohort 1, there was some evidence that GBG students were less likely to have attempted suicide, though these results were mixed depending on the model used.

Kellam et al. (2014) found that the program significantly reduced sexual risk behaviors through young adulthood for males who were aggressive in first grade. Early aggressive behavior is a risk factor for risky sex behaviors that increase the risk of HIV and other sexually transmitted diseases. GBG had a strong impact on reducing unprotected sex and initiating sex before the age of 14 in these early aggressive males.

Generalizability

One study meets Blueprints standards for high-quality methods with strong evidence of program impact (i.e., "certified" by Blueprints): Study 1 (Dolan et al., 1993; Kellam et al., 1992, 1994, 2008, 2014; Petras et al., 2008; Poduska et al., 2008; Wilcox et al., 2008). The sample for the study included first-grade students.

Study 1 took place in Baltimore City, Maryland, and compared the treatment group to a no-intervention control group.

Potential Limitations

Additional Studies (not certified by Blueprints)

Study 2 (van Lier et al., 2004, 2005, 2009; Vuijk et al., 2006, 2007)

There were limitations in the implementation of the program: 1) many teachers found it difficult to emphasize positive behavior and not respond to negative behavior; 2) of the 13 schools, only 9 implemented the GBG program completely; 3) in one school, the GBG was implemented very poorly (only the introduction phase was used). GBG was implemented with adaptations in the Dutch study, e.g., children who displayed poor behavior during the game were not called out.
There was also a limitation in that 17 children were moved from a control class into the intervention class (they were included in the analyses as intervention group participants). There were not enough schools to detect school-level influences on intervention effectiveness. Although the first analysis had only one reporter (teachers implementing the intervention were also rating the students on report forms), subsequent analyses over time used different reporters. Children lost at follow-up tended to have higher teacher-rated ADH problems, ODD problems, and conduct problems at baseline.
Additionally, findings are always among a subgroup of children, and this varies from analysis to analysis. For instance, in one analysis the results are among the intermediate antisocial trajectory of children, while in another analysis results are found in the high antisocial trajectory.

van Lier, P., Huizink, A., & Crijnen, A. (2009). Impact of a preventive intervention targeting childhood disruptive behavior problems on tobacco and alcohol initiation from age 10 to 13 years. Drug and Alcohol Dependence, 100, 228-233.

van Lier, P., Muthen, B., van der Sar, R., & Crijnen, A. (2004). Preventing disruptive behavior in elementary schoolchildren: Impact of a universal classroom-based intervention. Journal of Consulting and Clinical Psychology, 72(3), 467-478.

van Lier, P., Vuijk, P., & Crijnen, A. (2005). Understanding mechanisms of change in the development of antisocial behavior: The impact of a universal intervention. Journal of Abnormal Child Psychology, 33(5), 521-535.

Vuijk, P., van Lier, P., Crijnen, A., & Huizink, A. (2007). Testing sex-specific pathways from peer victimization to anxiety and depression in early adolescents through a randomized intervention trial. Journal of Affective Disorders, 100(1), 221-226.

Vuijk, P., van Lier, P., Huizink, A., Verhulst, F., & Crijnen, A. (2006). Prenatal smoking predicts non-responsiveness to an intervention targeting attention-deficit/hyperactivity symptoms in elementary schoolchildren. Journal of Child Psychology and Psychiatry, 47(9), 891-901.

Study 3 (Bradshaw et al., 2009; Furr-Holden et al., 2004; Ialongo et al., 1999, 2001; Storr et al., 2002; Wang et al., 2009)

Relatively small sample with 18 classrooms randomly assigned to three conditions. This evaluation cannot be considered a replication of the Good Behavior Game because it includes the enhanced academic curricula implemented with GBG and the effects of the program cannot be attributed to GBG alone.

Bradshaw, C. P., Zmuda, J. H., Kellam, S. G., & Ialongo, N. S. (2009). Longitudinal impact of two universal preventive interventions in first grade on educational outcomes in high school. Journal of Educational Psychology, 101(4), 926-937.

Furr-Holden, C. D. M., Ialongo, N. S., Anthony, J. C., Petras, H., & Kellam, S. G. (2004). Developmentally inspired drug prevention: Middle school outcomes in a school-based randomized prevention trial. Drug and Alcohol Dependence, 73(2), 149-158.

Ialongo, N. S., Werthamer, L., Kellam, S. G., Brown, C. H., Wong, S., & Lin, Y. (1999). Proximal impact of two first-grade preventive interventions on the early risk behaviors for later substance abuse, depression, and antisocial behavior. American Journal of Community Psychology, 27(5), 599-641.

Ialongo, N., Poduska, J., Werthamer, L., & Kellam, S. (2001). The distal impact of two first-grade preventive interventions on conduct problems and disorder in early adolescence. Journal of Emotional and Behavioral Disorders, 9(3), 146-160.

Storr, C. L., Ialongo, N. S., Kellam, S. G., & Anthony, J. C. (2002). A randomized controlled trial of two primary school intervention strategies to prevent early onset tobacco smoking. Drug and Alcohol Dependence, 66(1), 51-60.

Wang, Y., Browne, D. C., Petras, H., Stuart, E. A., Wagner, F. A., Lambert, S. F., . . . Ialongo, N. S. (2009). Depressed mood and the effect of two universal first grade preventive interventions on survival to the first tobacco cigarette smoked among urban youth. Drug and Alcohol Dependence, 100(3), 194-203.

Study 4 (Barrish et al., 1969)

The sample was very small and there was no control group. It is also questionable whether the results will be sustained over time.

Barrish, H. H., Saunders, M., & Wolf, M. M. (1969). Good behavior game: Effects of individual contingencies for group consequences on disruptive behaviors in a classroom. Journal of Applied Behavior Analysis, 2(2), 119-124.

Study 5 (Medland and Stachnik, 1972)

No control group.

Medland, M. B., & Stachnik, T. J. (1972). Good-Behavior Game: A replication and systematic analysis. Journal of Applied Behavior Analysis, 5(1), 45-51.

Study 6 (Mihalic et al., 2011)

This study used internal control classrooms only, thus there was some potential for contamination. However, it was stated that there was no evidence of control classroom teachers attempting to implement GBG practices.
For girls, differences existed between intervention and control groups on some measures at pretest; these differences were controlled in analyses. Though attrition was substantial by the one-year follow-up (27%), the remaining sample of boys was representative of the pretest sample of boys. The remaining GBG girls sample reflected differences on only two measures when compared to the original full sample of girls.
The TOCA-R measures were provided by teachers who may have interpreted the 6-point scale (almost never through almost always) differently. However, models using scale scores standardized within each teacher produced similar results.

Mihalic, S., Huizinga, D., & Ladika, A. (2011). An evaluation of the Good Behavior Game intervention. Robert Wood Johnson Foundation, Princeton, NJ.

Study 7 (Leflot et al., 2010, 2013)

Two of the outcome measures (peer rejection and aggression) were gathered by asking children to rate their peers, but the study reported no information on the validity of such ratings.
Observers assigning scores for on-task behavior and doing interviews were not blind to the condition.
The study did not attempt to follow students who were retained in grade or moved from the school, perhaps violating intent to treat.
There was some evidence of differential attrition on two outcome variables; even so, the imputation of data for missing children was assumed to be missing at random.
No long-term follow-up data were gathered.
The students and teachers in intervention and control condition in the same school may have interacted, resulting in potential contamination effects.
A directional path from peer relations to the development of aggression is assumed in this study, although causality is likely to also operate in the opposite direction.

Leflot, G., van Lier, P. A., Onghena, P., & Colpin, H. (2010). The role of teacher behavior management in the development of disruptive behaviors: An intervention study with the Good Behavior Game. Journal of Abnormal Child Psychology, 38(6), 869-882.

Leflot, G., van Lier, P. A., Onghena, P., & Colpin, H. (2013). The role of children's on-task behavior in the prevention of aggressive behavior development and peer rejection: A randomized controlled study of the Good Behavior Game in Belgian elementary classrooms. Journal of School Psychology, 51(2), 187-199.

Study 8 (Spilt et al., 2013; Witvliet et al., 2009)

Baseline equivalence of the treatment and control groups was not fully obtained; children in the control group had lower SES scores and were more often of minority ethnic background.
Large inequalities in subsample sizes compromised statistical power to detect within and between group differences.
Analyses did not take into account the duration and stability of baseline characteristics.
Although three different teachers provided reports on child outcomes, for intervention students two of the three teachers (1st and 2nd grade) also implemented GBG. This may have introduced some bias to the measures.
No follow-up information was provided to determine if effects lasted beyond the treatment period.

Spilt, J. L., Koot, J. M., & van Lier, P. A. C. (2013). For whom does it work? Subgroup differences in the effects of a school-based universal prevention program. Prevention Science, 14(5), 479-488.

Witvliet, M., van Lier, P. A. C., Cuijpers, P., & Koot, H. M. (2009). Testing links between childhood positive peer relations and externalizing outcomes through a randomized controlled intervention study. Journal of Consulting and Clinical Psychology, 77(5), 905-915.

Study 9 (Mitchell et al., 2015)

No comparison group
No posttest student N provided
Measures rated by researchers not blind to condition
Intent-to-treat was not specified; presumably all students in the classroom were included, but one classroom withdrew early and was dropped
Baseline equivalence not discussed
Differential attrition was not discussed, although attrition is likely high as 1 of 3 classrooms withdrew from the study early
Before/after comparisons showed improvement in classrooms, but no significance tests or control classrooms
Sample is small and may not be generalizable

Mitchell, R. R., Tingstrom, D. H., Dufrene, B. A., Ford, W. B., & Sterling, H. E. (2015). The effects of the Good Behavior Game with general-education high school students. School Psychology Review, 44(2), 191-207.

Study 10 (Humphrey et al., 2018; Ashworth, Humphrey et al., 2020; Ashworth, Panayiotou et al., 2020; Troncoso & Humphrey, 2021)

Some student measures came from teachers who delivered the program
Some evidence of differential attrition
No posttest effects on child or teacher outcomes

Humphrey, N., Hennessey, A., Ashworth, E., Frearson, K., Black, L., Petersen, K., . . . Pampaka, M. (2018). Good Behaviour Game: Evaluation report and executive summary. London: Education Endowment Foundation.

Ashworth, E., Humphrey, N., & Hennessey, A. (2020). Game over? No main or subgroup effects of the Good Behavior Game in a randomized trial in English primary schools. Journal of Research on Educational Effectiveness, 13(2), 298-321. doi:10.1080/19345747.2019.1689592

Ashworth, E., Panayiotou, M., Humphrey, N., & Hennessey, A. (2020). Game on-complier average causal effect estimation reveals sleeper effects on academic attainment in a randomized trial of the Good Behavior Game. Prevention Science, 21(2), 222-233. doi:10.1007/s11121-019-01074-6

Troncoso, P., & Humphrey, N. (2021). Playing the long game: A multivariate multilevel non-linear growth curve model of long-term effects in a randomized trial of the Good Behavior Game. Journal of School Psychology, 88, 68-84.

Study 11 (Wilcox et al., 2022)

Teachers who delivered the program provide some of the child measures
Adjusted for clustering but the level-2 sample of 24 may be too small
No tests for baseline equivalence using sociodemographic measures
Program effects only for gender, trajectory, cohort interactions
Program effects at posttest (with non-independent ratings) are not separated from program effects at long-term (with independent ratings)

Wilcox, H. C., Petras, H., Brown, H. C., & Kellam, S. G. (2022). Testing the impact of the whole-day Good Behavior Game on aggressive behavior: Results of a classroom-based randomized effectiveness trial. Prevention Science, 23(6), 907-921.

Notes

It should be noted that while the Good Behavior Game has achieved encouraging results sustained for at least one year in separate, independent evaluations, the most significant of these results appear to be for males who entered first grade with high levels of aggressive, disruptive behavior. Girls and boys demonstrating less severe aggressive behavior do benefit from the Good Behavior Game, but their benefits are not as pronounced as those seen with more severely aggressive males.

The Good Behavior Game has also been used as a component of the LIFT program, and this program has been found to reduce participants' physical aggression, association with deviant peers, alcohol, and substance use. In the LIFT program, participants receive a combination of classroom curriculum (stressing social and problem-solving skills), parenting skills classes, and a slightly modified version of the GBG. In this version, participants are divided into small groups, and team members earn individual and group rewards for refraining from destructive/aggressive behaviors and displaying prosocial behaviors on the playground. An evaluation of both short- and long-term effects of the LIFT program found that LIFT students displayed fewer acts of physical aggression on the playground (with the most dramatic reductions found for the most aggressive students), and LIFT students were significantly less likely to be arrested 30 months after the program ended. While the effects of the GBG component on participants cannot be isolated from the effects of the other components, these findings help reinforce the positive outcomes found in previous evaluations of the GBG program, as well as emphasize that the program can be generalized to other settings, such as the playground.

Note on Study 2:
This study previously included the following article, but it was removed from this database. The article was retracted from the publishing journal by agreement between the authors and Journal Editor-in-Chief due to concerns surrounding accuracy of some data.

Vuijk, P., van Lier, P., Huizink, A., Verhulst, F., & Crijnen, A. (2006). Prenatal smoking predicts non-responsiveness to an intervention targeting attention-deficit/hyperactivity symptoms in elementary schoolchildren. Journal of Child Psychology and Psychiatry, 47(9), 891-901.

References

Study 1

Certified Dolan, L. J., Kellam, S. G., Brown, C. H., Werthamer-Larsson, L., Rebok, G. W., Mayer, L. S., . . . Wheeler, L. (1993). The short-term impact of two classroom-based preventive interventions on aggressive and shy behaviors and poor achievement. Journal of Applied Developmental Psychology, 14, 317-345.

Kellam, S. G., & Rebok, G. W. (1992). Building developmental and etiological theory through epidemiologically based prevention intervention trials. In J. McCord & R. E. Tremblay (Eds.), Preventing antisocial behavior (pp. 162-195). NY: The Guilford Press.

Certified Kellam, S. G., Brown, C. H., Poduska, J., Ialongo, N., Wang, W., Toyinbo, P., . . . Wilcox, H. (2008). Effects of a universal classroom behavior management program in first and second grades on young adult behavioral, psychiatric, and social outcomes. Drug and Alcohol Dependence, 95(Suppl 1), 5-28.

Kellam, S. G., Rebok, G. W., Ialongo, N., & Mayer, L. S. (1994). The course and malleability of aggressive behavior from early first grade into middle school: Results of a developmental epidemiologically-based preventive trial. Journal of Child Psychology and Psychiatry, 35(2), 259-282.

Kellam, S. G., Wang, W., Mackenzie, A. C. L., Brown, C. H., Ompad, D. C., Or, F., . . . Windham, A. (2014). The impact of the Good Behavior Game, a universal classroom based preventive intervention in first and second grades, on high risk sexual behaviors and drug abuse and dependence disorders in young adulthood. Prevention Science, 15(Suppl 1), S6-S18.

Petras, H., Kellam, S. G., Brown, C. H., Muthen, B. O., Ialongo, N. S., & Poduska, J. M. (2008). Developmental epidemiological courses leading to antisocial personality disorder and violent criminal behavior: Effects by young adulthood of a universal preventive intervention in first- and second-grade classrooms. Drug and Alcohol Dependence, 95(Suppl 1), 45-59.

Certified Poduska, J. M., Kellam, S. G., Wang, W., Brown, C. H., Ialongo, N. S., & Toyinbo, P. (2008). Impact of the Good Behavior Game, a universal classroom-based behavior intervention, on young adult service use for problems with emotions, behavior, or drugs or alcohol. Drug and Alcohol Dependence, 95(Suppl 1), 29-44.

Certified Wilcox, H. C., Kellam, S. G., Brown, C. H., Poduska, J. M., Ialongo, N. S., Wang, W., & Anthony, J. C. (2008). The impact of two universal randomized first- and second-grade classroom interventions on young adult suicide ideation and attempts. Drug and Alcohol Dependence, 95(Suppl 1), S60-S73.

Study 2

van Lier, P., Muthen, B., van der Sar, R., & Crijnen, A. (2004). Preventing disruptive behavior in elementary schoolchildren: Impact of a universal classroom-based intervention. Journal of Consulting and Clinical Psychology, 72(3), 467-478.

van Lier, P., Vuijk, P., & Crijnen, A. (2005). Understanding mechanisms of change in the development of antisocial behavior: The impact of a universal intervention. Journal of Abnormal Child Psychology, 33(5), 521-535.

Vuijk, P., van Lier, P., Crijnen, A., & Huizink, A. (2007). Testing sex-specific pathways from peer victimization to anxiety and depression in early adolescents through a randomized intervention trial. Journal of Affective Disorders, 100(1), 221-226.

Study 3

Furr-Holden, C. D. M., Ialongo, N. S., Anthony, J. C., Petras, H., & Kellam, S. G. (2004). Developmentally inspired drug prevention: Middle school outcomes in a school-based randomized prevention trial. Drug and Alcohol Dependence, 73(2), 149-158.

Ialongo, N. S., Werthamer, L., Kellam, S. G., Brown, C. H., Wong, S., & Lin, Y. (1999). Proximal impact of two first-grade preventive interventions on the early risk behaviors for later substance abuse, depression, and antisocial behavior. American Journal of Community Psychology, 27(5), 599-641.

Ialongo, N., Poduska, J., Werthamer, L., & Kellam, S. (2001). The distal impact of two first-grade preventive interventions on conduct problems and disorder in early adolescence. Journal of Emotional and Behavioral Disorders, 9(3), 146-160.

Storr, C. L., Ialongo, N. S., Kellam, S. G., & Anthony, J. C. (2002). A randomized controlled trial of two primary school intervention strategies to prevent early onset tobacco smoking. Drug and Alcohol Dependence, 66(1), 51-60.

Wang, Y., Browne, D. C., Petras, H., Stuart, E. A., Wagner, F. A., Lambert, S. F., . . . Ialongo, N. S. (2009). Depressed mood and the effect of two universal first grade preventive interventions on survival to the first tobacco cigarette smoked among urban youth. Drug and Alcohol Dependence, 100(3), 194-203.

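Study 4

Barrish, H. H., Saunders, M., & Wolf, M. M. (1969). Good behavior game: Effects of individual contingencies for group consequences on disruptive behaviors in a classroom. Journal of Applied Behavior Analysis, 2(2), 119-124.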

Study 5

Medland, M. B., & Stachnik, T. J. (1972). Good-Behavior Game: A replication and systematic analysis. Journal of Applied Behavior Analysis, 5(1), 45-51.

Study 6

Mihalic, S., Huizinga, D., & Ladika, A. (2011). An evaluation of the Good Behavior Game intervention. Robert Wood Johnson Foundation, Princeton, NJ.

Study 7

Leflot, G., van Lier, P. A., Onghena, P., & Colpin, H. (2013). The role of children's on-task behavior in the prevention of aggressive behavior development and peer rejection: A randomized controlled study of the Good Behavior Game in Belgian elementary classrooms. Journal of School Psychology, 51(2), 187-199.

Study 8

Witvliet, M., van Lier, P. A. C., Cuijpers, P., & Koot, H. M. (2009). Testing links between childhood positive peer relations and externalizing outcomes through a randomized controlled intervention study. Journal of Consulting and Clinical Psychology, 77(5), 905-915.

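Study 9

Mitchell, R. R., Tingstrom, D. H., Dufrene, B. A., Ford, W. B., & Sterling, H. E. (2015). The effects of the Good Behavior Game with general-education high school students. School Psychology Review, 44(2), 191-207.

Study 10

Humphrey, N., Hennessey, A., Ashworth, E., Frearson, K., Black, L., Petersen, K., . . . Pampaka, M. (2018). Good Behaviour Game: Evaluation report and executive summary. London: Education Endowment Foundation.

Ashworth, E., Humphrey, N., & Hennessey, A. (2020). Game over? No main or subgroup effects of the Good Behavior Game in a randomized trial in English primary schools. Journal of Research on Educational Effectiveness, 13(2), 298-321. doi:10.1080/19345747.2019.1689592

Ashworth, E., Panayiotou, M., Humphrey, N., & Hennessey, A. (2020). Game on-complier average causal effect estimation reveals sleeper effects on academic attainment in a randomized trial of the Good Behavior Game. Prevention Science, 21(2), 222-233. doi:10.1007/s11121-019-01074-6

Troncoso, P., & Humphrey, N. (2021). Playing the long game: A multivariate multilevel non-linear growth curve model of long-term effects in a randomized trial of the Good Behavior Game. Journal of School Psychology, 88, 68-84.

Study 11

Wilcox, H. C., Petras, H., Brown, H. C., & Kellam, S. G. (2022). Testing the impact of the whole-day Good Behavior Game on aggressive behavior: Results of a classroom-based randomized effectiveness trial. Prevention Science, 23(6), 907-921.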

Study 1

Summary

Dolan et al. (1993), Kellam and Rebok (1992), and six other articles reported on a randomized controlled trial that examined 19 schools in five different urban areas of Baltimore City, Maryland. The schools were assigned to one of three conditions: the Good Behavior Game, Mastery Learning, or an external no-intervention control group. Each intervention school's classrooms and first-grade classroom teachers were randomly assigned to either an intervention or to serve as an internal control class (receiving no intervention). The full trial involved 42 classrooms in 19 schools. A cohort of 1,084 first-grade children was assessed at baseline from among 1,197 children available for participation. The short-term impact of the GBG on aggression and shyness was assessed at posttest, then annually through middle school, and the long-term impact of the GBG was examined in a 14-year follow-up study. While the first cohort of students continued implementation in Grade 2, a new cohort of students entered the first grade and was also assigned to conditions for evaluation purposes. Cohort 2 consisted of 1,117 first-grade children.

The eight reports found the following:

At posttest (end of first grade), the GBG had a significant impact on aggressive and shy behavior for both males and females as rated by teachers (Dolan et al., 1993).
Peer nominations of aggressive behavior by their classmates were significantly reduced at posttest for males (Dolan et al., 1993).
At posttest, for peer and teacher ratings, the more severely aggressive children responded the most to GBG (Dolan et al., 1993).
At the six-year follow-up, there were no main program effects for the total population of males or females, but the effect of GBG on aggression grew stronger with higher baseline aggression among males at or above the median on aggression in first grade (Kellam et al., 1994).
At 14-year follow-up, GBG, compared to internal controls and all controls, had a significant impact on lifetime alcohol abuse/dependence and antisocial behavior among all youth, and on smoking and lifetime illicit drug use among males (Kellam et al., 2008).
In young adulthood, Cohort 2 GBG males were less likely to have used any services or mental and medical health services (Poduska et al., 2008).
GBG students in Cohort 1 at ages 19-21 were significantly less likely to have experienced suicide ideation as compared to those in the control group, and mixed effects, depending upon the model used, were found for suicide attempts (Wilcox et al., 2008).
Cohort 1 and 2 GBG males in the high aggression trajectory showed significantly lower slopes of aggressive, disruptive behavior, sustained through 7th grade; for females in the aggression trajectory, the effect was sustained through grade 4 (Petras et al., 2008).
Cohort 1 GBG males in the high aggression trajectory showed lower rates of antisocial personality disorder and violent and criminal behavior by young adulthood compared to controls (Petras et al., 2008).
Cohort 1 GBG males in the high aggression trajectory, as compared to controls, had a higher prevalence of lifetime condom use, later initiation of vaginal sex, and lower prevalence of a lifetime high-risk sexual behavior composite score by ages 19-21 (Kellam et al., 2014).

Evaluation Methodology

Design: (Dolan et al., 1993; Kellam & Rebok, 1992).
Nineteen schools in five different urban areas of Baltimore City, Maryland, were selected for the program during the 1985-86 academic year. The areas varied in ethnicity, SES, type of housing, family structure, and stability. Three or four similar schools (matched on students' achievement levels, family SES, and ethnicity) within each of the urban areas were assigned to one of three conditions: the Good Behavior Game (GBG) intervention, the Mastery Learning (ML) intervention (a strengthened reading curriculum), and the remaining school(s) to an external control condition with no intervention. In order to control for spillover effects, each intervention school's classrooms were randomly assigned either to the intervention or to serve as internal control classes (receiving no intervention). Children entering first grade were randomly assigned to classrooms. Teachers were also randomly selected and assigned. A cohort of 1,197 first-grade children was enrolled, and 1,084 students were assessed at baseline. Originally, there were 42 classrooms, although later reports state 40 and 41 classrooms (depending upon the source).

To be included in the analyses, all students had to remain in the same design condition for the entire year. For the Good Behavior Game condition, a total of 501 students met this criterion (182 students from 8 GBG classrooms; 107 students from 6 internal control classrooms; 212 from 12 external control classrooms). (For Mastery Learning, 575 students from 9 ML classrooms and 7 ML internal control classrooms remained in the same design condition for 1 year.)

6th Grade Follow-up (Kellam et al., 1994): Of a total of 1,084 children assessed at baseline, 693 children participated in the same intervention or control condition for two years, and of those 693, 590 children were assessed at the six-year follow-up. Comparisons between children assessed and those not assessed showed significant differences in teacher-rated, first grade aggressive behavior as well as several other variables.

14-year Follow-up (Kellam et al., 2008): Participants were contacted again at ages 19-21 for a 14-year follow-up. Participants completed a 90-minute telephone interview for which they received $50 and a t-shirt. Interviewers were blind to the intervention condition of the respondents. Only those who had participated as GBG intervention students or internal GBG controls were contacted (n=922); this includes the 238 GBG children who were assigned to 8 GBG classes in 6 schools and 169 internal GBG controls from 6 classes in 6 schools. Of the original 922 randomized students, 689 completed the 14-year follow-up (75% retention).

The follow-up was also conducted with a second cohort. While the first cohort was in the 2nd year of the GBG intervention, incoming first-graders were assigned to classrooms in the same manner as Cohort 1. However, the overall intervention emphasized training and supervising the second-grade teachers who were implementing the program with the first cohort. Therefore, teachers who had implemented the program with the first cohort in first grade were not retrained. There are no baseline or posttest data available for the second cohort.

Sample Characteristics: The original sample consisted of 1,197 children assigned to one of three conditions: the Good Behavior Game, Mastery Learning, or the external control condition. The children entered the first grade during the 1985-1986 academic year and ranged in age from 5 to 9.7 years (mean age of 6.6 years). The sample was 49% male. Sixty-five percent of the sample was African American, 31% Caucasian, and 4% represented other ethnic groups.

Measures: Assessments of children's social adaptational status include teachers' ratings of how well each child was meeting classroom task demands and peer nominations of the children's status in regard to peer relationships. Structured interviews with teachers were held at baseline (Fall 1985) and at each follow-up (conducted in Spring 1986, Fall 1986, Spring 1987, Spring 1988, Spring 1989, Spring 1990, and Spring 1991) and utilized the Teacher Observation of Classroom Adaptation-Revised (TOCA-R). The TOCA-R assesses children's social contact/shy behavior (social participation in classroom processes), authority acceptance/aggressive behavior (how well the child accepts the teacher's and school's rules, and/or how often the student breaks rules, fights, and harms property), and concentration (readiness to work and paying attention). The Peer Assessment Inventory is a classroom-administered, modified version of the Pupil Evaluation Inventory (PEI) which allows children to rate their peers' aggressive behavior, shy behavior, and likability, as well as rejection and bullying (added after the first year). Assessment of students' psychological well-being included self-report first-stage measures of depressive and anxious symptoms. Direct observations of classroom behavior, as well as achievement test scores from the California Achievement Test were also used.

At the 14-year follow-up, measures were adjusted to be applicable to young adults. The Composite International Diagnostic Interview - University of Michigan version (CIDI-UM) was used to determine lifetime, past year, and past month occurrence of major depressive disorder, generalized anxiety disorder, illicit drug abuse/dependence, and alcohol abuse/dependence. Researchers added measures for regular use of tobacco, including number of cigarettes used per day, and the instrument allowed researchers to obtain lifetime rates of attention-deficit/hyperactivity disorder, conduct disorder, and antisocial personality disorder. The Young Adult's Educational History instrument was also used to gauge highest level of schooling obtained, the number of repeated grades from K-12, how well the respondent performed overall in school, whether they were currently in school/training and, if so, how well they were performing, and the nature of the educational program they were currently attending.

Analysis: The first year impact was tested with analyses of covariance (ANCOVA). The six-year impact of the GBG was tested by gender, also using ANCOVA, first comparing the total male children and then the total female children who received GBG to the GBG and ML internal controls, the external controls, and the ML children. At the fourteen-year follow-up, several tests were run to try to account for blocking of intervention classrooms within schools, including two-level modeling of child and classroom-level effects.
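As a rough illustration of what an analysis of covariance estimates, the sketch below fits an outcome score on a baseline score plus a treatment indicator, so that the treatment coefficient is the group difference adjusted for baseline. The scores are made up for illustration and are not trial data, the function names are hypothetical, and using baseline score as the single covariate is an assumption; the published analyses may have used different covariates and software.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ancova_fit(pre, post, treated):
    """Least-squares fit of post = b0 + b1*pre + b2*treated; returns [b0, b1, b2]."""
    rows = [(1.0, float(p), float(t)) for p, t in zip(pre, treated)]
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, post)) for i in range(3)]
    return solve_linear(xtx, xty)


# Synthetic, noise-free scores (NOT from the trial): the outcome is built with a
# known treatment effect of -0.8 so the fit can be checked against it.
pre = [1, 2, 3, 4, 5, 6]
treated = [0, 1, 0, 1, 0, 1]
post = [1.0 + 0.5 * p - 0.8 * t for p, t in zip(pre, treated)]

intercept, slope, adjusted_effect = ancova_fit(pre, post, treated)

if __name__ == "__main__":
    print(f"baseline-adjusted treatment effect: {adjusted_effect:.3f}")
```

Because the synthetic outcome has no noise, the fitted treatment coefficient recovers the built-in effect; with real ratings the estimate would come with a standard error and a significance test.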

Outcomes

Baseline Equivalence and Attrition (Kellam et al., 2008): Taking school into account as a random factor, there were no significant differences between intervention and control groups on teacher-rated aggressive behavior, fall-of-first-grade achievement, or free or reduced-price school lunch. Without using random effects of school or classroom, groups were quite comparable, with the exception of a significant effect for depressive symptoms. There were no significant differences on teacher-rated aggression, shyness or concentration problems, self-reported anxiety or depression, reading and math achievement, the proportion receiving free or reduced-price school lunch, or on classroom size.

Five percent of parents refused to have their children assessed by the TOCA-R, and 10% of the sample was missing baseline teacher ratings. Of these, 10% were internal GBG controls and 3% were from GBG intervention classrooms, a significant difference between the two groups. The vast majority of the missing data came from a single school. The 826 students who received teachers' ratings consisted of 231 (97%) of 238 children assigned to the GBG intervention, 149 (88%) of the 169 in the internal GBG control classrooms, 191 (93%) of the 205 assigned to internal ML control classrooms, and 255 (82%) of 310 external controls. External controls had significantly higher attrition rates at this point. Adjusting for school at the end of first grade, rates of completion of teachers' ratings were not significantly different from baseline for GBG (78% completion) and internal GBG controls (71% completion).
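The counts of children with baseline teacher ratings can be checked with a few lines of arithmetic. The sketch below uses only the figures reported above; the group labels and variable names are informal choices made for this illustration.

```python
# (children with baseline teacher ratings, children assigned), as reported in the text.
rated_and_assigned = {
    "GBG intervention": (231, 238),
    "internal GBG control": (149, 169),
    "internal ML control": (191, 205),
    "external control": (255, 310),
}


def completion_table(groups):
    """Return {group: percent rated}, rounded to whole percents."""
    return {
        name: round(100 * rated / assigned)
        for name, (rated, assigned) in groups.items()
    }


total_rated = sum(rated for rated, _ in rated_and_assigned.values())
total_assigned = sum(assigned for _, assigned in rated_and_assigned.values())
percent_rated = completion_table(rated_and_assigned)

if __name__ == "__main__":
    for name, pct in percent_rated.items():
        print(f"{name}: {pct}%")
    print(f"total rated: {total_rated} of {total_assigned}")
```

The group counts sum to the 826 rated students, and the rounded percentages match the 97%, 88%, 93%, and 82% given in the text.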

Posttest (Dolan et al., 1993): At the end of the first grade, GBG significantly reduced aggressive behavior, as rated by teachers, for males compared to external controls, and for females compared to internal controls. Shy behavior as rated by teachers at the end of the first grade was significantly reduced for males compared to internal controls, and for females compared to both external and internal controls. Peer nominations of aggressive behavior by their classmates were also significantly reduced for GBG boys compared to internal controls, but not external controls. Peer nominations for girls showed no significant differences. GBG did not impact peer nominations of shy behavior. For both peer and teacher ratings, it appears that the more severely aggressive children responded the most to the Good Behavior Game, but this appears to be based upon a very small number of children.

Six-Year Follow-up (Kellam et al., 1994): At the six-year follow-up, 590 of the 693 children who participated in the same intervention or control condition for two years were assessed. The characteristics of children at pre-intervention who had been assessed at each follow-up were compared with those who were not assessed each time. Significant differences in baseline scores between the two groups were found for teacher-rated aggressive behavior. The sample for the GBG treatment was 153 from eight classrooms; the GBG internal control condition was 86 from six classrooms; and the sample for the external control condition was 157 from eleven classrooms.

There were no main program effects of GBG compared to the combined control group for the total population of males or for the total population of females. Next, impact as a function of the level of pre-existing aggression was tested. At sixth grade, the effects of GBG increased significantly as the level of aggression in the fall of first grade rose, but only among males, and only among males at or above the median on aggression in first grade. There were no subgroup effects for females, perhaps because teacher ratings reveal significantly lower levels of aggression for females than for males in all six years. Although Kellam et al. had hypothesized that GBG students who were not aggressive in the fall of first grade would exhibit a lower incidence of aggressive behavior over the six years than non-aggressive children in the control or ML groups, regression analysis refuted this hypothesis: those students receiving GBG who were not aggressive at the start were not protected from becoming aggressive.

14-Year Follow-up (Kellam et al., 2008): Females were significantly more likely than males to be interviewed at follow-up and rates of follow-up differed by first-grade urban area. Intervention status was not related to attrition, nor were baseline teacher-ratings, self-reported psychiatric symptoms, free lunch status, and achievement scores. There was also no difference, controlling for school, between intervention and control groups on overall missing data. Of the 689 (75%) students retained at 14-year follow-up, there were 126 of 169 randomized internal GBG controls, 183 of 238 randomized GBG intervention students, 227 of 310 external controls, and 153 of 205 internal ML controls.

In the analyses which adjust for classroom effects, GBG young adult males in Cohort 1 reduced their lifetime drug abuse/dependence significantly more than male internal controls, who were at 2.7 times greater risk for drug abuse/dependence. There was no effect on this outcome for females. The program reduced the odds of a lifetime alcohol abuse/dependence diagnosis by 50% for GBG Cohort 1 male and female students combined. The finding for males on lifetime drug abuse/dependence was replicated in the second cohort of students, but not for alcohol abuse/dependence.

Controlling for classroom effects and individual-level risk, GBG revealed a significant reduction in the probability of males smoking greater than or equal to 10 cigarettes/day, and the effect of GBG is greater for boys with higher levels of aggressive, disruptive behavior in the first grade. There is no effect for girls and no effects in the second cohort. Examining smoking outcomes with within-school variation and small-sample testing revealed non-significant effects for males and females combined, while reductions in risk for smoking were marginally significant for intervention males versus internal control males (p = .06) and for highly aggressive males and females combined (p = .06). Program effects on high-risk males were significant.

Overall rates of antisocial personality disorder (ASPD) were significantly lower for youth in the GBG groups (17%) than they were for internal controls (25%) and all controls (25%). The GBG benefited students at both the upper and lower ends of aggressive behavior at baseline. Within-school variation and small-sample testing did not reveal a significant association between the program and a lower probability of ASPD for males and females combined or for males only or females only.

There were no program effects on high school graduation, lifetime generalized anxiety disorder, or lifetime major depressive disorder.

Cohort 2: Although these results are reported above, they are repeated here because only one of the outcomes was replicated in the second cohort. Marginally significant program effects on drug abuse/dependence were found for GBG males, who were less likely to report drug abuse/dependence issues than internal GBG controls (p = .10). Program effects on alcohol abuse, smoking, and antisocial personality disorder were not significant and, thus, not replicated.

Service Use at Ages 19-21 (Poduska et al., 2008): The Services Assessment of Children and Adolescents (SACA) was conducted at the young adult follow-up. The SACA obtains information on child and adolescent mental health service utilization, past and present use of mental health and educational services, including the setting (e.g., outpatient, inpatient, school-based, primary care, juvenile justice system). These young adults were asked if they had received services from about 26 different providers for problems with behaviors, feelings, drugs or alcohol. Cumulative use of services was assessed in this paper. The sample for this analysis contained 689 (75%) of the 922 students in Cohort 1 and 656 (76%) of the 867 students in Cohort 2 who were in intervention conditions relevant to this paper (GBG, GBG internal control, ML internal control, and external controls). Overall, GBG males were less likely to have used any services, as well as mental health and medical services (Cohort 2 only). The only other significant finding was that GBG males and females combined were less likely to use social services than internal controls (Cohort 1 only). School-based, drug treatment, and juvenile justice services showed no significant main effect differences.

Antisocial personality disorder (ASPD) and violent and criminal behavior by age 19-21 were examined (Petras et al., 2008). Self-reports and juvenile court and adult incarceration records were included in the young adult follow-up. This paper reported on the impact of the GBG on the course of aggressive, disruptive behavior from first to seventh grade and on the young adult outcomes of ASPD and criminal behavior in young adulthood (ages 19-21). Aggressive, disruptive behavior was measured using the TOCA-R. ASPD diagnosis was measured using a scale developed and administered to determine whether the participant met the DSM-IV criteria for ASPD including a history of conduct disorder. Juvenile court and adult incarceration records were obtained at the time of the young adult follow-up and yearly searches were conducted thereafter. Records of incarceration for an offense classified as a felony were used as an indicator of violent and criminal behavior as an adult. Four categories were developed from the ASPD diagnosis and the records of violent and criminal behavior: an ASPD diagnosis and criminal behavior, only a diagnosis of ASPD, only a record of violent and criminal behavior, or neither outcome by young adulthood.
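The four-way outcome classification described above amounts to crossing two yes/no indicators. The following Python sketch illustrates that logic only. The function name, argument names, and category labels are hypothetical and are not taken from the study's materials.

```python
def outcome_category(has_aspd_diagnosis: bool, has_felony_record: bool) -> str:
    """Classify a participant into one of the four young adult outcome groups.

    has_aspd_diagnosis: met the ASPD criteria at the young adult follow-up.
    has_felony_record: has a juvenile court or adult incarceration record
        used as the indicator of violent and criminal behavior.
    Both argument names and the returned labels are illustrative assumptions.
    """
    if has_aspd_diagnosis and has_felony_record:
        return "ASPD and violent/criminal behavior"
    if has_aspd_diagnosis:
        return "ASPD only"
    if has_felony_record:
        return "violent/criminal behavior only"
    return "neither outcome"


# Enumerate all four combinations of the two indicators.
for aspd in (True, False):
    for record in (True, False):
        print(aspd, record, "->", outcome_category(aspd, record))
```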

In general, three trajectories of aggressive, disruptive behavior were noted for males: the persistent high group started with high levels of aggressive, disruptive behavior that rose through third and fourth grade, and then decreased somewhat by the end of seventh grade; the escalating medium group demonstrated initial moderate levels of aggressive, disruptive behavior that increased over elementary school; and the stable low group started and remained at a low level of aggressive behavior. Girls followed similar trajectories with the second trajectory being a persistent medium group (rather than an escalating medium group). In Cohort 1, boys in the persistent high group displayed a lower slope on aggressive, disruptive behavior compared to controls (sustained until the fourth grade); significant reductions in the prevalence of ASPD; and significantly fewer had both ASPD and a record of violent criminal behavior or ASPD alone compared to control boys in the persistent high group. Cohort 1 females in the persistent high group showed a significantly lower slope than the persistent high control females, but this effect was not sustained through seventh grade. The GBG did not significantly reduce the risk for later ASPD for females compared to controls in any of the three groups. For Cohort 2 boys, there was a significant overall lower growth of aggressive disruptive behavior in the persistent high and the stable low groups compared to controls. Cohort 2 boys in the escalating medium and stable low groups showed an increase in the prevalence of ASPD compared to controls (a negative effect), although this finding did not differ significantly from no effect when tested against a model of zero impact. There were no significant program effects for girls in Cohort 2.

Young Adult Suicide Ideation and Attempts (Wilcox et al., 2008): Survival analysis methods were used to estimate the relative hazard of suicide ideation (SI) and attempt for both ML and the GBG, followed by covariate adjustment for gender, race, baseline aggressive, disruptive behavior, clinical features of depression and anxiety, parental suicidality and psychiatric disturbance. All survival analyses measured the time to first suicide ideation (or attempt) since entry into first grade. For those who did not report suicidality, the censoring time was age at last assessment. Students in Cohort 1, but not Cohort 2, were significantly less likely to have experienced SI as compared to those in the control group. There were mixed effects for suicide attempts, with the overall summary relative risk estimate significant and one of the three models with covariates significant. The overall summary relative risk estimate for occurrence of SI indicates a significant inverse (protective) association with assignment to the GBG intervention, as well as a consistent and robust inverse association of the GBG intervention with SI across the series of models adjusting for gender, race, baseline levels of aggressive, disruptive behavior, depression, and anxiety as measured in the fall of first grade, caregiver suicidality and mental illness. The overall summary relative risk estimate for occurrence of SI for Cohort 1 is consistent with a significant, inverse association with assignment to the GBG. Caregiver SI had a strong and consistent association with offspring's suicide ideation, especially among males, and retrospective reports on caregiver mental illness and suicide threats were also related to offspring suicide attempts by young adulthood.
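The time-to-event setup described above (time to first ideation or attempt, with censoring at the last assessment for those reporting no suicidality) can be sketched as follows. This is a minimal illustration, not the authors' procedure. The function name, the use of years as the time unit, and the assumed first-grade entry age of 6.0 are all assumptions introduced for the example.

```python
from typing import Optional, Tuple

# Assumed age at entry into first grade, used only to put both kinds of
# record on a common "years since first grade" scale (an assumption).
ASSUMED_ENTRY_AGE = 6.0


def survival_record(
    age_at_first_event: Optional[float],
    age_at_last_assessment: float,
    entry_age: float = ASSUMED_ENTRY_AGE,
) -> Tuple[float, bool]:
    """Return (duration, event_observed) for one participant.

    If an event (first ideation or attempt) was reported, the duration runs
    from first-grade entry to the age at that event. Otherwise the record is
    censored at the age at last assessment.
    """
    if age_at_first_event is not None:
        return (age_at_first_event - entry_age, True)
    return (age_at_last_assessment - entry_age, False)


# Hypothetical participants: one reporting a first event at age 15,
# one reporting no event and last assessed at age 20.5.
print(survival_record(15.0, 20.0))
print(survival_record(None, 20.5))
```

Records in this (duration, event flag) form are the usual input to survival estimators such as Kaplan-Meier curves or proportional hazards models.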

Risky Sexual Behaviors at Ages 19-21 (Kellam et al., 2014): Measures of risky sexual behavior included lifetime use of condoms, lifetime number of sexual partners, and age of initiation of vaginal sex. Condom use was defined as a response of "always", "most of the time", or "often" for condom use during vaginal sex. Not using a condom was defined as a response of "never", "rarely", or "sometimes." A composite high-risk sexual behavior score was defined as meeting at least one of the following criteria: a) age of 13 years or younger for vaginal sex initiation, b) having ten or more lifetime vaginal sex partners, or c) not using condoms (i.e., responded "never", "rarely", or "sometimes" on lifetime condom use). General growth mixture models were used to examine the long-term impact of GBG on risky sexual behaviors, and whether the intervention impact varied by the individual's initial and developmental course of aggressive, disruptive behavior. Overall, no significant differences were found between the GBG and control groups on any of the risky sexual behaviors. However, for Cohort 1 males in the persistent high group of aggressive, disruptive behavior (described above), the GBG males, in comparison to the control group males, had a higher prevalence of lifetime condom use (89.7% vs. 40.8%, p = .012), later initiation of vaginal sex (14.4 years vs. 12.3 years, p = .035), and a lower prevalence of the lifetime high-risk sexual behavior composite (52.5% vs. 89.1%, p = .036). Among the persistent high-group males, GBG reduced the prevalence of lifetime drug abuse and dependence disorders in young adulthood. There was no significant impact for females.
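The composite high-risk indicator defined above is a logical "or" over three criteria. The Python sketch below illustrates that rule under stated assumptions: the function and argument names are hypothetical, and treating a missing age of initiation as not meeting criterion (a) is an assumption of this example rather than something the source specifies.

```python
from typing import Optional

# Responses counted as "not using a condom", per the definition above.
NON_USE_RESPONSES = {"never", "rarely", "sometimes"}


def high_risk_composite(
    age_at_first_vaginal_sex: Optional[float],
    lifetime_partners: int,
    condom_use_response: str,
) -> bool:
    """Return True if at least one of the three risk criteria is met."""
    # (a) initiation at age 13 or younger; a missing age is treated as
    # not meeting this criterion (an assumption of this sketch).
    early_initiation = (
        age_at_first_vaginal_sex is not None and age_at_first_vaginal_sex <= 13
    )
    # (b) ten or more lifetime vaginal sex partners.
    many_partners = lifetime_partners >= 10
    # (c) lifetime condom use reported as never, rarely, or sometimes.
    no_condom_use = condom_use_response.strip().lower() in NON_USE_RESPONSES
    return early_initiation or many_partners or no_condom_use


# Hypothetical respondents.
print(high_risk_composite(12, 2, "always"))      # meets criterion (a)
print(high_risk_composite(16, 3, "often"))       # meets none
print(high_risk_composite(17, 1, "Sometimes"))   # meets criterion (c)
```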

Study 2

This study measures the overall impact of the Good Behavior Game on developmental trajectories in young elementary school children with attention-deficit/hyperactivity (ADH), oppositional defiant disorder (ODD), and conduct problems in the Netherlands. Three trajectories were identified: children with high, intermediate, or low levels of problems on all three disruptive behaviors at baseline.

In the first intervention year, the GBG was implemented in three different stages. In the introduction stage, the GBG was played three times a week for approximately 10 minutes, with the goal of acquainting children and teachers with the GBG; the introduction stage lasted for about 2 months. In the expansion stage, teachers were encouraged to expand the duration of the GBG (up to three 1-hour sessions per week), expand the settings in which the GBG was played, and expand the behaviors targeted by the GBG. Rewards were delayed until the end of the week and month; the expansion stage lasted until the early spring of the school year. In the final stage, the generalization stage, attention was focused on promoting prosocial behavior outside GBG moments by explaining to children that the rules used during the GBG were also applicable when the game was not in progress. Children received compliments from their teachers for appropriate behavior. The GBG sessions were used as a booster. The same three stages were used in the second intervention year; however, because children were already familiar with the GBG, teachers moved swiftly to the expansion and generalization stages. There were two adaptations to the GBG: teams did not compete for weekly winners, and children violating the GBG rules were not mentioned by teachers.

Summary

Van Lier et al. (2004) and four other articles reported on an independent evaluation in the Netherlands. Classes within each of 13 schools were randomly assigned to either the intervention or control condition. Seven hundred forty-four children were eligible for inclusion; parents provided consent for 666 (89.5%). Of the 31 classes in the 13 schools, 16 became intervention and 15 became control classes, resulting in 363 children in the GBG program and 303 children in the control classes.

The results showed the following:

Among children classified with high levels of ADH/ODD problems, GBG children showed a marginally significant decrease in conduct problems compared to controls at the end of third grade.
Among children classified with intermediate levels of ADH/ODD problems, GBG children had a significantly better development than their control group counterparts on ADH, ODD, and conduct problems at the end of third grade.
Among children on a high antisocial behavior trajectory, GBG children, compared to controls, had large reductions at age 10 in the level of antisocial behavior.
GBG reduced levels of physical and relational victimization at age 10, as well as depression, anxiety, and panic agoraphobia at age 13.
GBG lowered the onset of tobacco use from age 10 to 13 years and reduced growth in weekly alcohol use, but not monthly or yearly alcohol use.

Evaluation Methodology

Design: In the spring of 1999, 13 schools in the metropolitan areas of Rotterdam and Amsterdam, the Netherlands, were recruited into the study. At the start of the trial, each of the 13 schools had at least two Grade 1 classes. During the summer vacation between first and second grade, classes within each school were randomly assigned to either the intervention or control condition. Seven hundred forty-four children were eligible for inclusion; parents provided consent for 666 (89.5%). Of the 31 classes in the 13 schools, 16 became intervention and 15 became control classes, resulting in 363 children in the GBG program and 303 children in the control classes. Shortly after summer vacation, teachers were instructed about the GBG intervention, which started in the fall of Grade 2. There was some contamination after the first year of the intervention when one school realigned classes, resulting in 17 children moving from a control class to an intervention class. In the analysis, they were included in the intervention group.

Sample: The final sample of children was described as 69% Caucasian, 10% Turkish, 9% Moroccan, 5% Surinam-Dutch Antilles, and 7% from other ethnic groups. Fifty-one percent of the children were male and their mean age was 6.9 years. Ninety-two children were lost to follow-up. Losses were not related to gender or to intervention status of the child. However, the children lost to follow-up had higher teacher-rated ADH problems, ODD problems, and conduct problems at baseline. The available data for these children were included in the analyses.

Measures: Children's problem behaviors were rated with the Teacher's Report Form (TRF), which contains a list of 120 behavior items. Problem behavior at school was assessed with the Problem Behavior at School Interview (PBSI). The PBSI is a 32-item teacher interview that assesses disruptive behavior and shy-withdrawn behavior in children. The ADH Problems Scale measures ADH behavior and consists of eight items. Items include "This child has difficulty with concentration," "This child is impulsive," and "This child finds it hard to sit still." The ODD Problems Scale consists of eight items, which include "This child argues frequently" and "This child disobeys teachers' instructions." The Conduct Problems Scale consists of 13 items, which include "This child fights," "This child attacks other children physically," and "This child is truant."

Teacher assessments at baseline were conducted in the spring (T1) and early summer (T2) of Grade 1. During intervention, a 12-month assessment (T3; end of 1st year of intervention), 18-month assessment (T4), and 24-month assessment (T5; end of 2nd year of intervention) were conducted. At the preintervention (T1 and T2), 12-month (T3), and 24-month assessments (T5), the TRF/6-18 was completed for all students by the teachers. At the 18-month and 24-month assessments, teachers were interviewed at school with the PBSI by trained research assistants. Interviews were completed for all children attending these teachers' classes.

Analysis: The overall effect of the GBG intervention on the developmental trajectory of ADH problems was determined in a multiple group analysis. Growth mixture modeling (GMM) was used to determine the number of developmental trajectories needed to describe the data in the control and intervention groups separately. Lastly, to analyze the effects of the intervention on the development of attention-deficit/hyperactivity (ADH) problems, oppositional defiant disorder (ODD) problems and conduct problems, GMM was incorporated into a more general framework, general growth mixture modeling. In this framework, the slope of the developmental trajectories is regressed on intervention status.
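As a rough illustration of what "the slope of the developmental trajectories is regressed on intervention status" means, a general growth mixture model with linear growth can be written as below. This is a generic textbook-style sketch, not the authors' exact specification. The notation, the linear time coding $a_t$, and the symbol $G_i$ for intervention status are assumptions introduced here.

```latex
% Measurement model: observed problem score for child i at assessment t
y_{it} = \eta_{0i} + \eta_{1i}\, a_t + \varepsilon_{it}

% Within latent trajectory class k (c_i = k):
\eta_{0i} \mid (c_i = k) = \alpha_{0k} + \zeta_{0i}
\eta_{1i} \mid (c_i = k) = \alpha_{1k} + \gamma_{k}\, G_i + \zeta_{1i}

% G_i = 1 for GBG classes and 0 for control classes;
% \gamma_k is the class-specific intervention effect on the slope.
```

Under this sketch, a negative and significant $\gamma_k$ would indicate that intervention children in class $k$ show slower growth (or a steeper decline) in problem behavior than control children in the same class.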

Outcomes

The first analysis examined the overall impact of the program followed by analyses of this impact on groups of children differing in developmental trajectories of ADH problems. Multiple group analysis was used to assess whether there was an overall GBG intervention effect on the development of ADH problems. On average, children in the control classes followed a significantly different developmental trajectory of teacher-rated ADH problems than did children in the intervention classes, with the control group characterized by an increase in the level of problems over the intervention period, and intervention children showing a decrease in the level of problems.

Next, classes of children following different trajectories were identified, with three classes found. Children in Class 1 had the highest probabilities of all children for having any ADH symptoms at baseline; children in Class 2 had intermediate probabilities, and children in Class 3 had the lowest probabilities. The impact of the intervention on conduct problems and ODD problems was then examined for each class at the end of the third grade (after intervention in grades 2 and 3).

Class 1 Children (High ADH/ODD Problems): Fourteen percent of the children were classified in Class 1, 78% of whom were boys. Class 1 children were characterized by high levels of ADH problems in Grade 1 (baseline), followed by a significant decrease in ADH problems over the intervention period. The regression coefficient of GBG on the slope was not significant, indicating that the decline in ADH problems was similar for control and intervention children. Children classified in Class 1 as having ADH problems also had the highest comorbid ODD and conduct problems in Grade 1. As found in children with ADH problems, the developmental trajectory of ODD problems was similar for intervention as for control group children. Although non-significant, there was a trend indicating lower levels of conduct problems for intervention children.

Class 2 Children (Intermediate ADH/ODD Problems): One hundred seventy-six children (26%) were classified in Class 2, of whom 62% were boys. Class 2 children had intermediate levels of ODD problems and conduct problems at baseline. The coefficient of GBG on the slope of ADH problems was negative and significant, indicating that Class 2 intervention children had significantly better development than their control group counterparts. This indicates that the increase in levels of ADH problems found for Class 2 control children was not found in Class 2 children receiving the GBG intervention. The finding that Class 2 intervention children had a significantly different developmental trajectory on ADH problems was substantiated by their development on both conduct and ODD problems. Control group children showed an increase in levels of ODD problems and conduct problems, and this was not found in children in the GBG program.

Class 3 Children (Low ADH/ODD Problems): The remaining 398 children were in Class 3, of whom 42% were boys. The developmental trajectory of ADH problems in these children was characterized by low levels throughout the intervention period, and no differences were found between the intervention and control children. Class 3 children also had low levels of comorbid conduct problems and ODD problems, with no significant differences between control group children and children in the GBG program.

van Lier, Vuijk, and Crijnen, 2005

Six hundred sixty-four children were included in these analyses, as two of the 666 children were excluded because they moved away from a study school before participating in the peer assessments. Growth modeling was used to describe the development of antisocial behavior and friends' antisocial behavior for intervention and control children. MANOVA was used to test for overall differences as a function of gender and intervention.

In the analysis of the Netherlands sample, the GBG students had significantly lower mean scores than control students on peer-nominated relational bullying and self-reported victimization (overt and relational) at age 10, and on self-reported anxiety/depression problems at age 11. There were no differences on peer-nominated antisocial behavior or self-reported aggressive behavior.

Children were divided into high, moderate, and low antisocial behavior trajectories. Children who started off on the high antisocial behavior trajectory had large intervention-related reductions at ages 10 and 11 in the level of peer-nominated antisocial behavior and self-reported co-occurring behaviors (Cohen's d = 1.2, the size of the mean difference in antisocial behavior between control and intervention children). Also, the reductions in antisocial behavior were associated with fewer deviant friends and lower percentages of peer rejection. The GBG did not have an impact on antisocial behavior in the moderate and low antisocial development trajectories.
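For readers unfamiliar with the effect size reported above, Cohen's d expresses a difference between two group means in units of their pooled standard deviation. The Python sketch below shows the standard pooled-SD formula. The input values are hypothetical numbers chosen only to yield a d near 1.2 and are not data from the study, and the study itself may have computed d from model estimates rather than raw group statistics.

```python
import math


def pooled_sd(sd_a: float, n_a: int, sd_b: float, n_b: int) -> float:
    """Pooled standard deviation of two independent groups."""
    numerator = (n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2
    return math.sqrt(numerator / (n_a + n_b - 2))


def cohens_d(mean_control: float, sd_control: float, n_control: int,
             mean_gbg: float, sd_gbg: float, n_gbg: int) -> float:
    """Standardized mean difference (control minus intervention)."""
    return (mean_control - mean_gbg) / pooled_sd(sd_control, n_control, sd_gbg, n_gbg)


# Hypothetical group statistics, NOT taken from the study.
d = cohens_d(mean_control=2.0, sd_control=0.5, n_control=30,
             mean_gbg=1.4, sd_gbg=0.5, n_gbg=30)
print(f"Cohen's d = {d:.2f}")
```

By Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large), a d of 1.2 is a large effect.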

Vuijk, van Lier, Crijnen, and Huizink, 2007

The sample for this analysis assessing victimization at age 10 (grade 5) and depression symptoms at age 13 includes 448 children (60% retention from eligible sample). Loss to follow-up was not related to the child's gender or to intervention condition, but was related to being of non-Caucasian ethnicity and of low socio-economic status. The model controlled for the main effect of female sex and for levels of anxious/depressive problems and social problems at age 7. In the non-mediation model, GBG children had reduced levels of child-reported depression, generalized anxiety, and panic agoraphobia, as well as reduced levels of child-reported physical and relational victimization at age 10. When allowing for the indirect paths, the direct paths to depression, anxiety, and panic agoraphobia became non-significant, with the paths to depression and panic mediated through relational victimization, and the path to anxiety mediated through both relational and physical victimization.

van Lier, Huizink, and Crijnen, 2009

Assessments for substance use from age 10 to 13 were available for 525 children (71% retention from eligible sample). Loss was not related to the child's gender or intervention condition, but was related to being of low SES. Tobacco and alcohol use measures from age 10 to 13 years were available for 477 children. GBG children demonstrated lower probabilities of onset of tobacco use from age 10 to 13 years, but no effect on growth (slope). There was no significant effect on past year and past month alcohol use, but there was reduced growth in weekly use.

Study 3

Summary

The study examined the impact of the Good Behavior Game with added curricula enhancements (classroom-centered intervention). Three first-grade classrooms in each of the nine elementary schools (a total of 678 first-grade children) were randomly assigned to one of the three conditions: the classroom-centered intervention that included the GBG, the Family School Partnership, and the control classroom.

Compared to children in control classrooms:

GBG boys and girls demonstrated significantly fewer teacher-rated total problems in first and second grade.
GBG boys had significantly fewer peer nominations for aggression in first grade than control boys.
By the spring of sixth grade, GBG children were significantly less likely to have a lifetime diagnosis of conduct disorder, to have been suspended from school, and to have received or been judged in need of mental health services.
At grade 7, GBG students had a modest attenuation in the risk of smoking (26% vs. 33%).
At grade 8, GBG students were significantly less likely to report the onset of tobacco use.
At grade 8, GBG students were significantly less likely to have started use of cocaine powder, crack, or heroin (i.e., Other Illegal Drug Use Scale).
At age 19, GBG prolonged survival time to the first cigarette smoked.
At age 19, GBG students had higher reading and math scores, higher odds of high school graduation and college attendance, and lower odds of special education services.

Evaluation Methodology

Design: A randomized block design was employed in 1993, with schools serving as the blocking factor. Three first grade classrooms in each of the nine elementary schools (a total of 678 first grade children) were randomly assigned to one of the three conditions: the classroom centered (CC) intervention that included the GBG, the Family School Partnership (FSP), and the control classroom. The interventions were kept in place throughout the Grade 1 academic year.

The classroom-centered intervention consisted of three components: (1) curriculum enhancements; (2) enhanced behavior management practices with the use of the Good Behavior Game (GBG); and (3) back-up strategies for children not performing adequately. Each CC class was divided into three heterogeneous groups, which provided the underlying structure for the curriculum and behavioral components of the classroom intervention. An interactive read-aloud component was included to increase listening and comprehension skills. Reader's Theater and journal writing were added to increase composition skills, whereas the "Critique of the Week" was added to increase critical thinking skills. The existing mathematics curriculum was replaced with the Mimosa math curriculum, a whole language approach to the development of mathematics skills. In addition, the GBG was refined to include a focus on off-task and inattentive behaviors. The strategies employed with respect to academic non-responders included individual or small-group tutoring, and modifications in the curriculum to address individual learning styles.

The Family School Partnership intervention was designed to improve achievement and reduce early aggression, shy behavior, and concentration problems by enhancing parent-teacher communication and providing parents with effective teaching and child behavior management strategies. The major mechanisms for achieving those aims were (1) training for teachers and other staff members in parent-teacher communication and partnership building, (2) weekly home-school learning and communication activities, and (3) a series of nine workshops for parents led by the first-grade teacher and the school psychologist or social worker. The parent workshop series began immediately after the pretest assessments in the Fall of first grade and ran for seven consecutive weeks (one workshop per week). In addition to the workshops, a voice mail system was put in place in each school to sustain parent involvement, to facilitate parent-school communications, and to maintain collaborations around the children's learning or behavior management difficulties. To foster better family-school communication, teachers sent out weekly comment sheets to parents with take-home activities. Parents were asked to fill out and return the comment sheets indicating whether they had completed the assigned activities or encountered any problems in doing so. Children in the control classrooms received their schools' usual services. First grade teachers completed 60 hours of training and received certification in the CC and FSP to ensure fidelity. All intervention teachers attended monthly meetings to discuss common intervention issues and to receive ongoing support. Students were followed from first grade through age 19, and periodic assessments were conducted in Grades 1-3 and Grades 6-12 and at age 19 regarding a variety of mental health and academic outcomes.

Of the 653 children who had consent to participate in the evaluation, 91.3% (597) completed the Fall and Spring of first-grade assessments and remained in their assigned intervention condition over the first-grade year, and 88.5% (578) completed Spring of second grade assessments. At follow-up, 5, 6, and 7 years after randomization (sixth through eighth grades), approximately 84% (566/678) of the sample was available and was re-assessed in early adolescence (mean age 13 years). The majority of the 566 youth involved in the follow-up assessments completed all assessments. Of the 678 youth in the original cohort recruited in 1993, 566 participated in the sixth, seventh, or eighth grade assessment and 501 completed all outcome assessments. Only seven youths completed only the sixth grade assessment. There were no significant differences in rates of attrition between the intervention conditions, nor were there any between-group differences with respect to the sociodemographic characteristics of children with missing data. Finally, there were no between-group differences in pretest or baseline levels of academic achievement or in teacher and parent ratings of problem behaviors among the children with missing data in the Spring of first and second grades. At the six-year assessment, 549 (81%) of 678 originally recruited students participated. A total of 574 youths consented to a Grade 12 assessment. No significant differences were found in attrition or refusal rates between or across intervention conditions. There were no differences across the intervention conditions in the sociodemographic characteristics (ethnicity, gender, age, or free lunch status) associated with attrition at 12th grade or at the age 19 interview.

Sample: The sample ranged in age from 5.3 to 7.7 years with a mean age of 5.7 years. Fifty-three percent were male, 86% were African American, and 14% were of Euro-American heritage. Nearly two-thirds of the children (62%) received free or reduced lunch.

Measures: Children's social adaptational status on core task demands in the classroom was measured by the Teacher Observation of Classroom Adaptation-Revised (TOCA-R). The Parent Observation of Child Adaptation (POCA), designed as a counterpart to the TOCA-R, assessed adaptation to the demands of the family social field. The ten-item Peer Assessment Inventory (PAI) assessed adaptation to the demands of the classroom peer group. Two composite scores from the PAI were created using the bullying/victimization and participation/shy questions. Finally, the Comprehensive Test of Basic Skills (CTBS) was used to measure scholastic achievement.

Youth's self-reported use of tobacco was first assessed 6 years after the end of the intervention when they were 12 years old on average and then annually through approximately age 19. The audio computer-assisted self interview (ACASI) method was used to assess youth tobacco use.

The Structured Interview of Parent Management Skills and Practices-parent version (SIPMSP) was used to assess the major constructs included in the Oregon Social Learning Center model of the development of antisocial behavior in children. The SIPMSP subscales assessed (1) parental monitoring and supervision, (2) inconsistent discipline, (3) parental reinforcement and involvement, and (4) rejection of the child. In the spring of the sixth through eighth grades, students completed ACASI assessments developed for the National Household Survey on Drug Abuse with a focus on early-onset drug initiation. Students were asked whether or not they had started to use one or more of the following drugs or drug groups by the time of the final assessment in grade eight: tobacco, alcoholic beverages, inhalants, marijuana, or other illegal drugs such as cocaine and heroin.

In the spring of sixth grade, the Teacher Report of Classroom Behavior-Checklist Form (TRCB CF) was administered to English/language arts, reading, and math teachers in order to obtain teacher reports of child conduct problems in the school setting. The Conduct Disorder module of the Diagnostic Interview Schedule for Children (DISC-IV) was also used in the sixth grade to determine whether the youth met DSM-IV criteria for a lifetime diagnosis of Conduct Disorder based on youth and parent reports. In addition, the Service Assessment for Children and Adolescents Parent Report (SACA-P) was administered along with the DISC-IV to obtain lifetime information on child mental health service utilization. The School Mental Health Professional Report (SMHPR) checklist was administered to school mental health professionals to determine whether they had provided mental health services to a study participant in the last year. Finally, the SIPMSP was re-administered to parents in the spring of the sixth grade.

From grades 6-12, the Teacher Report of Classroom Behavior Checklist, an adaptation of the TOCA-R described above, was used to assess classroom behavior and academic performance. The Kaufman Test of Educational Achievement (Reading and Math) was administered in Grade 12. Special Education Service Use (Grades 1-12) and high school graduation records were provided by the district. At the age 19 interview, the youths were asked whether they had attended college (e.g., 4-year college, junior college).

Analysis: For the analysis of 1st and 2nd grade data, mixed model analysis of variance was used to determine the impact of the interventions on the early antecedent risk behaviors. Within these analyses, planned contrasts were performed between the CC intervention and control condition and the FSP intervention and control condition. School was treated as a completely randomized blocking factor, the classroom being the unit of randomization. To examine model fit in the presence of significant intervention by pretest level interactions, the mixed model analyses were followed up with nonlinear statistical analyses that used Lowess. The analyses of implementation impact were divided into (1) "intention to treat" analyses and (2) implementation/participation analyses. For the implementation/participation analyses, CC classrooms were identified as either high- or low-implementing based on scores obtained from a three-phase CC implementation measurement procedure. The analysis thus included three levels of treatment: (1) control, (2) CC low-implementation (CC-Low), and (3) CC high-implementation (CC-High). Planned comparisons were then carried out between the control and CC-High and -Low conditions. For the FSP implementation/participation analyses, two indices were used: (1) the number of core or Fall of first grade workshops attended and (2) the number of weekly take-home activities completed. For analytic purposes, both of these indices were divided into tertiles: High, Medium, and Low. For the tobacco outcomes, the analytic plan included an estimation of cumulative risk of starting to smoke tobacco, basic contingency table analyses, and the use of standard life tables and survival analysis methods to compare the estimated risk of starting to smoke tobacco across the study subgroup. Cox regression models for time-to-event data were used to estimate the impact of interventions on the risk of starting to smoke.
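The tertile split described above for the two FSP participation indices can be illustrated with a minimal sketch. It assumes cut points taken from Python's statistics.quantiles; the source does not report the exact cut points, the handling of ties, or the software used, and the function name and sample data are hypothetical.

```python
from statistics import quantiles

def assign_tertiles(values):
    """Label each value Low, Medium, or High using tertile cut points.

    Assumption: cut points come from statistics.quantiles(n=3); the
    original study does not say how cut points or ties were handled.
    """
    low_cut, high_cut = quantiles(values, n=3)
    labels = []
    for v in values:
        if v <= low_cut:
            labels.append("Low")
        elif v <= high_cut:
            labels.append("Medium")
        else:
            labels.append("High")
    return labels

# Hypothetical counts of take-home activities completed by nine families.
activities_completed = [1, 2, 3, 4, 5, 6, 7, 8, 9]
labels = assign_tertiles(activities_completed)
```

With heavily tied data (for example, many parents attending zero workshops), the three groups may be unequal in size, so a real analysis would need an explicit rule for ties.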

For sixth grade data, mixed (i.e., random effects regression) model analysis of variance was used in the case of interval-level variables, and logistic regression was used with the nominal outcomes. Seventh grade tobacco initiation was assessed using survival methods.

For the substance use outcomes assessed at grade 8, generalized estimating equations multivariate response profile regressions were used to estimate the relative profiles of drug involvement for intervention youths versus control. The analysis sequence began with initial cross-classifications and logistic regression estimates of intervention impact. Thereafter, a generalized estimating equation (GEE) multivariate response profile analysis model was used to express the cumulative occurrence of drug use through middle school as a function of intervention status and covariates, with the five following response variables in the multivariate response vector, all coded as 1 when the youth had used the listed drug by the time of the final assessment, and 0 when the youth had not used the listed drug: tobacco, alcohol without parental permission, inhalants, marijuana, and other illegal drug use.

At age 19, mixed model regression analyses were run to analyze academic outcomes. A discrete survival analysis was used to model survival to the first cigarette smoked.

Outcomes

Spring of 1st and 2nd Grades (Ialongo et al., 1999):

Teacher-rated problem behaviors: For CC boys' total problem behaviors as rated by teachers, the mixed model analyses yielded significant CC main effects in the Spring of first and second grades. In addition, the CC by baseline interaction proved significant in the Spring of second grade. Overall, CC boys were rated as having significantly fewer problem behaviors than control boys, the greatest benefit being in second grade and accruing to boys with moderate elevations in total problems at baseline. Main effects were also found for CC girls in the Spring of first and second grades. Teachers again rated CC girls as having fewer total problem behaviors than control girls in the Spring of first and second grades. Intervention effects on parent-rated problem behaviors were not significant for boys or girls.

Although there were no significant main or interaction effects for FSP boys' and girls' teacher-rated problem behaviors in the first grade, FSP boys and girls demonstrated significantly lower levels of teacher-rated problem behaviors than control children in the second grade.

Implementation analyses: Five of the nine CC classrooms were identified as high-implementation classrooms (scored more than 50% on a scale of implementation fidelity). Boys and girls in low-implementing classrooms were rated by teachers as having significantly more total problem behaviors than boys in high implementing classrooms. There was little relationship between achievement in the Fall of first grade and completion of take-home activities, except at the extreme ends of the distribution of achievement. The parents of children at the lowest end of the achievement distribution in the Fall of first grade tended to complete the fewest take-home activities, while the opposite was true for parents of children at the highest end of the achievement distribution. A similar relationship was noted for teacher ratings of boys' total problem behavior. Parents of boys with the highest ratings of problem behaviors attended the fewest workshops. For girls, there was a positive, linear relationship between achievement and take-home activity completion. In general, the higher the level of achievement, the more activities were completed. A modest negative relationship was found between teacher ratings of girls' problem behavior and parent workshop attendance.

The CC-High main effect for boys was significant in the Spring of first and second grades, with CC-High boys exhibiting fewer problem behaviors than control boys. Comparisons between the CC-Low and control conditions yielded a significant main effect in the Spring of first grade, but failed to reach significance in the second grade.

The CC-High condition exhibited significant main effects for girls' teacher-rated total problem behaviors in the Spring of first and second grades, with CC-High girls rated as showing fewer total problem behaviors than control girls. As was found with the boys, the comparison between the CC-Low and control girls yielded a significant main effect in the first but not the second grade.

For FSP High boys and girls, a significant main effect was found only in the Spring of the second grade with FSP High boys being rated as having fewer total problem behaviors than control boys. A nearly identical significant main effect for the FSP Medium condition was found only in the Spring of second grade.

Parent-rated problem behaviors: No significant effects were found for CC or FSP boys or girls.

Peer nominations of behavior: CC boys had significantly fewer peer nominations than control boys for aggression in the Spring of the first grade. FSP boys with mild to moderate elevations in pretest levels of peer-nominated aggression had fewer nominations in the Spring of the first grade than did control boys. CC-High boys had significantly fewer peer nominations for aggressive behavior in the Spring of the first grade than did control boys. No significant effects were found for the boys in the CC-Low condition, nor were there significant main or interaction effects for the CC-High or -Low conditions in terms of peer nominations for social participation/shy behavior. No significant main or interaction effects were found for CC girls for either aggressive behavior or social participation. Finally, no significant main or interaction effects were found for FSP boys or girls for aggression or social participation/shy behavior.

Achievement: CC-High boys had significantly higher reading and math achievement scores in the Spring of first and second grades than did control boys. CC-High girls whose pretest levels of math achievement were at or above the 40th Normal Curve Equivalent (NCE) demonstrated greater math achievement than control girls. The main effect for FSP High boys in first grade reading achievement approached significance, with FSP High boys having higher reading achievement scores in the Spring of first grade than did control boys. FSP High boys also significantly outperformed control boys on math achievement in the first grade.

6th Grade Outcomes (Ialongo et al., 2001):

No significant gender X intervention interactions were found. Teachers rated children in the CC and FSP interventions as having significantly lower levels of conduct problems than children in the control group. Students in the CC classrooms also had a significantly lower probability of being diagnosed with a conduct disorder in their lifetime than children in the control classrooms. Children in the CC intervention were significantly less likely to have been suspended in the sixth grade than children in the control condition. In addition, girls in the FSP intervention were significantly less likely to have been suspended in the sixth-grade year than girls in the control condition. Finally, relative to controls, CC intervention children were significantly less likely to be judged by teachers to be in need of mental health services. Parent and teacher reports indicated that CC children were also less likely to have received mental health services than controls.

Mediational analyses
The first hypothesis tested was that the impact of the interventions on the distal outcomes would be mediated by improvement in the early risk behaviors of attention/concentration problems and shy and aggressive behavior. A change score was computed representing the improvement in the early risk behaviors from the fall of first grade to the spring of second grade. Each of the interventions was found to have a significant impact on the change score in the expected direction, and the change score was significantly related to each of the distal outcomes. The change score was added into each of the impact analyses where a significant intervention effect was found. The coefficient for the intervention effect was then examined to determine whether it remained significant after introducing the change score into the equation. The size of the intervention coefficients and test statistics decreased in all instances when the change score was introduced into the model. In addition, the hypothesis that the statistically significant grade 6 CC and FSP intervention effects on antisocial behavior might be mediated through improved parenting practices via improvement in the early risk behaviors was tested. It was hypothesized that, relative to controls, CC and FSP intervention parents would be more likely to engage in reinforcing activities with their children and would be less likely to reject them. The analyses indicated that FSP and CC intervention parents, relative to controls, reported less rejection and greater involvement in reinforcing activities with the target child. Next, the relationship between the sixth grade rejection and reinforcement constructs and the change score representing the improvement in the early risk behaviors was examined. A significant relationship was found between the change score and parent rejection, but not reinforcement. The CC and FSP intervention effects on parent rejection were mediated through the change in the early risk behaviors. 
Subsequently, the rejection construct was examined to determine whether it was significantly associated with the sixth grade outcome variables on which the CC and FSP interventions had a significant impact. Finally, where a significant relationship was found between the rejection construct and the outcome variable, it was then tested whether the intervention effects were reduced in significance and magnitude when both the early risk behavior change score and the rejection construct were included in the model. In all cases, the significance level and the size of the intervention coefficient decreased.

In sum, evidence was found to suggest that the impact of the interventions on their distal targets was in part mediated by improvement in the early risk behaviors of attention/concentration problems and shy and aggressive behavior.

7th Grade Outcomes (Storr et al., 2002):

At the 6-year follow-up (grade 7), a total of 156 (28%) students reported that they had started to smoke tobacco. A modest attenuation in the risk of smoking was shown for students who had been assigned to the CC (26%) or FSP (26%) intervention classrooms relative to control classrooms (33%).

8th Grade Outcomes (Furr-Holden et al., 2004):

Both CC and FSP students were significantly less likely to report the onset of tobacco use at the end of 8th grade than those students in the control group. Students in the CC classrooms were significantly less likely to have started use of cocaine powder, crack, or heroin (i.e., Other Illegal Drug Use Scale) by the end of grade eight compared to the control group. There were no significant effects for alcohol, inhalants, and marijuana use.

Age 19 Outcomes (Wang et al., 2009; Bradshaw et al., 2009):

The CC intervention prolonged survival to the first cigarette smoked, whereas the FSP intervention did not. Although not a moderator of survival, depressed mood was associated with reduced survival time to the first cigarette smoked. CC students scored higher than controls in reading and math, had greater odds of high school graduation and college attendance, and lower odds of special education service use.

Note: The effects from the second intervention, Family-School Partnership (FSP), were smaller than those from the Classroom-Centered (CC) intervention and, in most cases, not statistically significant.

Study 4

Summary

Barrish et al. (1969) examined 24 students in a fourth-grade classroom who received the intervention; the study did not include a control group.

Barrish et al. (1969) found that the treatment group showed

A reliable, positive effect on disruptive behavior, with out-of-seat and talking-out behaviors reducing maximally during the game.

Evaluation Methodology

Design: The Good Behavior Game Program was analyzed in a fourth-grade classroom of 24 students. Seven of the students had been referred several times to the school principal for behavior problems, and the school principal reported that a general behavior management problem existed in the classroom. The game was first introduced during a math class, then extended to both reading and math periods. The teacher began the game by dividing the class into two teams, then describing the rules of the game (e.g., that teams could win the game by refraining from out-of-seat and talking-out behaviors, and that the winning team(s) would earn privileges such as extra recess time, while the losing team would have to work on extra assignments or after school). Neutral observers in the classroom recorded instances of disruptive behavior both before and during the GBG.

Sample: Twenty-four fourth grade students.

Measures: One or two observers visited the classroom three times per week for approximately one hour. Two types of disruptive behavior were tracked by the observers: out-of-seat behavior and talking out behavior. Out-of-seat behavior was defined as leaving the seat and/or seated position during a lesson or scooting the desk without permission. Talking-out behavior was defined as talking or whispering without permission.

Analysis: Analysis was divided into the following four phases: (1) baseline math and reading behavior scores; (2) an examination of the effect of playing the GBG during math only, on math behavior scores and reading behavior scores; (3) an examination of the effect of playing the GBG during reading only, on math behavior scores and reading behavior scores; (4) an examination of the effect of playing the GBG during both math and reading, on math behavior scores and reading behavior scores.

Outcomes: Results indicate that the game had a reliable, positive effect on disruptive behavior, with out-of-seat and talking-out behaviors most reduced during the game. Baseline scores indicated that a median of 96% of the one-minute intervals scored by the observer contained talking-out, and 82% contained out-of-seat behaviors. When the game was played during the math period, these scores declined to 19% and 9%, respectively. However, when the game was not played during the reading period, reading period scores remained unchanged. When the game was withdrawn during math and played during reading, similar results were found, with disruptive behaviors returning to previous levels during math, but declining during reading. When the game was played during both reading and math, disruptive behaviors again declined for math and continued to be lower for reading. Both teams almost always won the Good Behavior Game, winning 82% of the time.
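The interval-based scores reported above can be illustrated with a minimal sketch. It assumes each one-minute interval is coded simply as containing the behavior or not; the function names and the sample session are hypothetical and are not taken from Barrish et al. (1969).

```python
def percent_intervals_with_behavior(intervals):
    """Percent of observation intervals in which the behavior occurred."""
    if not intervals:
        raise ValueError("at least one interval is required")
    return 100.0 * sum(1 for occurred in intervals if occurred) / len(intervals)

def relative_reduction(baseline_pct, game_pct):
    """Percent reduction relative to the baseline level."""
    return 100.0 * (baseline_pct - game_pct) / baseline_pct

# Hypothetical session of ten one-minute intervals (True = talking-out seen).
session = [True, True, False, True, False, False, True, False, False, False]
session_pct = percent_intervals_with_behavior(session)

# Reported talking-out medians: 96% at baseline, 19% with the game in math.
talking_out_drop = relative_reduction(96, 19)
```

The study reports medians across sessions, so a fuller reconstruction would compute one percentage per session and then take the median within each phase.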

Study 5

Summary

Medland et al. (1972) examined an intervention group with 28 fifth-grade students who displayed virtually uncontrollable behavior during reading class, but the study did not include a control group.

Medland et al. (1972) found a reduction in problem behaviors among the intervention group.

Evaluation Methodology

Methods: The GBG was implemented in one public school for 28 fifth graders who displayed virtually uncontrollable behavior during reading class. The class was divided into two groups, and the teacher defined the targeted behaviors as above, with the game focused on reducing out-of-seat behavior, talking-out without permission, and disruptive behavior (i.e., kicking or hitting another, clapping, turning around in one's seat, etc.). The instructor then introduced the game, explained the rules, and described how teams would win and what rewards they would gain. Observers recorded student behaviors at baseline (five sessions prior to implementation), during the initial implementation, after the game ended, and after the game was reintroduced.

Outcomes: According to these observers, the introduction of the game reduced the problem behavior of Group 1 by 99% and Group 2 by 97%. As in the first experiment, problem behaviors increased after the game ended, although they did not reach prior levels (Group 1 returned to only 33% of its original scores and Group 2 to 82%), and then dropped again after the game was reintroduced.

Study 6

Summary

Mihalic et al. (2011) examined 13 schools in the Denver metro area with 36 first-grade classrooms. The classrooms were assigned at random to the GBG intervention or to a control condition.

Mihalic et al. (2011) reported the following findings:

GBG boys had higher levels of relational aggression and lower levels of social isolation than control group boys at posttest, findings which were not evident one year later.
GBG boys, in comparison to control group boys, had lower levels of achievement and concentration at the one-year follow-up.
At posttest, GBG girls had lower levels of antisocial behavior and lower proportions of GBG girls were in the upper quartiles of the antisocial and problem child measures compared to control group girls; these findings were no longer significant one year later.
At the one-year follow-up, GBG girls as compared to control group girls had higher levels of rejection and lower levels of achievement and concentration (marginal).
The GBG did not have the desired effect on students as evidenced by the few, sometimes negative, short-lived and substantively small differences found.

Evaluation Methodology

Design: In spring and summer 2008, elementary schools from four metro Denver, Colorado school districts were invited to participate in this replication evaluation. Thirteen schools entered the study resulting in 36 first grade classrooms for the 2008-2009 school year. Within schools, classrooms were assigned at random to the GBG intervention or to a control condition. This process yielded 18 GBG classrooms and 18 control classrooms. Though students were not strictly assigned at random to classrooms, school principals assured the investigators that assignment to classes was more or less at random and that students were not assigned to classes on the basis of known behavioral problems or other personal characteristics such as race, gender, or parent social status. The Good Behavior Game was implemented for one school year. Teachers in the intervention condition received two days of training on GBG implementation prior to the start of the school year and individualized bi-weekly classroom-based mentoring from GBG coaches throughout the school year. Baseline (beginning of school year) and posttest (end of school year) data were collected from structured interviews with all teachers and through direct observation of students in classrooms. Classroom observations were also conducted at three additional equally spaced times during the school year. At the end of the following school year (May 2010), second grade teachers completed interviews for all study participants remaining within the participating schools.

Sample Characteristics: The sample included 859 first-grade students; 49.8% were male, 54% were white, 36% were Hispanic, and 10% were Asian or other ethnicities. The thirteen schools ranged widely in economic diversity of their students as measured by the proportion of students eligible for free or reduced lunch (4% to 85%). Five schools were under 25%, five were in the 25-49% range, and three schools were over 50%.

Measures: Structured interviews were conducted with teachers using the Teacher Observation of Classroom Adaptation - Revised (TOCA-R). Information was collected on each student in a teacher's classroom via the TOCA-R's sequence of 53 items and 8 global ratings of general adaptation, all measured on a 6-point Likert scale. Measures included Achievement, Antisocial, Relational Aggression, Concentration, Impulsivity, Social Isolation, Prosocial Behavior, Rejection, and Victimization. A separate measure of Physical Aggression was created using a subset of items from the larger antisocial behavior measure.

A cluster analysis on the nine main measures of the TOCA-R produced four groupings of children. From these groupings, a binary measure was created to identify the group of Problem Children with high scores on antisocial, impulsivity, and rejection, and low scores on achievement and concentration. Two additional binary measures were created to identify males in the top one-quarter on the antisocial measure for males, and a similar measure for females.

Two student measures and one classroom measure were created from the classroom observation data at each time period. Aggressive was defined as the proportion of observations of a student in which the student was observed as being aggressive. Disruptive was the proportion of observations of a student in which the student was observed as being disruptive. A measure of General Classroom Disorder was defined as the proportion of classroom observational scans that were scored as being at least somewhat disorderly (only fairly well-behaved, poorly behaved or chaotic vs. well-behaved).
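The three observation-based measures above are all simple proportions, which a minimal sketch can make concrete. It assumes each student observation is coded as a yes/no flag and each classroom scan carries one of the four ratings named in the text; the function names and sample data are hypothetical.

```python
# Ratings treated as "at least somewhat disorderly", per the definition above.
DISORDERLY_RATINGS = {"only fairly well-behaved", "poorly behaved", "chaotic"}

def proportion(flags):
    """Share of observations for which the flag is true."""
    if not flags:
        raise ValueError("at least one observation is required")
    return sum(1 for flag in flags if flag) / len(flags)

def classroom_disorder(scan_ratings):
    """Share of classroom scans rated worse than well-behaved."""
    return proportion([rating in DISORDERLY_RATINGS for rating in scan_ratings])

# Hypothetical data: four observations of one student, four classroom scans.
student_disruptive = proportion([False, True, False, False])
general_disorder = classroom_disorder(
    ["well-behaved", "chaotic", "well-behaved", "poorly behaved"]
)
```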

Analysis: Given the nested nature of the sample, students nested within classrooms, and classrooms within schools, hierarchical or multilevel models were used to examine the effect of the GBG intervention on students at posttest (end of first grade) and at one-year follow-up (end of second grade). Models incorporated Poisson distributions for highly skewed variables, normal distributions for variables with bell-shaped distributions, and Bernoulli distributions for binary variables. Each model included the pretest measure of the variable being examined and school economic disadvantage as measured by the percent of students receiving free or reduced lunch. To replicate previous evaluations of the GBG and because of potential sex differences, models were run separately for boys and girls.

Outcomes

Baseline Equivalence and Differential Attrition: Of the 859 original students, 797 (92.8%) remained at posttest, and 626 (72.9%) remained at the one-year follow-up. Although there was substantial loss, the sex distributions of the overall and the GBG and control subsamples were not affected by the sample loss. Boys and girls were analyzed separately on the following measures: Race, Antisocial, Physical Aggression, Achievement, Concentration, and Problem Child. No baseline differences or differential attrition from pretest to the one-year follow-up were found for males on any of these measures. For females, at pretest, the GBG group had lower levels of antisocial and higher levels of achievement and concentration compared to the control group. However, neither group differed significantly from the overall distribution of the female subsample. The differences in achievement and concentration between the GBG and control samples remained and were slightly stronger at the one-year follow-up. At this same time, the GBG group also differed on these variables when compared to the total female pretest sample.
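The retention figures above can be checked with a short calculation. This is only an arithmetic sketch using the counts reported in the text; the function name is hypothetical.

```python
def retention_pct(remaining, original):
    """Percent of the original sample still present, to one decimal place."""
    return round(100.0 * remaining / original, 1)

ORIGINAL_N = 859  # first-grade students at baseline

posttest_retention = retention_pct(797, ORIGINAL_N)   # end of first grade
follow_up_retention = retention_pct(626, ORIGINAL_N)  # end of second grade

# Overall attrition by the one-year follow-up, in percent.
follow_up_attrition = round(100.0 - 100.0 * 626 / ORIGINAL_N, 1)
```

Both retention values reproduce the reported 92.8% and 72.9%, which implies that roughly 27% of the original sample was lost by the one-year follow-up.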

Posttest: At the end of first grade, boys were found to be different on only two measures. Boys in the GBG group had higher levels of relational aggression and lower levels of social isolation compared to the boys in the control group. For girls, the GBG intervention was associated with decreases in antisocial behavior and reductions in the proportion of girls in the upper quartiles of the antisocial and problem child measures. In the observation data, there were too few instances of aggression observed to analyze. However, for girls there was a marginal decrease in disruptive behavior for the GBG group as compared to the control group (p=.085). The GBG intervention had no significant impact on classroom disorder over the period of the intervention.

One-Year Follow-up: At one year post-treatment, GBG boys, when compared with control group boys, had lower levels of achievement and concentration as rated by their second grade teachers on the TOCA-R. GBG girls were also rated lower on achievement and marginally lower on concentration (p=.083) compared to girls in the control group. Additionally, rejection was higher for the GBG girls.

The mean differences found between intervention and control groups for both boys and girls on the TOCA-R measures were relatively small, even when statistically significant. Significant differences fell between categories on a 6-point scale (almost never, rarely, sometimes, often, very often, almost always), with the mean difference usually about 0.2. For the measures of particular interest, antisocial behavior and aggression, the mean differences were between the 'almost never' and 'rarely' categories.

Study 7

The version of GBG included in this study varies from the U.S. version in the following ways: 1) Teachers do not mention the children who violated GBG rules, nor do teams get penalty points, and 2) Teams do not compete for rewards.

Summary

Leflot et al. (2010, 2013) examined 15 elementary schools, 30 second-grade classrooms, and 570 students in rural to moderately urban communities in the Flemish-speaking part of Belgium. Within each school, classrooms were randomly assigned to the intervention (classroom n = 15; student n = 287) or control (classroom n = 15; student n = 283) condition.

Leflot et al. (2010, 2013) found that, relative to controls,

teachers in the intervention group used fewer negative remarks;
children in the intervention group showed more on-task behavior;
children in the intervention group showed less talking-out behavior;
teachers in the intervention group used more praise;
children in the intervention group showed a decreased rate of growth in oppositional behavior.

Evaluation Methodology

Design:

Recruitment/Sample size: Fifteen elementary schools in rural to moderately urban communities in the Flemish-speaking part of Belgium participated in the study (school selection and recruitment procedures are not described). Each school had two second grade classes, for a total of 30. Of the 587 students in these classes, 570 (97%) received consent to participate in the study.

Study type/Randomization/Intervention: Within each school, second grade classes were randomly assigned to the intervention (classroom n=15; student n=287) or control (classroom n=15; student n=283) condition. Intervention group classrooms implemented the GBG program across the second and third grades with students remaining in intact classes when moving from one grade to the next.

Assessment/Attrition: Assessments were completed prior to program implementation at the beginning of second grade (pretest), at the end of second grade, at the beginning of third grade, and at the end of the third grade after completion of the program (posttest). All assessments were scheduled during non-GBG periods. From pretest to posttest, overall attrition was 7.2%.

Sample Characteristics:

The mean age of the sample at baseline was 7 years and 5 months; 49% were boys and over 95% of the sample was White and of Belgian nationality. Among the parents, 63% of the mothers and 57% of the fathers had completed higher education. The schools came from rural to moderately urban communities.

Measures:

Hyperactive and oppositional behaviors were rated by the student's peers, who were asked to nominate all children in the classroom fitting the behavioral descriptions "Cannot sit still in the classroom" (hyperactive) and "Disobeys in school" (oppositional). The nominations a child received for each description were summed and divided by the number of children in the class minus one (accounting for the rated student). Internal consistency was .91 for hyperactivity and .90 for oppositional behavior.
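The nomination score described above can be illustrated with a minimal sketch. The function name and the example class are hypothetical and serve only to show the arithmetic; they do not come from the study.

```python
def nomination_score(nominations_received: int, class_size: int) -> float:
    """Share of classmates who nominated a child for one description.

    The denominator is class size minus one because a child cannot
    nominate himself or herself.
    """
    if class_size < 2:
        raise ValueError("class_size must be at least 2")
    raters = class_size - 1
    if not 0 <= nominations_received <= raters:
        raise ValueError("nominations must be between 0 and class_size - 1")
    return nominations_received / raters


# Hypothetical example: a class of 20 in which 5 classmates name one child.
score = nomination_score(5, 20)
print(round(score, 3))
```

A score of 0 means no classmate nominated the child, and a score of 1 means every classmate did.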

On-task behavior was measured by two trained observers from the research team, who appear to have known each classroom's assigned condition. They recorded behaviors of each child for 20 seconds six times during each observation session. Before the start of the study, the two observers were trained during multiple sessions using videotapes of classroom situations made for the purpose of demonstrating this observation procedure and determining the interobserver agreement. After each training session, interobserver agreement was calculated. Retraining exercises were concluded when interobserver agreement levels of 80% or higher were reached. During observations, each child received a score ranging from 0 (not on task at all during the whole 20 seconds) to 3 (on task during the whole 20 seconds). Based on the observation of their baseline on-task behavior at wave 1, children were identified as either moderate/high on-task or low on-task. Children who had an observed individual score that was at least 1 standard deviation lower than the grand mean were assigned to the low on-task group (dummy coded as 1). All other children were assigned to the moderate/high on-task group (dummy coded 0).
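The dummy coding of low on-task children can be sketched as follows. The scores are invented for illustration, and the use of the sample standard deviation is an assumption, since the write-up does not say which form the authors used.

```python
from statistics import mean, stdev


def code_low_on_task(scores):
    """Return 1 for children at least 1 SD below the grand mean, else 0.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    cutoff = mean(scores) - stdev(scores)
    return [1 if s <= cutoff else 0 for s in scores]


# Hypothetical baseline on-task scores on the 0-3 scale for eight children.
baseline = [2.5, 2.8, 3.0, 2.7, 1.0, 2.9, 2.6, 2.4]
print(code_low_on_task(baseline))
```

In this made-up class only the child scoring 1.0 falls below the cutoff and is coded 1.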

Disruptive behavior was also measured by the two trained observers described above. Talking-out and out-of-seat behaviors were tallied on an observation sheet if they occurred during the 20 second time intervals described above. For both behaviors, a mean score was calculated, resulting in a score ranging from 0 to 1 (with 1 indicating that a child demonstrated the behavior at all six intervals of observation). Prior to data collection, the observers rated children's behavior simultaneously during live classroom situations with 94% and 91% agreement for talking-out and out-of-seat behavior, respectively.
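As a small illustration of the 0-to-1 score, the sketch below averages six interval tallies for one behavior. The function name and the tallies are hypothetical.

```python
def behavior_score(intervals):
    """Mean of six 0/1 tallies: 1 if the behavior occurred in that interval."""
    if len(intervals) != 6:
        raise ValueError("expected six 20-second intervals")
    if any(x not in (0, 1) for x in intervals):
        raise ValueError("each tally must be 0 or 1")
    return sum(intervals) / len(intervals)


# Hypothetical child who talked out during two of the six intervals.
talking_out = [1, 0, 0, 1, 0, 0]
print(round(behavior_score(talking_out), 2))
```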

Teacher's behavior management was observed at three time points. Teachers were observed for 10 minutes with all verbal praise for positive classroom behavior and negative remarks for disruptive behavior tallied on an observation sheet during 20 second intervals. Mean scores were calculated for verbal praise and negative remarks over the three observation periods. Prior to data collection, there was 93% agreement in observer-ratings of teacher's behavior management during live classroom situations.

Aggression was measured in interviews of the children conducted by the same researchers who did the observations. The measure used peer ratings where children were asked to list all children in the classroom who met the behavioral description "sometimes hits children." The number of nominations received was divided by the number of children in the classroom minus one so that scores ranged between 0 and 1 (where 1 means that everyone except the child in question nominated this child as aggressive). The internal consistency reliability coefficient was 0.85 and the test-retest reliability coefficient was .75.

Peer rejection was measured in interviews using peer ratings where children were asked to list all children they liked least and these nominations were translated into a score indicating the level of peer rejection (e.g., score of 1 meant that all children in the classroom nominated this child as the least liked). The internal consistency reliability coefficient was 0.81 and the test-retest reliability coefficient was .60.

Analysis:

Leflot et al. (2010): Teacher's behavior management and children's classroom behavior were analyzed using a series of analyses of covariance (ANCOVAs). For the end-of-year data (collected after second and third grades) values from the beginning of the respective school year were entered as covariates.

The development of hyperactive and oppositional behaviors was analyzed with separate latent growth models using maximum likelihood estimation. All standard errors were adjusted for classroom level variation by using a sandwich estimator and growth parameters were regressed on condition and gender.

A mediation model was used to test the mediation of the development of hyperactive and oppositional behaviors. Teacher's behavior management and children's classroom behavior at the end of second and third grades were regressed on the respective scores from the beginning of the school year and on condition. End-of-year children's classroom behavior from second and third grades was further regressed on gender and the intercept of the behavior of interest (i.e. hyperactive behavior and oppositional behavior, separately) as well as the respective end-of-year teacher behavior management variables. Allowing for the mediation paths, the slope of the behavior of interest (i.e. hyperactive behavior and oppositional behavior, separately) was then regressed on second and third grade end-of-year on-task and talking-out behavior. For oppositional behavior (only), adding the third grade mediators to the model did not improve model fit, so they were dropped from the model. Non-significant paths from children's classroom behaviors to the slope of the behavior of interest were set to zero and reportedly did not affect model fit.

Leflot et al. (2013): The analyses were conducted in two stages. The first stage examined whether individual variation in baseline levels of on-task behavior moderated the effect of the intervention on the development of aggression, while controlling for the mean classroom level of on-task behavior at baseline. A multilevel growth model was fitted with variation across time at level 1, variation across individuals at level 2, and variation across classes at level 3. The development of children's aggression over the four assessments was captured by latent growth parameters, in which the intercept reflects initial level differences (at baseline) and the slope reflects change over time. The growth parameters were regressed on sex and on-task behavior at the individual level (within level or level 2) and regressed on intervention status and the latent classroom mean of children's baseline level of on-task behavior at the classroom level (between level or level 3).

To test for this moderation, a random slope parameter was considered to reflect a cross level (classroom-to-individual level) interaction variable. Modeling this random slope provides a test of whether the effect of children's on-task behavior on the individual level growth parameters could vary across classrooms. The random slope parameter was regressed on intervention status and classroom level of on-task behavior (level 3). A significant effect of intervention status on this random slope parameter indicates that the effect of individual level variation in children's on-task behavior on the growth parameters of aggression depends on intervention status, regardless of possible classroom level effects of on-task behavior.

The second stage mediation analysis examined children who had initially low on-task scores. The wave 4 value of the mediator was regressed on its wave 1 value and sex. Mediation was tested for (1) by regressing the growth factors of aggression on intervention status, (2) by regressing the wave 4 value of on-task behavior (peer rejection) on intervention status, and (3) by regressing the slope of aggression on on-task behavior (peer rejection) at wave 4, controlling for intervention status. The test came from calculating the significance of the pathway that comprised the indirect effect.

To control for the multilevel design of the data, all standard errors were adjusted for classroom level variation by using a sandwich estimator. All models were analyzed with the Mplus missing data module to handle missing data.

Intention-to-treat: The study reported that students who moved or were retained in the original grade were "dropped" from the study, indicating that the researchers may not have adhered to the intent-to-treat principle. However, the study states (pg. 192) that the intent-to-treat principle was adhered to by including all the other students in the analysis regardless of dose received.

Outcomes:

Implementation Fidelity: Teachers implementing the intervention received a half-day group training followed by eight 60-minute observations during each implementation year provided by a trained school consultant. The consultant also rated implementation quality on a scale from 0 to 2 during observations on six categories covering the intervention components. The mean fidelity score was 9.21 out of 12.
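The fidelity scale can be restated as simple arithmetic: six categories rated 0 to 2 give a maximum of 12, so the reported mean of 9.21 is about 77% of the maximum. The variable names below are illustrative only.

```python
CATEGORIES = 6       # intervention components rated by the consultant
MAX_RATING = 2       # each category is rated from 0 to 2

max_total = CATEGORIES * MAX_RATING          # highest possible fidelity score
mean_fidelity = 9.21                         # mean reported in the study
share_of_max = mean_fidelity / max_total     # fraction of the maximum

print(max_total, round(100 * share_of_max, 1))
```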

Baseline Equivalence: No significant differences were found between conditions on demographic or outcome variables.

Differential Attrition: The rate of attrition was 7.2%, and while it did not differ significantly by condition, children who dropped out of the study (because of grade retention or moving away from school) had higher levels of peer-rated hyperactive behavior, oppositional behavior, aggression and rejection.

Posttest:

Leflot et al. (2010): For teacher's behavior management, ANCOVAs found no significant differences at the beginning of second or third grades. At the end of second grade, teachers in the intervention group reportedly used significantly fewer negative remarks and marginally significantly more praise relative to control group teachers. At the end of third grade, intervention group teachers used significantly more praise.

For children's classroom behavior, ANCOVAs found that at the end of second grade children in the intervention group showed significantly more on-task and less talking-out behavior than controls. These differences were not present at the beginning of third grade (after the summer vacation), but were again observed at the end of third grade.

Latent growth models revealed a significantly different rate of change in peer-rated oppositional behavior for intervention and control group participants with controls experiencing an increase in oppositional behavior, and participants in the intervention group experiencing a decrease. The observed difference in peer-rated hyperactive behavior between intervention and control group participants indicated a marginally significant (p=.09) slower rate of increase in such behavior among intervention group participants relative to controls.

Leflot et al. (2013): The results offered evidence of both moderation and mediation. The analyses indicate that the intervention affected only children who scored low on a measure of on-task behavior at baseline.

Moderation Analysis:

Leflot et al. (2013): At posttest for children with moderate/high scores for on-task behavior, the intervention did not have significant influences on on-task behavior, aggression, or peer rejection. For low on-task children, the intervention had a significant effect on two of three outcomes: on-task behavior (p=.05; effect size d=-0.47) and aggression (p=.03; effect size d=0.48). The third outcome, peer rejection, showed a marginally significant effect of the intervention (p=.08, effect size d=0.41).

Mediation Analysis:

Leflot et al. (2010): The mediation analyses for peer-rated hyperactive and oppositional behaviors revealed that direct intervention effects for these outcomes (found in the latent growth curve models) were not significant.

The mediation analysis for the development of hyperactive behavior found that the intervention had direct effects in reducing teacher's negative remarks and increasing child on-task behavior at the end of second grade. A marginally significant (p=.07) increase in teacher praise was also observed at the end of second grade. The reduction in teacher's negative remarks predicted increases in child on-task behavior at the end of second grade, and child on-task behavior at the end of second grade significantly reduced the slope of hyperactive behavior. The indirect path of the intervention via teacher negative remarks and child on-task behavior at the end of second grade significantly decreased the development of hyperactive behavior. Testing for gender differences using a multi-group model and a Wald test, the results showed that there were no differences in this indirect effect between boys and girls. The intervention had a significant effect, increasing teacher praise at the end of third grade. This increase in turn significantly predicted lower child talking-out behavior at the end of third grade; however, only on-task behavior at the end of third grade significantly contributed to (decreased) the slope of hyperactive behavior.

The mediation analysis for the development of oppositional behavior found that the intervention had direct effects in reducing teacher's negative remarks and decreasing child talking-out behavior at the end of second grade. A marginally significant (p=.07) increase in teacher praise was also observed at the end of second grade. At the end of second grade, the path from negative remarks to talking-out behavior, and from talking-out behavior to the slope of oppositional behavior were positive and significant. The indirect path of the intervention via teacher negative remarks and child talking-out behavior at the end of second grade significantly decreased the development of oppositional behavior. Testing for gender differences using a multi-group model and a Wald test, the results showed that there were no differences in this indirect effect between boys and girls.

Models excluding children's classroom behavior found no significant indirect effects through teacher's behavior management variables for hyperactive or oppositional behavior. A model assuming a reversed path, in which children's classroom behavior affected teacher behavior, found only non-significant indirect paths for both types of behavior. Models examining the significant indirect paths excluding teacher's behavior management variables found that the indirect path from the intervention to hyperactive behavior via child on-task behavior at the end of second grade was marginally significant (p=.06), and the indirect path from the intervention to oppositional behavior via child talking-out at the end of second grade was significant.

Leflot et al. (2013): For children with initially low on-task scores, peer rejection but not on-task behavior mediated the effect of the intervention on aggression. No support was found for the indirect effect of the intervention on the development of aggression via on-task behaviors; however, the indirect effect of the intervention on the development of aggression via peer rejection was statistically significant.

Study 8

Summary

Spilt et al. (2013) and Witvliet et al. (2009) examined 825 Kindergarten students from 47 classrooms in 30 elementary schools located in two urban areas in the western Netherlands and one rural area in the eastern Netherlands. Classrooms were randomly assigned to the intervention condition (66%, n = 31) or control condition (34%, n = 16).

Spilt et al. (2013) and Witvliet et al. (2009) reported the following findings:

Reduction in externalizing behavior and improvement in positive peer relations among GBG children, compared to controls.
Intervention effectiveness varied by subgroups of children with different baseline risk profiles.
Positive intervention effects on the development of internalizing behavior were found for three groups of children characterized as low-risk, victimized, and having internalizing problems.
Positive intervention effects on the development of externalizing behavior were found for low-risk children and those with internalizing problems.
No intervention effects were found for children with moderately-high to severe sociobehavioral risk or family and demographic risk.
For both externalizing and internalizing behavior, intervention effects were strongest for children with baseline internalizing problems.

Evaluation Methodology

Design: In 2004, 825 Kindergarten students from 47 classrooms in 30 elementary schools from two urban areas in the western Netherlands and one rural area in the eastern Netherlands were included in the study. All children who moved on to first grade (n=742) and those who were to repeat first grade (n=100) were eligible for inclusion. Parental consent was obtained for 90% of the children yielding a sample of 759 students as reported by Spilt et al. (2013). (Witvliet et al., 2009, report a sample of 758 students.) Classrooms were randomly assigned to the GBG (66%, n=31) or control condition (34%, n=16).

The Good Behavior Game was delivered in Grades 1 and 2. Assessments of externalizing and internalizing behaviors were conducted on four occasions: spring of Kindergarten, spring of Grade 1, fall of Grade 2, and early summer at the end of Grade 2. Child and social risk measures were collected in Kindergarten. Parenting and demographic measures were collected in the spring of Grade 1.

For the analyses described here, only children with at least two of the four outcome assessments were included (97.6%, n=741). No information was provided regarding the numbers of reports at each assessment; however, Witvliet et al. state that 113 children dropped out between first and second grades due to grade retention or moving to another school.

Sample Characteristics: Children were on average 6.0 years old at the beginning of the study. In accordance with the general Dutch population, 38% of the children were from low socioeconomic status (SES) families. The sample was ethnically diverse: 57% Dutch/Caucasian, 11% Moroccan, 10% Turkish, 6% Surinamese, 5% Netherlands Antillean, and 12% other ethnic backgrounds.

Measures:

Outcome Variables-
Outcome measures were obtained using the Problem Behavior at School Interview (PBSI) in which teachers rate students' behaviors on a 5-point Likert scale ranging from 0 (never applicable) to 4 (often applicable). Externalizing Behavior was the average of the scores of the Conduct Problems scale (12 items, alpha=.90-.92) and the Oppositional Defiant Problems scale (7 items, alpha=.89-.96). Internalizing Behavior was the average of the Anxiety scale (5 items, alpha=.81-.83) and the Depression scale (7 items, alpha=.78-.85).

Risk Variables-
The Child Behavior Checklist - Teacher Report Form (TRF) was administered to Kindergarten teachers. Behaviors were rated on a 3-point Likert scale (0=not true to 2=very true/often true). Four subscales were used: Externalizing problems (32 items, alpha=.67), Attention problems (26 items, alpha=.93), Social problems (11 items, alpha=.73), and Internalizing problems (22 items, alpha=.70).

Relational victimization (3 items, alpha=.90) and Physical victimization (3 items, alpha=.83) were assessed with the PBSI and averaged to create a single Victimization score.

Maternal Involvement was assessed with the Alabama Parenting Questionnaire. Mothers completed the 10-item subscale Involvement (alpha=.81).

Parenting Stress was assessed with The Parent Domain of the Nijmegen Parenting Stress Index. Mothers rated 11 items such as "Being a parent to this child is more difficult than I thought" on a 6-point Likert scale ranging from 0 (completely disagree) to 5 (completely agree) (alpha=.76).

Maternal Depression was measured with the subscale Depressed Mood of the K10 scale, which is a short screening tool designed to monitor population prevalence of psychological distress. Mothers rated three items including "How often do you feel somber or depressed?" on a 5-point Likert scale (0=never to 4=always). Alpha was .82.

Demographic risks included Minority Status (0=Dutch, 1=non-Dutch) and Socioeconomic Status based on maternal and paternal educational level and vocation (0=average to high, 1=low).

Analysis: The Mplus program was used with Latent Class Analyses to fit a growth mixture model that included risk profiles, intervention condition, and the intervention outcomes with the goal to identify classes of individuals with similar response patterns (or risk profiles) based on the set of observed risk variables and similar growth trajectories. The estimated parameters in the model were a) latent class membership, and for each class b) class-specific baseline profiles, that is means of each latent class indicator (or risk variable), c) means of growth parameters (intercept and slope of Externalizing and Internalizing behavior), and d) an estimate of the effect of GBG on the slopes of Externalizing and Internalizing behavior. Initial data exploration showed no evidence of quadratic growth, so linear slopes were modeled. The intercept and slope estimates for the same outcomes were allowed to correlate. These correlations as well as the variances of the latent indicators and growth parameters were held constant between classes. To accommodate incomplete data, full information maximum likelihood estimation was used under the assumption of missing at random. The authors state that "the nested structure of the data was taken into account." Wald tests of parameter constraints were also used to test for differences in intervention effects between latent classes.

Outcomes

Implementation Fidelity: Teachers received three afternoons of training and 10 classroom sessions with licensed GBG supervisors per year. Supervisors gave feedback to the teachers and plans for improvement were made if needed. Through these classroom observations, supervisors checked for treatment integrity. No further details on fidelity to the intervention were provided.

Baseline Equivalence and Differential Attrition: At baseline, no differences were found between the intervention and control groups on sex, area of the country lived in, or dropout status. However, control children were from lower SES families and were less often of Dutch/Caucasian background than their GBG counterparts. Only the 741 students (97.6%) who had at least two of the four outcome assessments were included for analysis. Inclusion/exclusion was not related to intervention condition.

Posttest: (Spilt et al., 2013) The optimal model assigned children to one of six classes. Fifty-three percent of the children (n=396) were classified into a low-risk profile (Low risk); they scored low on all baseline variables. Within this class, the GBG intervention significantly predicted change in both Externalizing and Internalizing behavior, indicating a positive intervention effect. Though there was growth in these behaviors, the increases were significantly attenuated for the GBG group as compared to the control group. Fourteen percent of the children (n=107) experienced high levels of victimization but fairly low levels of all other risk variables (Victimization risk). For this class, the intervention had the effect of decreasing Internalizing behavior. Eight percent of the children (n=60) experienced emotional disturbances; they had the highest scores on Internalizing problems and moderate scores on Social problems (Internalizing risk). The GBG decreased both Externalizing and Internalizing behavior for these children. A fourth group of children (14%, n=106) had high levels of family and demographic risk (Family and demographic risk). They experienced high Parenting stress, high Maternal depression, and low Maternal involvement. These children were often from low SES and ethnic minority families. Within this class, no significant differences were found between intervention and control children on the slopes of Externalizing and Internalizing behavior. Finally, two classes of children were found with behavioral and social problems, one with moderately high risk (Moderately-high sociobehavioral risk) and one with high levels of risk (Severe sociobehavioral risk). Seven percent of the children (n=51) scored medium high on Externalizing problems, Attention problems, Social problems, and Victimization (1 to 2 SD above the mean). No significant intervention effects were found for this class. Three percent of the children (n=21) exhibited severe levels of Externalizing problems, Attention problems, and Social problems (2.5 to 4 SD above the mean); the highest level of Victimization; and a heightened level of Internalizing problems. These children were somewhat more likely to come from low SES families. Again, the intervention had no significant effect for this class.
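The six class sizes reported above can be checked against the analysis sample of 741. The short sketch below uses only counts stated in the text; the dictionary layout is illustrative.

```python
class_sizes = {
    "Low risk": 396,
    "Victimization risk": 107,
    "Internalizing risk": 60,
    "Family and demographic risk": 106,
    "Moderately-high sociobehavioral risk": 51,
    "Severe sociobehavioral risk": 21,
}

total = sum(class_sizes.values())  # should equal the analysis sample
# Whole-number percentages, as reported in the text.
shares = {name: round(100 * n / total) for name, n in class_sizes.items()}

print(total)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The counts sum to 741 and reproduce the rounded percentages given for each class (53, 14, 8, 14, 7, and 3).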

Wald tests indicated some differences in intervention effects between risk profiles. On the development of Externalizing behavior, in comparison to the Low risk group, significantly stronger positive effects were found for the Internalizing risk group. No significant differences were found between the Low risk group and the other risk groups, possibly due to the large inequality in group sizes which compromises statistical power to detect group differences. Significantly stronger intervention effects were found for the Internalizing risk group than for the Family and demographic risk group and the Moderately-high sociobehavioral risk group. However, effects for the Internalizing risk group did not differ from those found for the Victimization risk group and the Severe sociobehavioral risk group.

Significant class differences in intervention effects on the development of Internalizing behavior were also found. In comparison to the Low risk group, stronger intervention effects were found for the Internalizing risk group. The difference between the Low risk group and the Victimization risk group was marginally significant (p=.07). The Internalizing risk group experienced significantly stronger positive effects than all other risk groups. The Victimization risk group differed significantly from the Family and demographic risk group and marginally significantly (p=.08) from the Moderately-high sociobehavioral risk group.

(Witvliet et al., 2009) Reductions in children's externalizing behavior and improvements in positive peer relations were found among GBG children, as compared with control group children.

Study 9

Summary

Mitchell et al. (2015) examined 68 students from three classrooms in one school who participated in the intervention, but the study lacked a control group.

Mitchell et al. (2015) reported large reductions in problem behavior but lacked significance tests.

Evaluation Methodology

Design:

Recruitment: The principal of a high school located in a Southeastern state recommended classrooms with high levels of disruptive behavior to the study. Three teachers in three classrooms agreed to participate.

Assignment: All 3 classrooms, including 68 students, participated in the intervention. Trained observers attended classes 2-4 times per week during the intervention.

Attrition: The study did not discuss attrition, although it was likely high as 1 of 3 classrooms withdrew from the study after the first phase of intervention.

Sample: The average age of students varied from 14.7-15.6 years old; most students were in the ninth grade. The sample was half female (n=34). Of the 68 students, 60 were black, 2 were Hispanic, 5 were biracial, and 1 was white.

Measures: Trained observers from the research project conducted 20-minute observations in 120 ten-second intervals, during which they noted any instance of the targeted behavior. Observers rotated which student was selected per ten-second interval. Instances of out-of-seat behavior, off-task behavior, and inappropriate vocalizations were recorded. In addition, the study recorded student-reported acceptability of the intervention. The study noted high interrater reliability for the measures, but observers were obviously aware of whether the program was being used or not.

Analysis: The study used Nonoverlap of All Pairs (NAP) to determine overlap between baseline and treatment periods and between the withdrawal and renewed treatment periods. The overlap measure compared each individual baseline or withdrawal data point with each individual treatment data point. It then calculated a weighted average across pairs of comparisons.
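A minimal sketch of Nonoverlap of All Pairs follows, using the standard definition: the share of baseline-treatment pairs in which the treatment point shows improvement, with ties counted as one half. The data are invented, and the study's exact weighting across phases is not described here, so this should not be read as the authors' implementation.

```python
from itertools import product


def nap(baseline, treatment, lower_is_better=True):
    """Nonoverlap of All Pairs: improved pairs plus half of ties, over all pairs."""
    if not baseline or not treatment:
        raise ValueError("both phases need at least one data point")
    score = 0.0
    pairs = list(product(baseline, treatment))
    for b, t in pairs:
        if t == b:
            score += 0.5                     # a tie counts as half an improvement
        elif (t < b) == lower_is_better:
            score += 1.0                     # treatment point is better than baseline
    return score / len(pairs)


# Hypothetical percentages of intervals with off-task behavior.
baseline = [55, 60, 52]
treatment = [30, 25, 55, 28]
print(nap(baseline, treatment))
```

With these made-up values, 10.5 of the 12 pairs count as improvement, giving a NAP of 0.875. A value of 0.5 would indicate complete overlap between phases.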

Intent-to-Treat: The study had no discussion of attrition or missing data. However, one classroom whose teacher failed to use the program in the second phase was dropped without any effort to follow student behavior.

Outcomes

Implementation Fidelity: The study reported integrity among the teachers between 92% and 98% during the treatment periods. This was measured by observers using a checklist for teacher implementation of the program.

Baseline Equivalence: Not applicable.

Differential Attrition: The study did not discuss differential attrition.

Posttest: The study did not present significance tests of changes; however, all 3 classrooms showed large reductions in problem behaviors between the baseline and first implementation period. At baseline, between 52 and 61% of observations included off-task behavior, 2-6% included out-of-seat behavior, and 30-38% included inappropriate vocalizations. At the first intervention, off-task behavior decreased to 25-32%, out-of-seat behavior decreased to 0.5-1%, and inappropriate vocalizations decreased to 8-15%. During the first withdrawal period, problem behaviors increased (although remained lower than baseline levels) and during the second implementation period, problem behaviors decreased further.

Long-Term: None reported.

Study 10

Humphrey et al. (2018) served as the main source for information on the design and results. Ashworth, Humphrey et al. (2020) focused on subgroup analyses at low, medium, and high levels of student risk. Ashworth, Panayiotou et al. (2020) used special statistical methods to examine the outcomes for students receiving a high program dosage. Troncoso & Humphrey (2021) extended the earlier studies by examining long-term outcomes.

Summary

Humphrey et al. (2018), Ashworth, Humphrey et al. (2020), Ashworth, Panayiotou et al. (2020), and Troncoso & Humphrey (2021) evaluated a large-scale implementation of the program in England, with 38 schools in the intervention group and 39 schools in the usual provision group. Measures of student reading and classroom behavior were obtained at baseline, interim, two-year posttest, and three- and four-year follow-ups.

Humphrey et al. (2018), Ashworth, Humphrey et al. (2020), Ashworth, Panayiotou et al. (2020), and Troncoso & Humphrey (2021) found no significant effects on reading, classroom behavior, and teaching skills at posttest. From the baseline assessment through four-year follow-up (Troncoso & Humphrey, 2021), relative to those in the control group, participants in the intervention group showed larger reductions in:

Concentration problems

Evaluation Methodology

Design:

Recruitment: In total, 77 primary schools from three regions of England (Greater Manchester, West and South Yorkshire, and the East Midlands) met the eligibility criteria of being state-maintained and not having already implemented the program. The sample included pupils in Year 3 classes in the first year of the trial (2015/16). After accounting for parental opt-outs (N = 68, 2.2%), 3,084 students ages 6-7 participated in the evaluation. The sample mirrored primary schools in England for size and the proportion of students speaking English as an additional language, but it contained significantly larger proportions of children with special education needs, free school meal eligibility, low rates of absence, and low rates of level 4 reading and math attainment.

Assignment: Schools were randomly allocated by an independent team using a minimization algorithm based on school size and the proportion of students receiving free school meals to one of two conditions: 1) deliver the program over a two-year period while paying a subsidized fee (N = 38 schools and 1560 students); or 2) continue as normal and receive financial compensation for participating in data collection (N = 39 schools and 1524 students).
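The report does not describe the minimization procedure in detail. As a rough illustration only, the Python sketch below shows one common form of minimization, in which each new school goes to the arm that currently holds fewer schools with the same characteristics. The function name, the size and free-school-meal bands, and the tie-breaking rule are all assumptions made for this example and are not taken from the trial.

```python
import random
from collections import defaultdict

ARMS = ("intervention", "control")

def minimize_allocate(schools, seed=0):
    """Assign each school to the arm that minimizes covariate imbalance.

    `schools` is a list of (school_id, size_band, fsm_band) tuples; the
    bands are hypothetical categories, not those used in the trial.
    """
    rng = random.Random(seed)  # used only to break ties
    # counts[(factor, level)][arm] = schools already allocated with that level
    counts = defaultdict(lambda: {arm: 0 for arm in ARMS})
    allocation = {}
    for school_id, size_band, fsm_band in schools:
        levels = [("size", size_band), ("fsm", fsm_band)]
        # Imbalance score per arm: how many similar schools it already holds.
        score = {arm: sum(counts[lvl][arm] for lvl in levels) for arm in ARMS}
        best = min(score.values())
        candidates = [arm for arm in ARMS if score[arm] == best]
        arm = rng.choice(candidates)
        allocation[school_id] = arm
        for lvl in levels:
            counts[lvl][arm] += 1
    return allocation

# Made-up example: eight schools in two size bands and two FSM bands.
example = [(i, "large" if i % 2 else "small", "high" if i % 4 < 2 else "low")
           for i in range(8)]
result = minimize_allocate(example)
```

Practical minimization schemes often add a random element even when one arm is clearly preferred, so that allocations cannot be predicted. This sketch uses randomness only for ties.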

Assessments/Attrition: Outcomes were assessed at baseline (summer term 2015), the end of the first year of the trial (interim, summer term 2016), and the conclusion of the trial (posttest, summer term 2017). At baseline, 2% were missing data for the reading test. At posttest, 18.3% were missing reading data, due to leaving school (12.6%) or absence on the day of testing (5.7%). The analysis sample consisted of those with measures at both points (N = 2504, 81%).

In Troncoso & Humphrey (2021), outcomes were assessed at baseline, interim, posttest, and three- and four-year follow-ups (summer terms 2018 and 2019). Approximately 3.8%, 18.5%, 20.8%, and 27.1% were missing data on concentration problems, prosocial behavior, and disruptive behavior at baseline, posttest, and three- and four-year follow-ups, respectively. The analysis sample consisted of 2938 children (95.3%) in 77 schools (100%).
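The sample sizes and retention percentages reported above can be cross-checked with simple arithmetic. The short Python sketch below uses only figures stated in the text. The denominator used for the opt-out percentage (participants plus opt-outs) is an assumption.

```python
# Figures reported in the text above.
randomized = 1560 + 1524   # students in the intervention and control arms
opt_outs = 68              # parental opt-outs
posttest_sample = 2504     # reading data at both baseline and posttest
growth_sample = 2938       # children in the long-term growth models

def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(randomized)                             # 3084
# Assumption: the opt-out rate is taken over participants plus opt-outs.
print(pct(opt_outs, randomized + opt_outs))   # 2.2
print(pct(posttest_sample, randomized))       # 81.2
print(pct(growth_sample, randomized))         # 95.3
```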

Sample: The sample included 50-55% male students, 23-27% students eligible for free school meals, 26-29% students speaking English as an additional language, 18-23% students with special education needs, and 13-18% students rated at risk for conduct problems.

Measures: The primary outcome measure at baseline used teacher assessments of reading that were collected as part of national tests across England in spring 2015 and obtained from the National Pupil Database. The posttest assessment used the Hodder Group Reading Test, a standardized measure of reading comprehension at word, sentence and continuous text level. Members of the research team administered the test, but the team was independent and not invested in the intervention. Every test paper was double-marked by members of the research team to minimize human error.

Secondary measures included five ratings of children (baseline, interim, posttest, and three- and four-year follow-ups) provided by the teachers who were delivering the program through the posttest. Teacher ratings at the three- and four-year follow-ups were independent, however. All measures came from the 21-item Teacher Observation of Children's Adaptation checklist. The disruptive behavior subscale included items reflecting disobedient, disruptive, and aggressive behaviors. The concentration problems subscale included items reflecting inattentive and off-task behavior. The prosocial behavior subscale included items reflecting positive social interactions. Internal consistency of the subscales was excellent (alphas > 0.87 at baseline).

Three measures focused on teacher behavior. Teacher efficacy in classroom management was assessed using the 4-item subscale of the short-form Ohio State Teachers' Sense of Efficacy Scale (alpha = 0.90 at baseline). Teacher stress was captured using the 5-item pupil misbehavior subscale of the Teacher Stress Inventory (alpha = 0.82 at baseline). Teacher retention was assessed through the use of a single item: "How likely are you to leave the teaching profession in the next 5 years?" Responses on a 6-point scale ranged from definitely to definitely not.

Analysis: For student outcomes at posttest, the analysis used multilevel models with fixed effects, random intercepts, and controls for baseline outcomes. The two levels of students nested within schools adjusted for clustering. Checks on the main models used multiple imputation for missing data. Moderation models included cross-level interaction terms for school-level treatment by student-level gender and risk status.
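A model of the kind described above can be written as follows. This is a generic sketch of a two-level random-intercept model, not the authors' exact specification. All symbols are assumptions introduced here: i indexes students, j indexes schools, T_j is the school-level treatment indicator, Y^0_{ij} is the baseline score, and X_{ij} is a student-level moderator such as gender or risk status.

```latex
% Main-effects model: students (i) nested within schools (j)
Y_{ij} = \beta_0 + \beta_1 T_j + \beta_2 Y^{0}_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0, \tau^2), \quad \varepsilon_{ij} \sim N(0, \sigma^2)

% Moderation model: adds a cross-level interaction of treatment with X_{ij}
Y_{ij} = \beta_0 + \beta_1 T_j + \beta_2 Y^{0}_{ij} + \beta_3 X_{ij}
       + \beta_4 \, T_j X_{ij} + u_j + \varepsilon_{ij}
```

In this notation, \beta_1 is the program effect in the main-effects model and \beta_4 is the moderation effect. The random intercept u_j captures the clustering of students within schools.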

For teacher-level outcomes, the analysis used single-level linear regression models with follow-up scores at interim and posttest as the outcome and with controls for baseline scores. The timing of the baseline and outcomes varied depending on the grade taught by the teacher. Checks on the main models used full information maximum likelihood with robust standard errors to account for missing data.

For student outcomes at long-term follow-up (Troncoso & Humphrey, 2021), the analysis used a multilevel growth curve modeling approach that is similar to latent growth curve modeling, except that the multilevel specification fits a single (invariant) within-individual variance, which the authors note "is a reasonable assumption when time is treated flexibly" (p. 73). Analyses proceeded by fitting a multivariate non-linear growth curve model to the data (see Appendix A, pp. 79-80). The model treats time as a cubic polynomial, allows the intercept to vary randomly across children and schools, allows the slope of the linear term for time (i.e., the growth rate) to vary randomly across children, and controls for a set of baseline socioeconomic, demographic, behavioral, and school-level characteristics, as well as correlations between the outcomes. The model assumed uncorrelated within-individual residuals. The key tests of the intervention came from the coefficients for condition by time, time squared, and time cubed.
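For a single outcome, a growth model with these features might be sketched as below. The notation is assumed for illustration and is not copied from the authors' Appendix A: t indexes measurement occasions, i children, and j schools; T_j is the condition indicator; x_{ij} is the vector of baseline covariates. Whether the authors included a main effect for condition (\gamma_0 here) is also an assumption. The multivariate version fits one such equation per outcome and lets the random effects correlate across outcomes.

```latex
Y_{tij} = \beta_0 + \beta_1 t + \beta_2 t^2 + \beta_3 t^3
        + T_j \left( \gamma_0 + \gamma_1 t + \gamma_2 t^2 + \gamma_3 t^3 \right)
        + \mathbf{x}_{ij}^{\top} \boldsymbol{\delta}
        + v_j + u_{0ij} + u_{1ij}\, t + e_{tij}

% v_j: school random intercept
% u_{0ij}, u_{1ij}: child random intercept and random linear slope
% e_{tij}: within-child residual with a single variance,
%          assumed uncorrelated across occasions
```

In this sketch, the tests of the intervention correspond to \gamma_1, \gamma_2, and \gamma_3, the condition-by-time terms.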

Intent-to-Treat: The principal analyses for students used fully observed data, including those with data at both baseline and posttest, but the study also reported results with partially observed data via multiple imputation or full information maximum likelihood. For student outcomes at long-term follow-up, Troncoso & Humphrey (2021) used all available data, noting that their analytic approach is advantageous because "the variance-covariance matrix is efficiently estimated even in the presence of missing data, rendering it equivalent to full information maximum likelihood (FIML) because it uses all the available information and results are therefore unbiased under the assumption of data Missing at Random (MAR)" (p. 73).

Outcomes

Implementation Fidelity: About 70% of the classes were rated as meeting fidelity or quality standards. However, the authors noted that the frequency and duration of delivery did not reach the levels expected by the developer. One-quarter of schools in the intervention arm ceased implementation before the end of the trial (although tests suggested that high-implementation schools did no better, and perhaps worse, on the outcomes).

Baseline Equivalence: The authors stated that there were no significant differences between conditions on the 11 baseline measures listed in Table 4. For the four student measures, they also noted that the effect sizes were very small, ranging from 0.01 to 0.11.

Differential Attrition: Tests used logistic regression to predict missingness. Of 10 baseline measures, five significantly predicted missingness (Appendix 8). The authors argued that the differences were not large and that imputation adjusted for the differences. Troncoso & Humphrey (2021) also fitted a multilevel model for missingness and did not find evidence for the main covariates (baseline outcomes, condition, three individual-level characteristics, and two school-level characteristics) predicting missingness. However, total attrition from randomization to four-year follow-up was higher in the intervention condition (30%) than the control condition (21%).

Posttest: Tests for the primary reading outcome and three secondary behavior outcomes failed to show significant program effects (Table 6). These findings also held when using multiple imputation (Appendix 9). In eight tests for moderation, two reached marginal significance (Table 6, p = .063, .053), suggesting that the program reduced concentration problems and disruptive behavior for boys at risk of conduct problems.

Tests for teacher-level outcomes of classroom management, stress, and retention also did not reach statistical significance.

Ashworth, Humphrey et al. (2020). This article examined the three subscales from the Teacher Observation of Children's Adaptation checklist (disruptive behavior, concentration problems, and prosocial behavior) at the end of the two-year study. The analyses used multilevel models with random intercepts and controls for baseline outcomes and used multiple imputation to account for missing data.

For the full sample, the intervention group did not differ significantly from the control group on any of the three outcomes. Further, interaction tests showed no differences in program effects by 1) the teacher self-reported use of key program procedures, and 2) student baseline risk levels for disruptive behavior, concentration problems, and prosocial behavior. Sensitivity tests confirmed, with one minor exception, the lack of main and moderation effects for the program.

Ashworth, Panayiotou et al. (2020). This article examined reading attainment at the end of year one and year two of the program. The outcome measure came from the Hodder Group Reading Test, a standardized measure shown to have high reliability. Dosage or compliance was assessed at the teacher level using an online GBG scoreboard system. Teachers reporting compliance above the 50th percentile or above the 75th percentile were treated as compliers.

The analysis presented multilevel intent-to-treat (ITT) and complier average causal effect (CACE) results. The CACE method compared intervention students in high dosage or complier classrooms with potential compliers in the control group who were similar on baseline characteristics to the intervention students. The authors stated that the data were missing at random (p. 227) and that the ITT analysis used full information maximum likelihood estimates to include all schools and students. The CACE models depended on the inclusion of strong predictors of compliance. Page 227 lists the predictors; for example, classrooms with low levels of baseline disruptive behavior were expected to be less likely to deliver the program with high frequency than classrooms with high levels of disruptive behavior.
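For readers unfamiliar with the estimand, the CACE can be stated in general terms as follows. The notation is assumed here and is not taken from the article: Y_i(1) and Y_i(0) are a student's potential outcomes under intervention and control, C_i = 1 marks a complier, and \pi_c is the proportion of compliers. The ratio form holds only under the standard assumptions that control units cannot receive the program and that assignment affects outcomes only through compliance.

```latex
\mathrm{ITT}  = E\left[ Y_i(1) - Y_i(0) \right]

\mathrm{CACE} = E\left[ Y_i(1) - Y_i(0) \mid C_i = 1 \right]

% Under the assumptions stated in the lead-in:
\mathrm{ITT} = \pi_c \cdot \mathrm{CACE}
\quad \Longrightarrow \quad
\mathrm{CACE} = \frac{\mathrm{ITT}}{\pi_c}
```

The article went beyond this simple ratio by using baseline predictors to identify likely compliers in the control group, which is why the quality of those predictors matters for the CACE results.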

The ITT analysis found no significant effects of the program on reading attainment at either one year or two years. The CACE analysis found no effects among high or moderate compliers at one year or among high compliers at two years, but it found a significant intervention effect at two years among moderate compliers (d = .10). The authors suggested that the program needed two years for a sleeper effect to emerge.

Long-Term: For the full sample, Troncoso & Humphrey (2021) found that the intervention altered the growth trajectory for one of three outcomes, with participants in the intervention group showing larger reductions in concentration problems from the baseline assessment through four-year follow-up than those in the control group.

In subgroup analyses examining intervention effects for a subsample of children with elevated levels of conduct problems at baseline (n = 485), results indicated that the intervention altered the growth trajectory for one of three outcomes, with at-risk children in the intervention group showing larger increases in prosocial behavior from the baseline assessment through four-year follow-up. Finally, the intervention did not alter the growth trajectory for any of the three outcomes when examining only male children or a subsample of at-risk males.

Study 11

The authors refer to the third generation of the program, which involves a whole-day version implemented in the first grade. The program combines three components into one integrated intervention that enhances the standard program in the areas of classroom behavior management, academic instruction (particularly in reading), and family-classroom partnerships. Study 1 (certified) evaluated the first generation of the program and Study 3 (not certified) evaluated the second generation of the program.

Summary

Wilcox et al. (2022) conducted a cluster randomized controlled trial that examined 961 first-grade students attending 12 elementary schools in Baltimore City, Maryland. The study randomly assigned 24 classrooms within the 12 schools to an intervention or control group. Assessments of aggressive and disruptive behavior were completed by teachers at the end of first grade and third grade.

The study found that, relative to the control group, the intervention group had significantly lower teacher ratings of:

Aggressive and disruptive behavior but only for males in Cohort 2 who exhibited a stable high trajectory of aggressive and disruptive behavior
Aggressive and disruptive behavior but only for females in Cohorts 1 and 2 combined who exhibited a stable low trajectory of aggressive and disruptive behavior

Evaluation Methodology

Design:

Recruitment: The sample came from 12 schools located in two administrative areas of Baltimore City. The schools used the standard district curriculum, had two to five first-grade classrooms, and were "performing less well academically." Students in schools came from two cohorts (n = 1073) starting in the 2003-2004 year. Excluding children missing teacher-reported baseline aggression left 961 children (90%). A third cohort of students was excluded from the study because the program had been scaled up to all classrooms.

Assignment: The study used a two-level randomized block design for the assignment. Within each of the 12 schools, all children entering first grade were randomized into first-grade classrooms before school began or at the time of assessment. Classrooms and teachers were then randomized to the intervention condition (12 classrooms, 526 students) or a school-as-usual control condition (12 classrooms, 541 students).

Assessments/Attrition: Children were assessed at baseline, mid-first grade, spring of first grade (posttest), and spring of third grade (two-year follow-up). Of 961 children with baseline data, the majority (70.2%) had complete follow-up data, and only a small number had measures only at baseline (1.0%). About 89% completed the third-grade follow-up.

Sample:

The sample consisted of 52.4% males, with a majority of African American students and students eligible for free or reduced-fee lunch.

Measures:

The study assessed three student outcomes (academic development, psychological well-being, and ratings of student aggressive and disruptive behavior), but it examined only the last one. Teachers rated the students' aggressive and disruptive behavior. Although teachers delivered the program and rated students in the first year, their ratings at the end of third grade were likely independent. The measure had good reliability.

Analysis:

The analysis used general growth mixture modeling in Mplus. Because assignment occurred at the classroom level, the clustering of students in classes within schools was accounted for by computing robust standard errors. However, the level-2 sample size of 24 may not be large enough to accurately estimate the standard errors, and the result may be to overstate the significance of the tests.
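To give a sense of why clustering matters with only 24 classrooms, the Python sketch below computes a design effect and an effective sample size. It does not reproduce the study's robust standard error calculation. The sample size and number of classrooms come from the text, but the intraclass correlation (ICC) is a hypothetical value chosen purely for illustration, and the formula assumes roughly equal classroom sizes.

```python
def design_effect(mean_cluster_size, icc):
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (mean_cluster_size - 1) * icc

n_students = 961      # analysis sample reported in the text
n_classrooms = 24     # randomized classrooms reported in the text
assumed_icc = 0.10    # hypothetical value, NOT reported by the study

m = n_students / n_classrooms           # average students per classroom
deff = design_effect(m, assumed_icc)    # variance inflation from clustering
effective_n = n_students / deff         # equivalent independent sample size

print(round(m, 1), round(deff, 2), round(effective_n))
```

Under this assumed ICC, the 961 students carry about as much information as roughly 196 independent observations. A different ICC would change that figure, so it should be read as an illustration of the concern rather than an estimate for this study.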

A key part of the analysis involved identifying classes or groups with similar trajectories in aggressive and disruptive behavior. All models forced the baseline distribution of these classes to be constant across conditions within a cohort. The initial models combined the two cohorts in a single analysis, but other models allowed effects to vary by cohort.

Missing Data Method: The models adjusted for attrition with FIML (under the assumption that the data were missing at random). The variables most likely related to missingness (e.g., teacher baseline ratings, cohort status, gender) were included in the model, but the few remaining auxiliary variables related to missingness were not included.

Intent-to-Treat: The analysis included all 961 participants with baseline teacher ratings.

Outcomes

Implementation Fidelity:

Because the program implementation started late in Cohort 1, the students received only about 50% of the dose that Cohort 2 received. However, teachers in Cohort 2 received less support than Cohort 1 teachers.

Baseline Equivalence:

The authors stated only that "GBG youth compared to control youth did not differ in their baseline levels of aggressive behavior or class prevalence (See Online Supplement Table 2)."

Differential Attrition:

The number of missing data points did not vary significantly by gender or intervention condition. The condition difference in attrition rates at the last follow-up was small enough to meet the What Works Clearinghouse (WWC) cautious and optimistic standards. Also, the FIML estimation likely adjusted for potential differential attrition.

Posttest and Long-Term:

The study first identified three trajectory classes among males (stable low, increasing, stable high) and three trajectory classes among females (stable low, medium stable, and increasing high). Tests for intervention effects showed no main effects. Additional results were defined by trajectory class, gender, and cohort combinations.

For Cohort 1 males, none of the three classes showed a significant intervention effect.
For Cohort 2 males, the stable high trajectory class (but not the other two classes) showed a significant intervention benefit.
For both Cohort 1 and Cohort 2 females combined, the stable low trajectory class showed a significant intervention benefit.