Bad Science II: Brief, Small, and Artificial Studies


“We learned from correlational research that students who speak Latin do better in school. So this year we’re teaching everything in Latin.”

The oldest joke in academia goes like this. A professor is shown the results of an impressive experiment. “That may work in practice,” she says, “but how will it work in the laboratory?”

For practitioners trying to make sense of the findings of educational research, this is no laughing matter. They are often left to figure out whether or not there is meaningful evidence supporting a given practice or policy. Yet all too often academics report findings from experiments that are too brief, too small, and/or too artificial to be reliable for making educational decisions.

This problem is easy to see if you look at the original articles. Would you use or recommend a classroom management approach that was successfully evaluated in a one-hour experiment? Or one evaluated with only 20 students? Or one evaluated in a situation in which teachers in the experimental group had graduate students helping them in class every day?

The problem comes when busy educators or researchers rely on reviews of research. The reviews may make sweeping statements about the effects of various practices based on very brief, small, or artificial experiments, yet a lot of detective work may be necessary to find this out. Years ago, I was re-analyzing a review of research on class size and found one study with a far larger effect than all the others. After much sleuthing I found out why: it was a study of tennis instruction, in which students in larger tennis groups got a lot less court time.

So what should a reader do? Some reviews, including Social Programs that Work, Blueprints for Violence Prevention, and our own Best Evidence Encyclopedia, take sample size, duration, and artificiality into account. Otherwise, if you want to know for sure, you'll have to put on your own deerstalker and do your own detective work, finding the essential experiments that took place in real schools, over real periods of time, under realistic conditions. Evidence-based reform in education won't really take hold until readers can consistently find reliable, easily interpretable, and unbiased information on practical programs and practices.

In case you missed last week's first part in the series, check it out here: Bad Science I: Bad Measures

Illustration: Slavin, R.E. (2007). Educational research in the age of accountability. Boston: Allyn & Bacon. Reprinted with permission of the author.

Find Bob Slavin on Facebook!


On Motivation

Once upon a time there was a man standing on a city street selling pencils from a tin cup. An old friend came by and recognized him.

“Hank!” said his friend. “What happened to you? Didn’t you have a big job at the Acme Dog Food Company?”

Hank hung his head. “I did,” he said mournfully. “I was its chief scientist. But it closed down, and it was all my fault!”

“What happened?” asked his friend.

“We decided to make the best dog food ever. We got together the top experts in dog nutrition in the whole world to find out what dogs really need. We put in the very best ingredients, no matter what they cost.”

“That sounds wonderful!” exclaimed the friend.

“It sounded great,” sighed Hank, “but the darned dogs wouldn’t eat it!”

In educational development, research, and dissemination, I think we often make the mistake made by the mythical Acme Dog Food Company. We create instructional materials and software completely in accord with everything the experts recommend. Today, for example, someone might make a program that is aligned with the Common Core or other college- and career-readiness standards, that uses personalization and authentic problem solving, and so on. Not that there is anything wrong with these concepts, but are they enough?

The key factor, I'd argue, is motivation. No matter how nutritious our instruction is, it has to appeal to the kids. In a review of secondary reading programs that my colleagues and I recently wrote (www.bestevidence.org), we found that most of the programs evaluated were 100% in accord with what the experts suggest. In particular, most of them emphasized the teaching of metacognitive skills, which has long been the touchstone for secondary reading, and many also provided an extra instructional period every day, in accord with the popular emphasis on extra-time strategies.

However, the approaches that made the biggest differences in reading outcomes were not those that provided extra time. They included small-group or individual tutoring approaches, cooperative learning, BARR (a program focusing on building relationships between teachers and students), and a few technology approaches. The successful approaches usually included metacognitive skills, but so did many programs that did not show positive outcomes.

What united the successful strategies is that they all get to the head through the heart.

Tutoring allows total personalization of instruction, but it also lets tutors and students build personal, close relationships. BARR (Building Assets, Reducing Risks) is all about building personal relationships. Cooperative learning focuses on building relationships among students, and adding an element of fun and engagement to daily lessons. Some technology programs are also good at making lessons fun and engaging.

I can’t say for sure that these were the factors that made the difference in learning outcomes, but it seems likely. I’d never say that instructional content and strategies don’t matter. They do. But the very best teaching methods with the very best content are unlikely to enhance learning very much unless they make the kids eager to learn.

Half a Worm: Why Education Policy Needs High Evidence Standards

There is a very old joke that goes like this:

What’s the second-worst thing to find in your apple?  A worm.

What’s the worst?  Half a worm.

The ESSA evidence standards provide clearer definitions of “strong,” “moderate,” and “promising” levels of evidence than have ever existed in law or regulation. Yet they still leave room for interpretation.  The problem is that if you define evidence-based too narrowly, too few programs will qualify.  But if you define evidence-based too broadly, it loses its meaning.

We’ve already experienced what happens with a too-permissive definition of evidence.  In No Child Left Behind, “scientifically-based research” was famously mentioned 110 times.  The impact of this, however, was minimal, as everyone soon realized that the term “scientifically-based” could be applied to just about anything.

Today, we are in a much better position than we were in 2002 to insist on relatively strict evidence of effectiveness, both because we have better agreement about what constitutes evidence of effectiveness and because we have a far greater number of programs that would meet a high standard.  The ESSA definitions are a good consensus example.  Essentially, they define programs with “strong evidence of effectiveness” as those with at least one randomized study showing positive impacts using rigorous methods, and “moderate evidence of effectiveness” as those with at least one quasi-experimental study.  “Promising” is less well-defined, but requires at least one correlational study with a positive outcome.

Where the half-a-worm concept comes in, however, is that we should not use a broader definition of “evidence-based”.  For example, ESSA has a definition of “strong theory.”  To me, that is going too far, and begins to water down the concept.  What program in all of education cannot justify a “strong theory of action”?

Further, even in the top categories, there are important questions about what qualifies. In school-level studies, should we insist on school-level analyses (i.e., HLM)? Every methodologist would say yes, as I do, but this is not specified. Should we accept researcher-made measures? I say no, based on a great deal of evidence indicating that such measures inflate effects.

Fortunately, due to investments made by IES, i3, and other funders, the number of programs that meet strict standards has grown rapidly. Our Evidence for ESSA website (www.evidenceforessa.org) has so far identified 101 PK-12 reading and math programs, using strict standards consistent with ESSA definitions. Among these, more than 60% meet the “strong” standard. There are enough proven programs in every subject and grade level to give educators real choices. And we add more each week.

This large number of programs meeting strict evidence standards means that insisting on rigorous evaluations, within reason, does not mean that we end up with too few programs to choose among. We can have our apple pie and eat it, too.

I’d love to see federal programs of all kinds encouraging use of programs with rigorous evidence of effectiveness.  But I’d rather see a few programs that meet a strict definition of “proven” than to see a lot of programs that only meet a loose definition.  20 good apples are much better than applesauce of dubious origins!

This blog is sponsored by the Laura and John Arnold Foundation

Proven Tutoring Approaches: The Path to Universal Proficiency

There are lots of problems in education that are fundamentally difficult. Ensuring success in early reading, however, is an exception. We know what skills children need in order to succeed in reading. No area of teaching has a better basis in high-quality research. Yet the reading performance of America's children is not improving at an adequate pace. Reading scores have hardly changed in the past decade, and gaps between white, African-American, and Hispanic students have been resistant to change.

In light of the rapid growth in the evidence base, and of the policy focus on early reading at the federal and state levels, this is shameful. We already know a great deal about how to improve early reading, and we know how to learn more. Yet our knowledge is not translating into improved practice and improved outcomes on a large enough scale.

There are lots of complex problems in education, and complex solutions. But here's a really simple solution: tutoring, using proven programs.

Over the past 30 years researchers have experimented with all sorts of approaches to improve students’ reading achievement. There are many proven and promising classroom approaches, and such programs should be used with all students in initial teaching as broadly as possible. Effective classroom instruction, universal access to eyeglasses, and other proven approaches could surely reduce the number of students who need tutors. But at the end of the day, every child must read well. And the only tool we have that can reliably make a substantial difference at scale with struggling readers is tutors, using proven one-to-one or small-group methods.

I realized again why tutors are so important in a proposal I'm making to the State of Maryland, which wants to bring all or nearly all students to “proficient” on its state test, the PARCC. “Proficient” on the PARCC is a score of 750, with a standard deviation of about 50. The state mean is currently around 740. I made a colorful chart (below) showing “bands” of scores below 750, to illustrate how far students have to go to reach 750.

[Chart: bands of scores below the PARCC proficiency cutoff of 750]

Each band covers an effect size of 0.20. There are several classroom reading programs with effect sizes this large, so if schools adopted them, they could move children scoring at 740 to 750. These programs can be found at www.evidenceforessa.org. But implementing these programs alone still leaves half of the state’s children not reaching “proficient.”

What about students at 720? They need 30 points, or +0.60. The best one-to-one tutoring can achieve outcomes like this, but these are the only solutions that can.
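
To make the arithmetic concrete, using the approximate figures above (a proficiency cutoff of 750 and a standard deviation of about 50): the effect size a student needs is roughly (750 - current score) / 50. A student at 740 needs (750 - 740) / 50 = +0.20, a student at 720 needs (750 - 720) / 50 = +0.60, and a student at 700 would need (750 - 700) / 50 = +1.00.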

Here are mean effect sizes for various reading tutoring programs with strong evidence:

[Chart: mean effect sizes for reading tutoring programs with strong evidence]

As this chart shows, one-to-one tutoring, by well-trained teachers or paraprofessionals using proven programs, can potentially have the impacts needed to bring most students scoring 720 (needing 30 points or an effect size of +0.60) to proficiency (750). Three programs have reported effect sizes of at least +0.60, and several others have approached this level. But what about students scoring below 720?

So far I've been sticking to established facts, studies of tutoring that are, in most cases, already being disseminated. Now I'm entering the region of well-justified supposition. Almost all studies of tutoring cover just one year or less. But what if the lowest achievers could receive multiple years of tutoring, if necessary?

One study, over 2½ years, did find an effect size of +0.68 for one-to-one tutoring. Could we do better than that? Most likely. In addition to providing multiple years of tutoring, it should be possible to design programs to achieve one-year effect sizes of +1.00 or more. These may incorporate technology or personalized approaches specific to the needs of individual children. Using the best programs for multiple years, if necessary, could increase outcomes further. Also, as noted earlier, using proven programs other than tutoring for all students may increase outcomes for students who also receive tutoring.

But isn’t tutoring expensive? Yes it is. But it is not as expensive as the costs of reading failure: Remediation, special education, disappointment, and delinquency. If we could greatly improve the reading performance of low achievers, this would of course reduce inequities across the board. Reducing inequities in educational outcomes could reduce inequities in our entire society, an outcome of enormous importance.

Even providing a substantial amount of teacher tutoring could, by my calculations, increase total state education expenditures (in Maryland) by only about 12%. These costs could be reduced greatly or even eliminated by cutting spending on ineffective programs, reducing special education placements, and finding other savings. Having some tutoring done by part-time teachers may reduce costs. Using small-group tutoring (fewer than 6 students at a time) for students with milder problems may save a great deal of money. Even at full cost, the necessary funding could be phased in over a period of 6 years at 2% a year.
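
As a rough check on that phase-in arithmetic: adding about 2% of current expenditures each year for 6 years accumulates to roughly the 12% total estimated above.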

The bottom line is that the low levels of achievement, and the large gaps associated with economic and racial differences, could be improved a great deal using methods already proven to be effective and already widely available. Educators and policy makers are always promising policies that bring every child to proficiency: “No Child Left Behind” and “Every Student Succeeds” come to mind. Yet if these outcomes are truly possible, why shouldn't we be pursuing them, with every resource at our disposal?

How Networks of Proven Programs Could Help State-Level Reform

America is a great country, but it presents a serious problem for school reformers. The problem is that it is honkin’ humongous, with strong traditions of state and local autonomy. Reforming even a single state is a huge task, because most of our states are the size of entire small nations. (My small state, Maryland, has about the population of Scotland, for example.) And states, districts, schools, and teachers are all kind of prickly about taking orders from anyone further up the hierarchy.

The Every Student Succeeds Act (ESSA) puts a particular emphasis on state and local control, a relief after No Child Left Behind's emphasis on mandates from Washington. ESSA also contains a welcome focus on using evidence-based programs.

ESSA is new, and state, district, and school leaders are just now grappling with how to use the ESSA opportunities to move forward at scale. How can states hope to bring about major change, working one school at a time?

The solution to this problem might be for states, large districts, or coalitions of smaller districts to offer a set of proven, whole school reform models to a number of schools in need of assistance, such as Title I schools. School leaders and their staffs would have opportunities to learn about programs, find some appropriate to their needs, ideally visit schools using the programs now, and match the programs with their own needs, derived from a thorough needs assessment. Ultimately, all school staff might vote, and at least 80% would have to vote in favor. The state or district would set aside federal or state funds to enable schools to afford the program they have chosen.

All schools in the state, district, or consortium that selected a given program could then form a network. The network would have regular meetings among principals, teachers of similar grades, and other job-alike staff members, to provide mutual help, share ideas, and interact cost-effectively with representatives of program providers. Network members would share a common language and, drawing on common experiences, could be of genuine help to one another. The network arrangement would also reduce the costs of adopting each program, because it would create local scale to reduce the costs of training and coaching.

The benefits of such a plan would be many. First, schools would be implementing programs they selected, and school staffs would be likely to put their hearts and minds into making the program work. Because the programs would all have been proven to be effective in the first place, they would be very likely to be measurably effective in these applications.

There might be schools that would initially opt not to choose anything, and this would be fine. Such schools would have opportunities each year to join colleagues in one of the expanding networks as they see that the programs are working in their own districts or regions.

As the system moved forward, it would become possible to do high-quality evaluations of each of the programs, contributing to knowledge of how each program works in particular districts or areas.

As the number of networked schools increased across a given state, the state would begin to see widespread and substantial gains on state assessments. Further, all involved in this process would be learning not only the average effectiveness of each program, but also how to make each one work, and how to use programs to succeed with particular subgroups or solve particular problems. Networks, program leaders, and state, district, and school leaders would get smarter each year about how to use proven programs to accelerate learning among students.

How could this all work at scale? The answer is that there are nonprofit organizations and companies that are already capable of working with hundreds of schools. At the elementary level, examples include the Children’s Literacy Initiative, Positive Action, and our own Success for All. At the secondary level, examples include BARR, the Talent Development High School, Reading Apprenticeship, and the Institute for Student Achievement. Other programs currently work with specific curricula and could partner with other programs to provide whole-school approaches, or some schools may only want or need to work on narrower problems. The programs are not that expensive at scale (few are more than $100 per student per year), and could be paid for with federal funds such as school improvement, Title I, Title II, and Striving Readers, or with state or local funds.

The proven programs do not ask schools to reinvent the wheel, but rather to put their efforts and resources toward adopting and effectively implementing proven programs and then making necessary adaptations to meet local needs and circumstances. Over time this would build capacity within each state, so that local people could take increasing responsibility for training and coaching, further reducing costs and increasing local “flavor.”

We’ve given mandates 30 years to show their effectiveness. ESSA offers new opportunities to do things differently, allowing states and districts greater freedom to experiment. It also strongly encourages the use of evidence. This would be an ideal time to try a simple idea: use what works.

This blog is sponsored by the Laura and John Arnold Foundation

Where Will the Capacity for School-by-School Reform Come From?

In recent months, I’ve had a number of conversations with state and district leaders about implementing the ESSA evidence standards. To its credit, ESSA diminishes federal micromanaging, and gives more autonomy to states and locals, but now that the states and locals are in charge, how are they going to achieve greater success? One state department leader described his situation in ESSA as being like that of a dog who’s been chasing cars for years, and then finally catches one. Now what?

ESSA encourages states and local districts to help schools adopt and effectively implement proven programs. For school improvement, portions of Title II, and Striving Readers, ESSA requires use of proven programs. Initially, state and district folks were worried about how to identify proven programs, though things are progressing on that front (see, for example, www.evidenceforessa.org). But now I’m hearing a lot more concern about capacity to help all those individual schools do needs assessments, select proven programs aligned with their needs, and implement them with thought, care, and knowledgeable application of implementation science.

I’ve been in several meetings where state and local folks ask federal folks how they are supposed to implement ESSA. “Regional educational labs will help you!” they suggest. With all due respect to my friends in the RELs, this is going to be a heavy lift. There are ten of them, in a country with about 52,000 Title I schoolwide projects. So each REL is responsible for, on average, five states, 1,400 districts, and 5,200 high-poverty schools. For this reason, RELs have long been primarily expected to work with state departments. There are just not enough of them to serve many individual districts, much less schools.

State departments of education and districts can help schools select and implement proven programs. For example, they can disseminate information on proven programs, make sure that recommended programs have adequate capacity, and perhaps hold effective methods “fairs” to introduce people in their state to program providers. But states and districts rarely have the capacity to implement proven programs themselves. It is very hard to build state and local capacity to support specific proven programs. For example, when downturns in state or district funding come, the first departments to be cut back or eliminated often involve professional development. For this reason, few state departments or districts have large, experienced professional development staffs. Further, constant changes in state and local superintendents, boards, and funding levels make it difficult to build up professional development capacity over a period of years.

Because of these problems, schools have often been left to make up their own approaches to school reform. This happened on a wide scale in the NCLB School Improvement Grants (SIG) program, where federal mandates specified particular structural changes but left the essentials (teaching, curriculum, and professional development) up to the locals. The MDRC evaluation of SIG schools found that they made no better gains than similar, non-SIG schools.

Yet there is substantial underutilized capacity available to help schools across the U.S. to adopt proven programs. This capacity resides in the many organizations (both non-profit and for-profit) that originally created the proven programs, provided the professional development that caused them to meet the “proven” standard, and likely built infrastructure to ensure quality, sustainability, and growth potential.

The organizations that created proven programs have obvious advantages (their programs are known to work), but they also have several less obvious advantages. One is that organizations built to support a specific program have a dedicated focus on that program. They build expertise on every aspect of the program. As they grow, they hire capable coaches, usually ones who have already shown their skills in implementing or leading the program at the building level. Unlike states and districts that often live in constant turmoil, reform organizations or for-profit professional development organizations are likely to have stable leadership over time. In fact, for a high-poverty school engaged with a program provider, that provider and its leadership may be the only partner stable enough to help it with its core teaching for many years.

State and district leaders play major roles in accountability, management, quality assurance, and personnel, among many other issues. With respect to implementation of proven programs, they have to set up conditions in which schools can make informed choices, monitor the performance of provider organizations, evaluate outcomes, and ensure that schools have the resources and supports they need. But truly reforming hundreds of schools in need of proven programs one at a time is not realistic for most states and districts, at least not without help. It makes a lot more sense to seek capacity in organizations designed to provide targeted professional development services on proven programs, and then coordinate with these providers to ensure benefits for students.

This blog is sponsored by the Laura and John Arnold Foundation

Little Sleepers: Long-Term Effects of Preschool

In education research, a “sleeper effect” is not a way to get all of your preschoolers to take naps. Instead, it is an outcome of a program that appears not immediately after the end of the program, but some time afterwards, usually a year or more. For example, the mother of all sleeper effects was the Perry Preschool study, which found positive outcomes at the end of preschool but no differences throughout elementary school. Then positive follow-up outcomes began to show up on a variety of important measures in high school and beyond.

Sleeper effects are very rare in education research. To see why, imagine a study of a math program for third graders that found no differences between program and control students at the end of third grade, but in which a large and significant difference popped up in fourth grade or later. Long-term effects of effective programs are often seen, but how can there be long-term effects if there are no short-term effects along the way? Sleeper effects are so rare that many early childhood researchers have serious doubts about the validity of the long-term Perry Preschool findings.

I was thinking about sleeper effects recently because we have just added preschool studies to our Evidence for ESSA website. In reviewing the key studies, I was once again reading an extraordinary 2009 study by Mark Lipsey and Dale Farran.

The study randomly assigned Head Start classes in rural Tennessee to one of three conditions. Some were assigned to use a program called Bright Beginnings, which had a strong pre-literacy focus. Some were assigned to use Creative Curriculum, a popular constructivist/developmental curriculum with little emphasis on literacy. The remainder were assigned to a control group, in which teachers used whatever methods they ordinarily used.

Note that this design is different from the usual preschool studies frequently reported in the newspaper, which compare preschool to no preschool. In this study, all students were in preschool. What differed is only how they were taught.

The results immediately after the preschool program were not astonishing. Bright Beginnings students scored best on literacy and language measures (average effect size = +0.21 for literacy, +0.11 for language), though the differences were not significant at the school level. There were no differences at all between Creative Curriculum and control schools.

Where the outcomes became interesting was in the later years. Ordinarily in education research, outcomes measured after the treatments have finished diminish over time. In the Bright Beginnings/Creative Curriculum study, the outcomes were measured again when students were in third grade, four years after they left preschool. Most students could be located because the test was the Tennessee standardized test, so scores could be found as long as students were still in Tennessee schools.

On third grade reading, former Bright Beginnings students scored significantly better than former controls, and the difference was substantial (effect size = +0.27).

In a review of early childhood programs at www.bestevidence.org, our team found that across 16 programs emphasizing literacy as well as language, effect sizes did not diminish in literacy at the end of kindergarten, and they actually doubled on language measures (from +0.08 in preschool to +0.15 in kindergarten).

If sleeper effects (or at least maintenance on follow-up) are so rare in education research, why did they appear in these studies of preschool? There are several possibilities.

The most likely explanation is that it is difficult to measure outcomes among four-year-olds. They can be squirrelly and inconsistent. If a pre-kindergarten program had a true and substantial impact on children's literacy or language, measures at the end of preschool may not detect it as well as measures a year later, because kindergartners and kindergarten skills are easier to measure.

Whatever the reason, the evidence suggests that effects of particular preschool approaches may show up later than the end of preschool. This observation, and specifically the Bright Beginnings evaluation, may indicate that in the long run it matters a great deal how students are taught in preschool. Until we find replicable models of preschool, or pre-k to 3 interventions, that have long-term effects on reading and other outcomes, we cannot sleep. Our little sleepers are counting on us to ensure them a positive future.

This blog is sponsored by the Laura and John Arnold Foundation

Getting Past the Dudalakas (And the Yeahbuts)

Phyllis Hunter, a gifted educator, writer, and speaker on the teaching of reading, often speaks about the biggest impediments to education improvement, which she calls the dudalakas. These are excuses for why change is impossible.  Examples are:

Dudalaka         Better students

Dudalaka         Money

Dudalaka         Policy support

Dudalaka         Parent support

Dudalaka         Union support

Dudalaka         Time

Dudalaka is just shorthand for “Due to the lack of.” It’s a close cousin of “yeahbut,” another reflexive response to ideas for improving education practices or policy.

Of course, there are real constraints that teachers and education leaders face that genuinely restrict what they can do. The problem with dudalakas and yeahbuts is not that the objections are wrong, but that they are so often thrown up as a reason not to even think about solutions.

I often participate in dudalaka conversations. Here is a composite. I'm speaking with the principal of an elementary school, who is expressing concern about the large number of students in his school who are struggling in reading. Many of these students are headed for special education. “Could you provide them with tutors?” I ask. “Yes, they get tutors, but we use a small-group method that emphasizes oral reading (not the phonics skills that the students are actually lacking) (i.e., yeahbut).”

“Could you change the tutoring to focus on the skills you know students need?”

“Yeahbut our education leadership requires we use this system (dudalaka political support). Besides, we have so many failing students (dudalaka better students) that we have to work with small groups of students (dudalaka tutors).”

“Could you hire and train paraprofessionals or recruit qualified volunteers to provide personalized tutoring?”

“Yeahbut we’d love to, but we can’t afford them (dudalaka money). Besides, we don’t have time for tutoring (dudalaka time).”

“But you have plenty of time in your afternoon schedule.”

“Yeahbut in the afternoon, children are tired (dudalaka better students).”

This conversation is not, of course, a rational discussion of strategies for solving a serious problem. It is instead an attempt by the principal to find excuses to justify his school's continuing to do what it is doing now. Dudalakas and yeahbuts are merely ways of passing blame to other people (school leaders, teachers, children, parents, unions, and so on) and to shortages of money, time, and other resources that hold back change. Again, these excuses may or may not be valid in a particular situation, but there is a difference between rejecting potential solutions out of hand (using dudalakas and yeahbuts) and identifying and then carefully and creatively considering them. Not every solution will be possible or workable, but if the problem is important, some solution must be found. No matter what.

An average American elementary school with 500 students has an annual budget of approximately $6,000,000 ($12,000 per student). Principals and teachers, superintendents, and state superintendents think their hands are tied by limited resources (dudalaka money). But creativity and commitment to core goals can overcome funding limitations if school and district leaders are willing to use resources differently or activate underutilized resources, or ideally, find a way to obtain more funding.

The people who start off with the very human self-protective dudalakas and yeahbuts may, with time, experience, and encouragement, become huge advocates for change. It’s only natural to start with dudalakas and yeahbuts. What is important is that we don’t end with them.

We know that our children are capable of succeeding at much higher rates than they do today. Yet too many are failing, dudalaka quality implementation of proven programs. Let’s clear away the other dudalakas and yeahbuts, and get down to this one.

This blog is sponsored by the Laura and John Arnold Foundation