What Is Research and Evaluation Evidence and How Can We Use It?

Phelan Wyrick: Welcome to the session this morning. We are thrilled to have the opportunity to present today at the NIJ Conference and talk to you about what I've been sort of calling “What Counts as Evidence?”, and that's shorthand, but it's a major point of discussion these days, more and more discussion around evidence-based practices, evidence-based programs. You hear more and more legislators at different levels, policymakers talking about evidence-based work, and, in general, our experience is that there's a fair amount of confusion about what that means and how these terms are used and the definitions of these terms.

We don't necessarily claim to be coming to full conclusions here today, but we offer you our insights and share with you some of the work that's been done to add some clarity, I think, to the terms and the approaches that would constitute evidence-based practice and programs.

We've seen at this time that there are a number of pressures that are bringing this particular issue to the forefront. One of them is the administration itself. You'll recall in his inaugural address, President Obama talked about not the distinction between big or small government but government that works, and focusing in on trying to support those programs that work and discontinue support for things that don't work.

That was a general statement about effective governance, and I think at its core, a lot of what we are talking about when we think of evidence, when we think about what counts as evidence and evidence-based programs, is really at its core about effective governance. But, of course, we're also in an economic crisis, and that puts another level of pressure on states, localities and federal government for us to be thinking about this issue and to be focusing in on using our investments most wisely. As we know, departments, police departments all across the country are reducing their staff levels, losing officers. You've got prisons and jails that are overcrowded. You've got state and city budgets that have to be balanced, and so more and more attention is coming back to this issue.

Today, we've got three presentations, and I will reiterate one of the notes that is in your materials that we've had to reduce our presenters to the three of us.

Mike Farrell is not able to join us unfortunately, but we do have Stephanie Shipman here from the Government Accountability Office, and she is going to talk to you about recent work that they've done and a report that, if you haven't seen it, I highly recommend, on program evaluation and a variety of rigorous methods to identify effective interventions, and so she'll talk about that work.

And then we've also got Ed McGarrell from Michigan State University, director of their criminal justice programs and a professor there, who's worked extensively on gun programs including the Department of Justice's Project Safe Neighborhoods.

And then I will be talking about work that we are doing out of the Office of Justice Programs that we call our “Evidence Integration Initiative,” and I should quickly introduce myself. I'm Phelan Wyrick. I'm a senior adviser with the Office of the Assistant Attorney General in the Office of Justice Programs.

Stephanie Shipman: All right. Thank you. I'm glad to be here.

I'll preface my remarks: I do not have any background in the justice area. Most of my work is actually government-wide. I specialize at the Center for Evaluation Methods and Issues in trying to further program evaluation across the federal government, and so we do that through a variety of studies of agency evaluation activities, use of different kinds of methods for addressing various analytical problems, and then we do a lot of outreach, so this is part of that.

And what I wanted to do was, before I actually talk about our report, I wanted to — because I don't want to presume that the answer to the question, what is rigorous evidence, is only what I'm going to be talking about today and in our report, but rather, in terms of prime uses of research and evaluation in program management, I would push hard for — in all program management, that you'd be using performance monitoring and environmental scans to find out what's going on around you in the context that your program is operating, so you keep up to date, and use process and outcome evaluations to find out how to improve your program over time. And then there's a question, particularly at the federal level, about making large, broad-based decisions for national programs, what do we want other people to adopt, right, as opposed to managing your own program. And that's where we start getting to these questions about rigorous evidence of a program impact that I'm going to spend most of my time on.

Phelan already spoke to this issue about big pressure on trying to get improved government performance with more information about what works, right? Also, as they've started reviewing the evaluation evidence for some of the federal programs, they got very disappointed at the quality of the vast majority of the program evaluations that they've seen, and so there's been a push for more rigorous evaluation, and, particularly, there's been a push for more experimental methods to improve the estimates of program impact.

And where this comes from — and I'm assuming everybody has a certain amount of research background, but not a lot, OK, so the basic assumption, why would you use random assignment. When we're trying to assess whether any of our social programs work, we often use a method that actually started back in agricultural research and is adopted in medicine and is adopted to some extent but not as much in the social services, is we're trying to control. We know that the outcomes we're looking for are influenced by a wide variety of other factors outside the program, OK. All the really interesting stuff that happens is not completely controlled by our programs. Get past the IRS Debt Collection.

And so the idea is that we're going to look at the experience of people who are in the program or are receiving the treatment and those who did not receive the treatment, look at their results at the end and see if they're different.

Now, if we randomly assign them to those comparison groups, there shouldn't be anything significantly different between those groups except their exposure to this program. So then your deduction at the end is, “Well, if we see a difference in their outcomes at the end of our study, it must be the result of the program.” OK? That's the logic we're working with.
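A minimal sketch of that logic in Python, using simulated data. Everything here is an assumption for illustration: the size of the outside factor, the true program effect of 2.0, and the function name are hypothetical and do not come from any real evaluation. The point is only that a coin flip leaves the outside factor roughly balanced across groups, so the difference in average outcomes lands near the program effect.

```python
import random
from statistics import mean


def simulate_random_assignment(n=10_000, true_effect=2.0, seed=0):
    """Simulate an evaluation with coin-flip assignment (all values hypothetical)."""
    rng = random.Random(seed)
    outcomes = {True: [], False: []}   # outcomes for program / comparison group
    factors = {True: [], False: []}    # the outside factor in each group
    for _ in range(n):
        outside_factor = rng.gauss(0, 1)   # influences the outcome, not controlled by the program
        in_program = rng.random() < 0.5    # random assignment
        outcome = 3.0 * outside_factor + rng.gauss(0, 1)
        if in_program:
            outcome += true_effect
        outcomes[in_program].append(outcome)
        factors[in_program].append(outside_factor)
    effect_estimate = mean(outcomes[True]) - mean(outcomes[False])
    factor_gap = mean(factors[True]) - mean(factors[False])
    return effect_estimate, factor_gap


effect_estimate, factor_gap = simulate_random_assignment()
print(f"estimated program effect: {effect_estimate:.2f}")
print(f"gap in outside factor between groups: {factor_gap:.2f}")
```

With these settings the estimate should sit close to the assumed effect and the gap close to zero. If people chose for themselves whether to join, the gap would no longer be guaranteed to be small.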

Now, in pushes to try to focus federal funds on effective programs, the Coalition for Evidence-Based Policy, which is a nonprofit group, had suggested what they call the “Top Tier Initiative” to review evaluation evidence, to identify those interventions that were really, really strong, the ones that had been tested with experimental designs and found to have strong outcomes.

The Senate Homeland Security and Governmental Affairs Committee then requested GAO, us, to examine this Top Tier Evidence Initiative, how is that working, and, specifically, they said, “Well, what about this criterion of only those being found effective in randomized controlled experiments? Is that too narrow? What kinds of interventions would be best suited for that kind of criterion, and what types of interventions would this not actually be a good criterion for?” OK? And then, effectively, what other methods could we use? Right? The latter is what I'm going to talk about today.

We went about this essentially by reviewing the evaluation methods literature. We've got decades of people talking about this problem, and we summarized it in terms of what kinds of programs or interventions. “Interventions” is a tighter name; it's like what specific activities. And then we did a summary and then worked with evaluation experts, senior folks, to review what we had summarized and give a sense of whether we were capturing the consensus view.

The first thing that you come up with, the first point all of them made is that you wouldn't be looking at an effectiveness evaluation at all unless the intervention is important, and you actually want to know whether it's worth somebody else adopting it. Right? It's got to be clearly defined, so we know what it is and what it isn't, and well implemented, so we don't make the mistake of saying, “Oh. Well, we tested it,” and then find out that actually it was never put in place in the first place. They just said they had.

And second — thirdly, the study needs to be adequately resourced because some of the problem with the evaluation literature is that we have not spent enough resources to make sure that we got credible evidence, that we got, let's say, good survey responses, and so there were lots of questions left at the end of the study. This is a waste of resources. We don't have that many program evaluation resources, frankly, in the federal government to be spending it on studies that are not sufficiently well done to provide a conclusive answer.

The next point is it actually has to be a program where we need to worry about this, this external factors business, and, like I said, there are some programs like IRS Debt Collection. It's really pretty simple. You know, we want them to collect the right monies from the right people and do it in a reasonable period of time. Those are not the programs we're talking about. We're talking about the ones where we're trying to influence human behavior usually, and there's a lot of other factors going on besides our program.

Now, so now we've passed those hurdles, random assignment is absolutely considered very strong design when it's possible, practical and ethical.

First issue is you actually — the evaluator needs to control exposure to the intervention. We can't have people volunteering because we don't know if there is something different between those who volunteer and those who don't. The evaluator, in order to do this random assignment, we have to control who is exposed to the program.

Second, there's got to be limited coverage of the population, so we have somebody left over who's not receiving the intervention, who's not in the program. This sounds so obvious. Right?

The comparison groups have to be kept separate and distinct throughout the study. This is important, particularly in social interventions, because we're often working with knowledge, attitudes that are affecting behavior.

Sometimes the people who have learned that they're in the special new program behave differently because they perceive themselves differently and they perceive what they're actually experiencing a little differently. You want to keep out these other reasons why we might see differences at the end of the study. Right? So we usually want to keep them apart, and we want to make sure that the comparison group is not going out and getting the same services from somebody else out in the community.

And outcomes have to be observed within a reasonable time frame. This is the practical issue. Right? If you have to wait a long time to see these outcomes occur later, several years later in people's lives, you're going to have to follow them all that time. Right? So you are talking expense.

Thinking about the programs, you just flip that on the other side, you can start to see some of the issues. Entitlement programs, these are pensions, veterans' benefits and the like. Every single person who is eligible for those benefits must receive them. We cannot make decisions about withholding those.

Similarly, laws that prescribe particular activities occur or not occur, we cannot selectively apply those laws and randomly assign some people to be exposed to them and other people not to be.

You're thinking of state laws, yes, but I can't randomly assign which legislator will pass which version of those laws. Right? We can make comparisons but not random assignment.

Broadcast media. Radio, TV, Internet. Right? We don't control who's exposed. You, the participants, turn on the radio, turn it off when you hear that stupid message you get. Right? You skip over the ads and the other thing. You have control over whether you're exposed to those messages. Right? The evaluator isn't.

Comprehensive social reforms. These have occurred much more in the last 10, 15 years. This is an effort to say what we want to do is have a variety of multiple interventions and activities going on in a community to try to change some of the relationships, perhaps between institutions or between people in the neighborhood and their institutions, and what you're trying to do is initiate a variety of organic activity. This is very hard to control, identify and randomize, so it's not a good choice for randomization.

Negative events. And this is, you know, beyond we can't control hurricanes. We also are usually not allowed to expose people to a real substantive risk of harm just to find out whether our preventive or remedial programs are effective. We have to wait for those, those negative events to occur.

Random assignment is also not practical in a variety of settings. In the welfare area, where I've spent a lot of my time, and this also happens in education, where you're talking about needy populations that need particular services, staff will be very unwilling to deny services to people who they believe deserve them. So what you have to do is try to play with maybe alternative services or things like that. You have to work hard in some of those situations. It can be done, but it's going to be difficult.

Rare events and long time lags. Rare events is the same kind of problem: you've got to have large sample sizes in order to get statistically significant differences. It is possible. That's not the problem. The problem is it's expensive. So, if we've already decided this is an important question to understand, that we really need a good estimate of the effectiveness of these interventions, then, by all means, spend the money, but don't do it hoping that, you know, you can do it on the cheap.
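To give a rough sense of why rare events drive up sample sizes, here is a sketch using the standard normal-approximation formula for comparing two proportions. The rates, the 5 percent significance level, and the 80 percent power target are assumptions chosen for illustration, not figures from the report.

```python
import math
from statistics import NormalDist


def n_per_group(p_control, p_program, alpha=0.05, power=0.80):
    """Approximate sample size per group to detect a difference between two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_program * (1 - p_program)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p_control - p_program) ** 2)


# Hypothetical scenarios: the same 20 percent relative reduction in both cases.
common_event = n_per_group(0.50, 0.40)    # outcome happens to half the comparison group
rare_event = n_per_group(0.02, 0.016)     # outcome happens to 2 percent of the comparison group
print(f"common outcome: about {common_event} people per group")
print(f"rare outcome:   about {rare_event} people per group")
```

Under these assumptions the rare outcome needs a far larger study for the same relative reduction, which is where the expense comes from.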

And, finally, the broad, flexible programs. For us at the federal level, this is a big issue with a lot of the, what we call, “formula” or “block” grants, where we actually encourage localities to all be using different services, treat different populations. It makes it very difficult now to try to arrange a randomized trial experiment.

So, but there are alternatives. We are not painted into a box saying that we can't do anything else. We can, and there are a variety of rigorous methods that are available. I am going to do a very quick cartoon picture of the big — of the standard ones. By all means, read our report, and go through any major text and look for help to get more details.

The first is obvious: you have a nonrandomized comparison group; that is, you often have a new intervention that is being tested out someplace, and you're selecting people from the rest of the population who look as similar as possible to these folks in order to try to deal with that problem of whether there is anything substantially different between them to begin with. You have to be very careful and check all the baseline characteristics beforehand and the like.
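One simple way to pick a comparison group that looks as similar as possible is nearest-neighbor matching on a baseline measure. The sketch below is an illustration only: the IDs and scores are made up, it matches on a single characteristic, and real studies typically match on many characteristics at once.

```python
def match_comparison_group(participants, pool):
    """Greedy one-to-one nearest-neighbor matching on a single baseline score."""
    available = dict(pool)   # candidate ID -> baseline score
    matches = {}
    for person, score in participants.items():
        closest = min(available, key=lambda cid: abs(available[cid] - score))
        matches[person] = closest
        del available[closest]   # each candidate is used at most once
    return matches


# Hypothetical baseline scores for program participants and for the rest of the population.
participants = {"p1": 42, "p2": 55, "p3": 61}
pool = {"c1": 30, "c2": 44, "c3": 54, "c4": 60, "c5": 90}

matches = match_comparison_group(participants, pool)
participant_mean = sum(participants.values()) / len(participants)
matched_mean = sum(pool[c] for c in matches.values()) / len(matches)
print(matches)
print(f"baseline gap after matching: {participant_mean - matched_mean:.2f}")
```

Checking the baseline gap after matching is the step the speaker stresses: the comparison is only as credible as the similarity of the two groups before the program starts.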

Regression discontinuity analysis is not as common. It's a very powerful design. Here, we actually are assigning people into two groups, not randomly though, but deliberately. We use a pretest, a baseline measure that is quantitative, and say everybody below a certain score will be in the program, receiving services, and everybody above it will not. This gets at that “I don't want to withhold services from my needy population” problem.

Then what you do is you just analyze the data on the people right around the cutoff because, frankly, they're pretty similar. They're very similar compared to everybody who is brought in. Now, the catch is that this is very expensive. You're going to throw out the data on the rest of the group. So — but it is very powerful.
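A sketch of that analysis on simulated data, assuming a pretest score from 0 to 100, a cutoff of 50, and a true program effect of 3.0. All of these numbers, and the choice to fit a straight line on each side within 10 points of the cutoff, are assumptions made for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def regression_discontinuity(scores, outcomes, cutoff, bandwidth):
    """Estimate the jump in outcomes at the cutoff using only people near it."""
    pairs = list(zip(scores, outcomes))
    below = [(s - cutoff, y) for s, y in pairs if cutoff - bandwidth <= s < cutoff]
    above = [(s - cutoff, y) for s, y in pairs if cutoff <= s <= cutoff + bandwidth]
    at_cutoff_below, _ = fit_line(*zip(*below))   # fitted outcome at the cutoff, program side
    at_cutoff_above, _ = fit_line(*zip(*above))   # fitted outcome at the cutoff, no-program side
    return at_cutoff_below - at_cutoff_above


# Simulated (hypothetical) data: everyone scoring below the cutoff gets the program.
rng = random.Random(1)
CUTOFF, TRUE_EFFECT = 50.0, 3.0
scores = [rng.uniform(0, 100) for _ in range(40_000)]
outcomes = [10 + 0.5 * s + (TRUE_EFFECT if s < CUTOFF else 0.0) + rng.gauss(0, 2) for s in scores]

rd_estimate = regression_discontinuity(scores, outcomes, CUTOFF, bandwidth=10.0)
served = [y for s, y in zip(scores, outcomes) if s < CUTOFF]
not_served = [y for s, y in zip(scores, outcomes) if s >= CUTOFF]
naive_gap = sum(served) / len(served) - sum(not_served) / len(not_served)
print(f"estimate near the cutoff: {rd_estimate:.2f}")
print(f"naive gap using everyone: {naive_gap:.2f}")
```

In this simulation the naive gap is strongly negative because the people served started out with lower scores, while the comparison near the cutoff should come out close to the assumed effect.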

Alternatively, you have statistical analysis of observational data. We don't have control over who's exposed at all. So what we do is we measure their exposure to the program. When you think of dosage in medicine, it's like, you know, how much, how much violence on TV do you watch, those kinds of studies. Right?

And then we also measure the outcome variable on those people, and we correlate to see whether exposure to the program, and the extent of that exposure, is related to the outcomes. You're going to be using other statistical analysis to make sure there aren't any other differences associated with that exposure choice, right, that might be explaining the result. You've got more statistical analysis to do than in the randomized trials.
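The extra statistical work can be sketched as follows, with simulated data in which an outside factor (here called `need`, a hypothetical name) drives both how much exposure people get and their outcomes. The true exposure effect of 0.5 and every other number are assumptions. The adjustment removes the part of exposure and outcome explained by `need` and then relates what is left, which gives the same coefficient as a regression that includes both variables.

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def residuals(xs, ys):
    """What is left of ys after removing the straight-line relationship with xs."""
    b = slope(xs, ys)
    a = sum(ys) / len(ys) - b * sum(xs) / len(xs)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Simulated (hypothetical) observational data: nobody controls who gets how much exposure.
rng = random.Random(2)
need = [rng.gauss(0, 1) for _ in range(10_000)]
exposure = [2 + c + rng.gauss(0, 1) for c in need]
outcome = [0.5 * e + 2.0 * c + rng.gauss(0, 1) for e, c in zip(exposure, need)]

naive = slope(exposure, outcome)
adjusted = slope(residuals(need, exposure), residuals(need, outcome))
print(f"naive exposure-outcome slope: {naive:.2f}")
print(f"slope after adjusting for need: {adjusted:.2f}")
```

The adjustment only works for factors that were actually measured, which is why these designs need more supporting analysis than a randomized trial.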

Interrupted time series analysis is like performance monitoring. You're collecting data on an outcome for a period of time before a policy change which occurs to everybody, OK, like those law changes, a big moment in time, and then you watch whether the data change afterwards or not. Right?

OK. Long series, you're usually using administrative data. You're probably doing some other statistical analyses to deal with the waviness. Maybe there are time cycles, you know, in your outcome variable that you have to deal with.
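A simplified sketch of the idea: fit the trend in the period before the policy change, project it forward, and compare what was actually observed afterward against that projection. The monthly series below is simulated, the drop of 8 units at month 36 is an assumption, and the sketch ignores seasonal cycles that a real analysis would need to handle.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def interrupted_time_series(series, change_point):
    """Average gap between observed values and the projected pre-change trend."""
    intercept, slope = fit_line(list(range(change_point)), series[:change_point])
    gaps = [series[t] - (intercept + slope * t) for t in range(change_point, len(series))]
    return sum(gaps) / len(gaps)


# Simulated (hypothetical) monthly outcome: rising trend, then a drop when the policy changes.
rng = random.Random(3)
CHANGE_POINT, TRUE_SHIFT = 36, -8.0
series = [50 + 0.2 * t + (TRUE_SHIFT if t >= CHANGE_POINT else 0.0) + rng.gauss(0, 0.5)
          for t in range(48)]

shift = interrupted_time_series(series, CHANGE_POINT)
before, after = series[:CHANGE_POINT], series[CHANGE_POINT:]
simple_before_after = sum(after) / len(after) - sum(before) / len(before)
print(f"shift relative to projected trend: {shift:.2f}")
print(f"simple before/after difference:    {simple_before_after:.2f}")
```

Here the simple before/after difference understates the drop because it ignores the trend that was already under way, which is the reason for collecting a long series first.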

And, finally, for the really tough one on the comprehensive reforms where everything is going on and you really don't have control over it, the best people could come up with, because I really can't find a good comparison site, is to use an in-depth case study.

And beforehand — that's critical. Beforehand, predict what changes you expect to see, both in the operations of the programs, the way they relate to individuals, what kinds of outcome behavior changes you want to see. List all that out beforehand, where, who, et cetera, and then, as that program starts developing, map it, measure all of those process and outcomes and see if they match what your hypothesis was.

Now, there are, of course, other rigorous methods but, basically, what we are talking about, using those main designs, you can do a lot more to — what we're after is trying to isolate the impact of that particular program effect.

Basically, you're collecting more data, OK, more data, assembling more evidence to rule out all those alternative hypotheses that could be explaining a difference or a change in outcome behaviors, OK, so both baseline data, targeting comparisons. There's often — what about all the dropouts? The people who dropped out of the program early, what were their results afterwards?

What about — it's even better if it's because of a program administrative problem that one office shut down and didn't enroll as well. What happened to that office's participants or clients?

Basically, you want to gather a diverse body of evidence. What we do not want to encourage is relying on a single study. Right? This is a problem. A single study is a moment in time, a particular set of people in a particular moment, in a particular place. It's only a sample of all the occurrences that you might expect under this program, and you're worrying about trying to encourage everybody to do it this way. You want more data. So you want to see it in different settings, with different populations, right, and you might actually learn. It may not — it may be robust, but it may be robust for certain kinds of people. Right? It's important to know that.

Our basic observations from the study. As you may have guessed by now, we definitely believe that simply requiring evidence from randomized controlled trials alone would not be a good idea if you really want to identify the full set of effective practices. You're going to miss some.

Secondly, frankly, I think most people in the field are not thinking only of effectiveness when they decide what kinds of policies they're going to adopt. They want to know what the cost is. They want to know what other kinds of resources they need to have in order to implement this. They're going to want to know whether this type of program is going to be accepted by their community. That's a big one. That's a big one. It's not effectiveness alone.

And then, finally, from GAO, it's like, ah, for those of us, we review tons of evaluations. We would know, generally, GAO, government-wide, even nationwide. We would know a lot more if we had better designed and implemented evaluations, if we had better reporting of what people did in the study, and also, particularly, what did the comparison group experience. That often is not described very well, and more evaluations that actually directly compare alternatives that you're considering. Right? You're not considering, you know, a new program of services versus no services. You're saying I would like to know should I do it this way or that way. So you'd like to see evaluations that directly compare them.

To sum up, this first one is the title of the report with a report number, and I was thinking I was going to make handouts and I didn't have time. Sorry. But, essentially, if you go to the GAO website and you look for Reports and Publications, they'll have a little space, and you can just type in the number, GAO-10-30, and it'll pop up for you in PDF form.

I want to make a pitch for a couple other documents. I am on a task force with AEA, the American Evaluation Association, and last spring we wrote up a little white paper that describes the kinds of good practices, key elements for federal agencies to have in their evaluation capacity in order to actually integrate that into government management and improve programs, and so I would like to sell that here. And so, if you go to eval.org (the economists beat us to AEA) and look up the road map from the Evaluation Policy Task Force, you will find it.

Also, I'm involved with an informal network of federal evaluators, folks from all over the federal government, and on our website, fedeval.net, we actually have lists of resources, program evaluations, agency websites, GAO reports, oh, books that we really enjoy and think are really good texts and sources and the like. So you can look there. It's not updated very often, but, you know, some of the stuff is classic. And here's how you can find me, at shipmans@gao.gov.



Ed McGarrell: Well, good morning. I really enjoyed Stephanie's presentation, was sitting there thinking we're going to use the videotape of that in our evaluation courses at Michigan State.

You also might have seen me chuckling a few times because, in many of those challenges that were described, it felt like it was describing kind of my life. And, at one point, I looked at my NIJ monitor for a current project that involves the evaluation of a comprehensive anti-gang initiative across the U.S., when Stephanie made the comment about the importance of having timely evaluations in kind of the reality of the world, when you're looking at multiple sites, we have 10 sites that are implementing three different program components, all at different times and different places and creating real challenges. So I just recently had asked Louis, we need another extension on this evaluation. So I don't know how timely we'll be.

We also have a natural experiment this morning. I noticed that Stephanie had nine PowerPoint slides, following the dictates of good PowerPoint presentations. I will violate that principle. I have a ton of slides. We'll be moving quickly and maybe an inch deep on a lot of these topics, but, hopefully — and I think that the two presentations complement each other quite well.

You know, I think there's plenty of evidence that we haven't always used the goal of evidence-based practice to guide our research. And many of you are familiar with the University of Maryland study about 13 years ago that reviewed large numbers of evaluations in criminal justice research and different types of research, and they concluded that only about 13 percent of those studies met the standard of methodological rigor.

David Weisburd has a very recent article in Criminology and Public Policy that, again, reviewed a large number of policing evaluations, specifically looking at problem solving, and you can see (now, David's pretty strict on his standards of what meets methodological rigor) that if only 10 out of 5,500 are meeting that standard, we've got a ways to go.

And what you typically see — and this is — you know, I think Stephanie described this — is the kind of usual work involves looking at a single site, doing a simple pre- and post-assessment with no comparisons, and that does open us up to all this variety of rival explanations.

For the baseball fans out there that have followed the All-Star game over about the last 40 years, if you do a simple correlation of the tendency of the National League to win those games versus the tendency of the American League, you will find a strong correlation with violent crime trends in the United States.


McGarrell: As the American League has come to reassert dominance, violent crime has dropped dramatically. Now, I don't believe that that's the likely cause of that decline, but that's the problem with those kinds of evaluations.
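The trap can be shown with two simulated series that have nothing to do with each other except that one trends up and the other trends down over 40 years. The numbers are made up and are not baseball or crime data. The levels correlate strongly, while the year-to-year changes are not expected to.

```python
import random


def correlation(xs, ys):
    """Pearson correlation between two equal-length series."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def yearly_changes(series):
    """Year-to-year differences, which remove a steady trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Two independent, hypothetical series: one drifts up, the other drifts down.
rng = random.Random(4)
rising = [1.0 * year + rng.gauss(0, 3) for year in range(40)]
falling = [100 - 1.0 * year + rng.gauss(0, 3) for year in range(40)]

level_corr = correlation(rising, falling)
change_corr = correlation(yearly_changes(rising), yearly_changes(falling))
print(f"correlation of levels:         {level_corr:.2f}")
print(f"correlation of yearly changes: {change_corr:.2f}")
```

A single-site pre/post comparison is exposed to the same problem: anything else that moved over the same period will line up with the outcome.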

On the other hand, I think, as Phelan said in his introductory comments, there's a lot of evidence that we are entering an era where there is a serious commitment towards building this kind of evidence-based practice. And you hear that in the leadership of OJP and Laurie Robinson, Mary Lou Leary and others who've been pushing this for quite some time. You also see it within the field of criminology and a lot of different indicators.

Essentially, what we're trying to do is to build this evidence-based practice, and I would argue through three ways. One, as Stephanie laid out very nicely, is to build stronger research designs and to move from these simple designs to controlled comparisons, quasi-experiments and experiments, also to accumulating evidence through a number of studies and building that body of evidence through systematic reviews and meta-analyses and similar approaches. And, thirdly, I think, doing more to try to connect theory to our evaluation. If we can identify the theoretical components behind why we think an intervention should have an effect, if we see an effect or we don't see an effect, but we can link that back to should the intervention have had an effect according to the theory, then we will have stronger conclusions as we look at that evidence.
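As a rough illustration of accumulating evidence across studies, here is a fixed-effect, inverse-variance pooled estimate. The three effect sizes and standard errors are invented for the example and do not correspond to any study mentioned in this session. A real meta-analysis would also examine whether the studies differ from each other more than chance would explain.

```python
import math


def pooled_estimate(effects, standard_errors):
    """Fixed-effect meta-analysis: weight each study by the inverse of its variance."""
    weights = [1 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return pooled, pooled_se


# Hypothetical study results (negative means a reduction in the outcome).
effects = [-0.30, -0.10, -0.25]
standard_errors = [0.10, 0.20, 0.15]

pooled, pooled_se = pooled_estimate(effects, standard_errors)
print(f"pooled effect: {pooled:.3f} (standard error {pooled_se:.3f})")
```

The pooled standard error is smaller than that of any single study, which is the formal version of saying that a body of evidence is more persuasive than one evaluation.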

So what I want to do is to draw on about 15 years of evidence in these three areas. I do this at the risk of being very egocentric in that these are areas that I've been working in, but I think this is kind of a glass half full/glass half empty presentation. You know, you look at the Maryland study and the Weisburd study, and you might feel a little bit pessimistic about our ability to put together this evidence base, but, from another perspective, if you compare where we are today in 2010 to where we were in the early to mid-1990s, I think we've come a significant way in building evidence. And I'll use these as examples. Other researchers could look at correctional research or some drug intervention research and make similar comments, but these just happen to be the areas that I think I know a little bit about.

So let's talk about addressing gun violence. In the early 1990s, I was working in Spokane, Washington, with their police department, with the chiefs and sheriffs in the state of Washington, and doing a lot of interesting, what I thought was interesting work, in community policing and problem solving, and we were seeing some impact in terms of addressing neighborhood level disorder and the relationship between police and citizens, but I remember being asked by the chief of Spokane, “What do we do about this violence problem?”

About that time, I moved back to the state of Indiana, and I began working in partnership with the Indianapolis Police Department. And the mayor at that time was a former prosecutor, and Indianapolis was experiencing big increases in homicide and gun violence. And I remember meeting with him saying, “What does the research say in terms of effective interventions to deal with gun violence and homicide?” And I sat there in silence. There wasn't anything I could point to, and I think that was a pretty fair reading of the evidence at the time.

Since that time, there's been, again, a lot of evidence emerging that there are things that we can do. So I want to work through the development of this knowledge base because I think it builds on Stephanie's points in terms of how we've moved from weak designs but pointing us in a suggestive way towards promising interventions to continually building stronger designs, and I guess that would be the theme that I would hope to leave us with today.

So about this time that the mayor of Indianapolis was posing this question (and I was trying to remember this this morning; it was either an NIJ conference or an ASC conference), Larry Sherman did a presentation, which hadn't been published yet, on research he was doing in Kansas City, in which they had done a quasi-experiment. And what they looked at was, if we could use directed police patrol in gun crime hot spots and tell those police officers to focus on illegally possessed firearms, could we have an impact on gun crime? And, basically, the finding that came out of that was, as the seizures of illegally possessed firearms increased by 70 percent, those areas experienced a 49 percent decrease in firearms crimes, so pretty promising evidence here but again a single site, one point in time.

Indianapolis decided to see if they could implement a similar kind of approach. They had two target areas and two comparison areas. The target areas, you'll see labeled as the north and east districts. They were slightly different strategies, with a pretty significant drop in gun violence in that north area, a decline in the east area as well, and they had the comparison, so, in a quasi-experimental kind of approach, had a comparison area that actually had witnessed an increase. And then we also looked at what was happening at the city as a whole, and so the data seemed to suggest that in these two targeted areas using a strategy of directed police patrol looking for illegally possessed guns, that you could have an impact consistent with Kansas City.
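The arithmetic behind a target-versus-comparison design can be sketched as a difference in differences: the change in the target area minus the change in the comparison area. The counts below are invented for illustration and are not the Indianapolis figures.

```python
def percent_change(before, after):
    """Percent change from the before period to the after period."""
    return 100.0 * (after - before) / before


def difference_in_differences(target, comparison):
    """Change in the target area beyond what the comparison area experienced."""
    return percent_change(*target) - percent_change(*comparison)


# Hypothetical gun crime counts as (before, after) pairs.
target_area = (100, 70)        # a 30 percent drop where the patrols were directed
comparison_area = (100, 110)   # a 10 percent rise where nothing changed

net_change = difference_in_differences(target_area, comparison_area)
print(f"net change attributable to the intervention, under these assumptions: {net_change:.1f} points")
```

The comparison area stands in for what would have happened in the target area without the intervention, so the estimate is only as good as the similarity between the two areas.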

We did time series analysis, which ruled out some rival hypotheses but, basically, came away with this conclusion that when you focused on these gun crime hot spots and those illegally possessed firearms, it appeared that you could have a significant impact.

Then there was a third study in the series that came out of Pittsburgh, a very similar kind of approach. They added one additional piece to this research in that they, in addition to using police statistics, they also went to the trauma center and looked at gunshot injuries, and those gunshot injuries declined very consistently with the police data, so, again, suggesting that this kind of approach was having an impact.

When the Pittsburgh findings were presented, Sherman wrote a commentary on that, and he brought several additional studies. So, at that point, you ended up with eight tests of this intervention, all of which indicated a decline in gun crime, and suggesting — you know, you could question any one of these studies, but when you looked collectively, it did suggest a promising approach to addressing gun crime.

About the same time, in the mid-1990s, many of you are probably familiar with the Boston gun project. Boston Ceasefire used what some have referred to as a “pulling levers” approach to trying to address youth homicide. One of the things that I think is interesting about the Boston study is it followed this progression of trying to increasingly build the rigor of these evaluations. So the first study that came out of this was a simple pre/post comparison, and after implementing this intervention, they had a 65 percent reduction in youth homicide, pretty impressive. They later tried to strengthen that evidence by comparing what had happened in Boston to a large sample of other cities. Boston was the only place that had experienced this kind of decline, so, again, adding to that evidence base.

Indianapolis became one of the first places to try to replicate that. I put “replication” in quotation marks. The natural scientists in the room would probably throw something at me and say, “That's not really replication,” because it's almost impossible in criminological research where the context is always somewhat different to have a true replication but a very similar kind of approach, very similar kind of findings.

And our comparison, because this was a citywide kind of intervention, was to look at what happened in other similar midwestern cities, and Indianapolis was the only one to experience that kind of decline, very similar to the Boston findings.

Los Angeles saw a similar reduction or a somewhat more modest reduction, but they did — and, again, consistent with what Stephanie was saying — look at how the program was implemented as well but all three studies coming along and suggesting that this might have an impact.

About the same time, a few years later, some of you may be familiar with the Project Exile in Richmond. Here, the U.S. Attorney said, “I don't know how to address these problems of serious homicide and gun violence, but the one thing I can do as the U.S. Attorney is to prosecute people who are illegally possessing and using firearms to try to incapacitate the highest risk individuals who are shooting people on the streets and then try to deter others from illegally carrying those guns,” primarily focused on felons in possession of a firearm.

The first evidence that came out of Richmond was, again, this kind of simple pre/post comparison where homicides went down after this was implemented. The first, more formal evaluation raised some questions because it appeared that, well, the homicides are going down nationally at this time, so how do we know that Richmond isn't just following the national pattern.

Rick Rosenfeld later did a further evaluation looking at two years of follow-up and did this with a comparison of all other U.S. cities and found that, well, indeed, when you look at this, it appears that the decline in Richmond may have been due to this.

So you move from — if you take my word that in 1993, '94, we had very little evidence, by about 2000, there were these series of studies that had come out that at least pointed to some promising approaches to addressing homicide and gun violence. That provided the foundation for Project Safe Neighborhoods, a major Department of Justice initiative, in which every U.S. Attorney's office was to create a PSN task force and to do something, kind of building on these Boston Ceasefire, the Indianapolis as part of something called Strategic Approaches to Community Safety, or SACSI, and Project Exile.

There wasn't one single model that was followed or urged for adoption, but the model of the program was to use research and analysis to understand the local gun crime problems, so that you could then focus on those contexts that seemed to be driving gun crime, and then putting limited local and federal enforcement prosecution resources to focus specifically on those local hot spots. If there was a theoretical model, I guess I would argue it was to try to increase the credibility of a deterrent threat for illegal use of a gun in crime.

Now, the PSN evaluation has all the problems Stephanie talked about. It's a national program. So what do you compare it to? And we started off, I think, consistent with what Stephanie would have recommended. It was we'll do some case studies and so look at some of the places that have implemented the program. It appeared that there were two different models — one based on that Richmond approach and then the other that followed more like the Boston Ceasefire approach. In all 10 of these sites, we saw a decline in gun crime. In 2 of the 10, there were questions that you could raise in terms of the evaluation of that, but the evidence was suggestive.

One of the interesting things is, unlike a lot of prior research, the most rigorous design in Chicago also produced some of the strongest evidence. So you couldn't generalize from these case studies about PSN across the whole country, but at least it was suggestive that you might be having an impact on gun crime. So the next step in this was to try to distinguish by levels of dosage.

So the comparison becomes low-dosage, non-target cities with high dosage — or target cities and high-dosage locales. Again, I realize I'm moving through this pretty quickly, but happy to — actually, there's a presentation on Wednesday where we'll talk about this again.

But, to get to kind of the bottom line here, one of the ways of thinking about the dosage was to look at the level of federal prosecution, and so, in the cities that experienced high levels of federal prosecution of gun crime, you saw a very significant decline in violent crime. Percentage-wise, it looks like this, and our kind of contrast was between these target cities, and high-dosage sites had a 13 percent reduction. In the low-dosage, non-target cities where, in effect, I would argue, PSN really never occurred, you had an 8 percent increase in violent crime. All this was presented at last year's NIJ conference, so let me tell you something new.

We've recently published, looking, controlling for a lot of the other factors that we think affect city levels of violent crime and find that these patterns are consistent, and perhaps most important is we now have some firearm homicide data, and the patterns are very similar to what you saw in violent crime. So, again, the target cities and the high-dosage environment, about a 10 percent decline in gun homicides compared to these low-dosage sites or these non-target cities that actually experienced an 11 or 14 percent increase. That 25 percent gap, I think, is pretty telling or suggestive that this approach can have an impact.
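As a rough check on the figures quoted above, the gap between the two groups of cities can be worked out in percentage points. The sketch below is illustrative only: the function name `gap_in_points` is hypothetical, and the inputs are simply the rounded percentages mentioned in the talk, not the published estimates.

```python
# Percent changes as quoted in the talk (negative = decline, positive = increase).
# These are rounded spoken figures, used here only for illustration.

def gap_in_points(target_change: float, comparison_change: float) -> float:
    """Difference, in percentage points, between the comparison cities'
    change and the target cities' change."""
    return comparison_change - target_change

# Violent crime: 13 percent reduction vs. 8 percent increase.
violent_crime_gap = gap_in_points(-13.0, 8.0)

# Gun homicide: about a 10 percent decline vs. an 11 or 14 percent increase.
gun_homicide_gap_low = gap_in_points(-10.0, 11.0)
gun_homicide_gap_high = gap_in_points(-10.0, 14.0)

print(violent_crime_gap, gun_homicide_gap_low, gun_homicide_gap_high)
```

With these rounded inputs, the gaps come to 21 points for violent crime and 21 to 24 points for gun homicide, which is in the neighborhood of the roughly 25 percent gap cited. Note that these are differences in percentage points between two groups of cities, not a percent change in any single group.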

Now, a whole variety of limitations, I won't go into, since that's not nearly as much fun to talk about, but all the things Stephanie talked about, you could raise questions about this national evaluation.

Chicago Ceasefire, another promising approach, a public health approach where they used street workers to try to intervene in violent crime context. Again, a number of you may have heard about this. Wes Skogan and his colleagues have recently published an evaluation of the Chicago Ceasefire. It's complicated because, during this period, Chicago was experiencing a decline in violent crime, so teasing out whether it's Ceasefire or something else is difficult. But what they concluded, they looked at seven hot spot areas. They saw declines in six of those — where Ceasefire was occurring, the violent crime was down in six of those areas, and it appeared that at least in four of those six areas, that Ceasefire may have been associated with that decline.

To complicate the picture, we've recently completed an assessment of a similar program that was based on the Chicago model in Pittsburgh. The people that ran what's called “One Vision One Life” in Pittsburgh were actually trained in the Chicago model. We had three target areas, which are compared to both matched comparisons as well as statistically constructed comparisons, and not only did we not see any impact but actually saw an increase in aggravated assaults and gun assaults, leaving us somewhat perplexed on what this all means.

It may be something about whether the Pittsburgh program was implemented with the degree of fidelity to the logic model and the appropriate levels of dosage. It may be something that has to do with the different gang structure in the two cities. It may be the Chicago results were due to that overall decline in violence, or maybe it's something else. But I use this example because I think it, again, reinforces the point Stephanie made of the need to move from one place in one point in time, to looking at multiple interventions and try to — I think the next few years, we'll be trying to sort this out in terms of what do these Pittsburgh findings mean.

So now, 15 years, going from — I was trying to think about doing some kind of a pre and post slide here where if you asked me in 1994, everything would have been blank, but now I think we can point to a number of at least promising, if not evidence-based practices to inform how do we respond to gun crime.

I am going to make Phelan really nervous here because I got one minute to go through the rest of my slides, but let me just — so the point here would be, I think, clearly we've made a number of advances in these efforts. I want to point, go real quickly and could handle this in questions, but that Boston model has been applied to deal with drug markets. At this point, kind of a simple pre and post evaluation has been done.

There are a number of sites that are now using this kind of drug market intervention approach. There's some data that comes out of Rockford and Nashville that we've done that suggests, yeah, maybe this model is having an impact, but the point I wanted to make is I think this is really how we hope this evidence-based model will move forward, and that High Point gave us this promising evidence. A number of other sites have implemented this and seem to be having an impact, and now we see NIJ and BJA working together.

There's an RFP that may have just closed or is about to close — it is closed? OK. To do a very rigorous evaluation. So BJA is going to provide training to a group of sites in this model who will participate in this evaluation and then, using the most rigorous evaluation techniques, will apply them, and so we'll move — I think at the end of this process, we will know does this promising evidence hold out when you move to a more rigorous point.

So let me just conclude. I'll skip over restorative justice stuff. I knew I wouldn't get that in. But I think there are several points that I would hope to leave you with. One is to think about this as a building block kind of model, oftentimes moving from grounded experience, police-researcher or practitioner-researcher collaborations that look at promising practices, moving towards quasi-experiments and then, where possible, moving to the randomized controlled experiment.

I don't think it is an either/or choice, but we need to think about the strongest design that's feasible, given where we are in the state of knowledge at the time, given resource constraints and the nature of the intervention.

I would argue, again, linking theory to practice, and to conclude, I think one of the important points for those criminal justice executives and policymakers in the audience, I think one of the things that can really benefit the field is for you to insist on high-quality evaluations. Anthony Braga has a commentary on Weisburd, in the same issue that Weisburd's article is in, and he talks about his experience of working with Chief Ed Davis, first in Lowell, now in Boston. This was my experience in Indianapolis. When the chief, when the mayor, when the juvenile court judge, when a prosecutor says, “I'll try this, but I want to know whether it really has an impact,” that puts us in a position to do these kinds of challenging but very important evaluations that can give us that evidence base.

Thank you.


Wyrick: All right. Thank you very much, Ed and Stephanie.

I want to reiterate where we started a little bit on this, and the work out of GAO that Stephanie talked about came to them from Congress. They were being approached by folks who were saying, “You know what, you really need to be committed to this top tier, these randomized controlled trials,” and I'm sure they had a very in-depth conversation with those staffers on the Hill and so on.

But, unfortunately, a lot of folks in those kind of positions, non-specialists, policymakers, decision-makers, they only bring away certain take away messages from conversations like that, and that take away message might just be randomized controlled trial good, everything else bad.

So some of those messages that folks may walk away with can be really detrimental for us because, as we just saw, as Ed, I think, very well demonstrated, that there's a progression, what he called a “building block model,” towards building evidence and towards even getting to the point where it would make any sense to try to attempt an RCT or some of these most rigorous designs.

So I want to talk to you a little about an initiative that we've launched out of the Office of Justice Programs. A little over a year ago, our new Assistant Attorney General, Laurie Robinson, came into her position, and she had 10 major goals for the Office of Justice Programs. And those goals, two of them, included a real emphasis on — one of them was a real emphasis on science, and the focus on science has been articulated from the President, through the Attorney General on down to the Office of Justice Programs.

And one of those goals was really to focus on data-driven strategies and evidence-based practices, and we took a look inside our organization to really figure how should we approach this, and through a fairly extensive process of talking with folks from across our organization, we developed what we call our “Evidence Integration Initiative,” and I've got the three primary goals of that initiative up on the board.

Essentially, we're looking to improve the quality and quantity of evidence that we generate, looking to improve the management of knowledge and the integration of evidence, to inform program and policy decisions and really looking to improve the translation of evidence into practice.

Now, there's a number of goals under these — I'm sorry — a number of objectives under these goals and a variety of activities that we're going forward with, but I'm going to focus on a couple of pieces of this that speak to this question of generating evidence and also the goal two here, the managing and synthesizing and integrating evidence and knowledge, because what we found is that, you know, at the Office of Justice Programs, we're not the office of replicating evidence-based programs. OK? So we don't just fund a list of programs, go replicate these things. We fund a lot of innovation, and we put out a lot of flexible funding streams that allow the state and local professionals to really determine how best to address their needs.

Criminal and juvenile justice are fields where there's a lot of local innovation. In fact, that's where most of our innovation comes from, and it's not — much of this innovation doesn't come about through people who consider themselves program developers necessarily. They're simply people trying to solve problems at the local level, and they're professionals who have a lot of training and experience, but it may not be in research.

What we need to recognize is that our approach to addressing evidence should be cognizant of the fact that what we're trying to do is inform decision-making and get folks who are in those decision-making positions to integrate evidence into their process, and so, therefore, it has to be helpful for them. It can't paint them into a box, and it can't make them feel like their options are extremely limited.

So I want to talk a little bit, to start off with, about our evidence generation piece and say right out of the gate that we're very committed to increasing our commitment to randomized controlled trials and randomized designs. For the reasons that we've already discussed, that I think Stephanie laid out very well, these are very strong designs. They have a long-lasting impact, and where you can use them, we should be using them.

So one of the things that we've done for years now is we've put priority in National Institute of Justice solicitations. We put priority on randomized evaluation designs, where possible.

So we'll have a solicitation. We'll say, you know, priority will be given to randomized applications. My own candid assessment on that is that it hasn't yielded a high number of randomized controlled trials that actually get to the funding point, point of funding, and that's because it's very challenging for a researcher out there to unilaterally put a randomized controlled trial into place in an applied setting. And so it's very expensive, it's very difficult, and our timelines, even if we give you 60 days to put that application together, that's extremely difficult to do. So we do get a number of applications. A lot of them don't make it through the very competitive review process.

What we're trying to do to up the number of randomized controlled trials that we actually field is, I think, demonstrated in this one solicitation that I've listed here, the evaluation of the multi-site demonstration field experiment, what works in re-entry. This closes tomorrow. So, if you haven't seen it, you know, you got about 24 hours to put your —


Wyrick: No. It does close tomorrow, but this solicitation, even if you're not applying for it, you might want to take a quick look at it. It represents a deliberate effort on our part to work across NIJ and BJA to structure a field experiment out of the gate. So it allows us to address some of those challenges that we heard about in terms of being able to put a program together with distinct exposure to avoid contamination, to foster and nurture that randomization process that's so hard to really keep true in the field.

So we have a great deal of commitment to this, and our goal is to really put a series of randomized controlled trials in place, so that at any given year, we've got a number of them at different stages of development coming out of National Institute of Justice.

But I want to take sort of a big step back for a second because, when we went around our organization and started talking to people about evidence-based practice and evidence-based programs, we started hearing different things from folks in terms of what their priorities were, and it sort of informed our goals.

Well, one priority is we need to do better research; we need to do more research. Another priority is we need to really get that information into the hands of the practitioners in a way they can use them, but then there was this other priority about synthesizing and integrating information and evidence that was very important to people.

We do a lot of work that's cross-cutting for our organizations. So we've got a juvenile justice office, we've got a — BJA is sort of our criminal justice office. We've got a victims' office, and we've got a — you know, NIJ, the research and evaluation arm and BJS, the statistics arm, and, in that sense, it makes us very well situated for addressing these questions of evidence, but, frankly, we have cross-cutting issues where we're not all working off the same base of knowledge. The same thing's happening out in the field.

So we needed to step back and look at some basic definitions, and now we are — Department of Justice, I mean — at any time, we can go ahead and call up the National Academy of Sciences and get some of the best minds from across the country or even internationally to noodle on these questions.

We didn't do that. Instead, what we really tried to do is say, well, let's come up with some working definitions that really seem to make sense and could relate to our audience, our constituents. So we came up with one for evidence. Evidence is information about an object or question that's generated through systematic data collection, research or program evaluation using accepted scientific methods that are documented and replicable. It's still a little wordy, but the key pieces in there are systematic. It's documented. It's replicable. It's accepted scientific methods. OK?

So that doesn't say anything about whether the evidence is actually accurate. Right? We're taking our cues a little bit from the way evidence is used in the rest of the field of justice. It's not a foreign term to us. Right? We've got prosecutors, attorneys, investigators all over, work with evidence all the time. Some of it's good; some of it's bad. Evidence has different levels of reliability.

DNA evidence might be very strong evidence, but we don't say, “Well, you know, we're only going to look at DNA evidence.” We're not going to throw eyewitness evidence out the door completely just because we know there's biases and errors. OK? So we use evidence of different qualities to make life or death decisions every day all across our country, and we look at it on a case-by-case basis.

Similarly, that's how we feel evidence should be treated in this domain, but we break it out a little bit. So causal evidence is really what we've been talking about when we're talking about effectiveness, getting to effectiveness. Causal evidence provides information about the relationship between activities and interventions, activities or interventions and intended outcomes. So this is basically program evaluation, cause and effect, X leads to Y.

We distinguish that from what we call “descriptive evidence.” Maybe that's not the best name, but we're basically saying there's a whole other world of information out there that's collected through valid statistical or scientific methods and of varying qualities, but they characterize individuals, groups, events, or processes. They may use quantitative or qualitative research methods.

Now I'll tell you a little of my background on this. For nine years, I was the gang program coordinator for the Office of Juvenile Justice and Delinquency Prevention. Now, if you really want to understand some things about gangs, you'd probably do well by looking at some of that qualitative research. There's been some great qualitative research on gangs that tells us about group processes and how the internal dynamics of those gangs play out. It doesn't tell us whether this program will or will not be effective, but it tells us some important information. But how do we incorporate that into our conversations about evidence?

So we've got another term that we're working with that we talk about “actionable evidence.” At what point do you have actionable evidence? Now, this is a very basic definition. It's of sufficient quantity or quality to influence the decision-making regarding a practice or policy.

But now we're talking about something different, and this comes back to me in terms of the types of issues that we face at the Office of Justice Programs, that people around the country are facing. Every once in a while, actually fairly regularly, we have people come to us from a city somewhere in the country or a county and they say — and it's the mayor and it's the chief of police. They're coming to Washington to meet with a number of people, but they want to meet with us and they say, “We've got a bad problem, and we need to figure out some ways to address it.”

And the problems that they come up with are things that are usually pretty general, and they're going to get into a problem-solving kind of mode. And it's not going to be — it's rarely going to be something along the lines of what I'm looking for is a list of programs that I can pick from and just simply replicate. That's going to be a piece of it. That may be a piece of it, but they're usually thinking about multiple strategies. They might think about things that what are we going to do in the law enforcement, what are we going to do at the schools, because we're dealing with youth violence and that's a tricky issue. We've got truants, we've got kids that we're worried about who are deep into this, and we've got kids who we're worried about that are on the way, and we know there's stuff going on in the family. What do we have to say to these people?

And if we're just looking for randomized controlled trial evidence, then the answer is we're probably not going to have a whole lot to say to them. But there's so much more that we have, and in order for this field, I think, to really take steps forward in addition to focusing on those highest rigor studies, I think what we need to be focusing on as well is how can we use the wider range of evidence to help these folks.

So there are a number of factors that would probably be considered in terms of thinking about, well, is this evidence actionable. Obviously, the method design; what design do you have? What kind of findings are you getting from what kind of designs? The rigor of the particular study. Right? You can do a poor RCT. You can do a poorly executed randomized controlled trial, or you can do a well-executed one.

The point we heard that I think Ed McGarrell was just making, consistent results across multiple studies of different methods in different places by different investigators. So this question of multiple methods, we know that each method you're going to use has its own limitations inherent, but, as you get multiple methods of different kinds pointing in the same direction, you have a level of increased confidence. At what point does that confidence tip to the point where I'm willing to make a decision on that? Well, that might depend a lot on the importance and the size of the decision. Right?

Going outside of criminal and juvenile justice field, when we look at public health, we haven't seen — and I stole this line, but I think it's so good it's worth stealing. I'll credit David Altschuler, but we haven't seen a randomized controlled trial that says that smoking causes cancer or that it's bad for you. There are some things that we don't, for any number of reasons, conduct RCTs, but we know at this point that that's the result we're seeing. So there's a level; there's an accumulation of actionable evidence. On a smaller issue, you might be willing to make a move, a decision locally, based on a smaller piece of information.

So I want to give you a sense for how we can get to actionable evidence because it speaks a little bit to this question of, you know, we talk about how do we evaluate and how do we test the effectiveness of any particular programs, but I think what a lot of folks are also looking for is what should we develop in the first place. And whenever they were doing that deterrence work, whenever they were working up those different approaches to dealing with gun violence or any of these pieces, they were using information. They were using evidence at the outset to begin to develop those designs in the first place, and we've got to be in that game, and that game has to include — that game has to be recognized as part of the evidence-based work that we do.

So one of the things we're doing out of the Evidence Integration Initiative in our organization is we're testing out this idea of developing integration teams, evidence integration teams that are really going to draw together personnel and resources from across the Office of Justice Programs, to delve into the evidence and really synthesize and articulate, synthesize the evidence and try to articulate principles for practice and programs and policies.

This is not an enterprise that is entirely new. We've had any number of efforts to review the literature and make suggestions. One of the things we're doing here is also for the purpose of developing an understanding and an ability to use evidence within our own personnel and staff in the Office of Justice Programs because, if they're going to have any effect in being able to communicate the importance of that effort and the findings of that kind of effort to the constituencies that they serve, they need to understand it themselves a little bit.

So we're not just working with our researchers. We're not just working with our Ph.D.s, although we have a fair number of them. We have brought together program and policy people from across Office of Justice Programs to focus in on some target areas and dive into the research a little bit, and so they're going to look at some specific topics and start drawing together the evidence base.

I want to give you an example of this actionable evidence, what I think is actionable evidence based on a synthesis of existing — in this case, descriptive evidence. So, again, my background, I've done a lot of work on gangs, and this was something that we published out of the U.S. Attorney Bulletin in 2006. This is a graphic that was included in a chapter that I wrote, and, essentially, what this graphic does is convey an accumulation or a synthesis of descriptive evidence in a way that informs practice and policy. And I'll go through it pretty quickly here.

So the vertical axis is the share of illegal activity increasing as it goes up. The horizontal axis is relative share of the population, and what we're trying to talk about now is addressing gang problems within a community.

So we start from the understanding that we're talking about a community that has a significant amount of gang activity, and what we do is we present this general graphic to show that in those communities, you have a small portion of people at the top of this triangle, the group number one, small proportion of people who are involved in a lot of the serious violent offending. This comes from research that's been replicated in many cities, and the one that comes to my mind is out of Orange County, California. Eight percent of all of their offenders were responsible for over 50 percent of all their serious violent offenses. So that's one group that's out there in your community.

Below them, you've got another group that includes a larger number of active gang members and associates. Again, a whole body of research that gets into this question of once you enter a gang, your likelihood to offend, violent offending, property offending, weapons, drugs offending, all of that goes up. We know that from longitudinal research. We know that from qualitative research. We know it from survey and interview research with gang members themselves. And we know that they're not all up in that top group. So they're out there committing offenses.

Third group you've got is a high-risk group. Again, we get this from the longitudinal work. We get it from survey work, and it's the younger kids. And we know when they come to the attention of the system. They usually start coming to the attention of the system 8, 9, 10 years old. They start showing up with some delinquent acts, and they're high risk. We know what those risk factors are, and we know how risk factors work now. Again, all descriptive evidence, and it tells us something about this group of young people who are going to be the part from which the next cohort of gang members is most likely to come but not all of whom will go into the gang.

Finally, at the bottom of the triangle, you've got everyone else who lives in that community. It's the general population. They don't offend a whole lot, and they're a much larger group. So why is this actionable? Because all of those groups track or correlate directly with specific types of anti-gang strategies.

So, when I published this article in 2006, it was to the U.S. Attorney audience, and they had a vested interest in trying to understand gang prevention and how it relates to their overall efforts because they're a little more enforcement oriented.

We started getting invitations, “Come out and speak to us.” I've spoken all over the country essentially about this diagram, which is nothing more than a bunch of descriptive evidence synthesized into one sort of digestible graphic that puts together the relationship between different anti-gang strategies.

I've had people go beyond where I've said and say, “Well, really, this also speaks to how we should allocate our investments as well.” So I don't think that's quite exactly correct, but I think there is something there because you got to put a lot of energy to that top group. But what it also says is how do you put together a multi-strategy, anti-gang program within your community in a way that's going to hit on all the people who are really involved.

So this is a way that I think we should also be looking at evidence, that I think most of the discussion really isn't going in this direction right now. We're going to try to see how far we can go with this because I think this also puts social science researchers in an important role within the field. The more we can start talking about actionable evidence, the work that we can do to help develop programs and help develop them through these building blocks, if you will, as Ed put it, this building block model of building evidence towards effectiveness.

So, with that, I'll leave it and we'll open up for questions. Thank you very much.