National Institute of Justice (NIJ): Research, Development, Evaluation
 

Data Archiving Strategies for NIJ Funding Applicants

NIJ requires data sets resulting from funded research to be archived with the National Archive of Criminal Justice Data (NACJD) [1]. Data sets must be submitted 90 days before the end of the project period. In your application for NIJ research grants, you must include a brief (one- or two-page) data archiving strategy.

This data archiving strategy must briefly describe the:

  • Anticipated manipulations of original, intermediate and final data sets (as applicable).
  • Methods of documentation of such manipulations.
  • Preparation of original, intermediate and final data sets for archive submission.

Examples of Data Archiving Strategies

To help you create your own data archiving strategy, the following is a simple generic example of a data archiving strategy based on strategies from past applications. Three examples from successful applicants are also included.

Generic Example —

  • The applicant agrees to deliver the data files and pertinent data management documents to NIJ for archiving 90 days before the final project date.
  • The applicant indicates the sources of the data that will be used in the study.
  • The applicant agrees to submit to NIJ for archiving the raw data used in the study; all variables used in the study, whether recoded or newly created; the computer programs used to convert the data into multiple data sets at differing levels (e.g., state, metropolitan area, county and city levels; individual, family and community levels); any intermediate data set; and the final data set, including all modifications to the original data files.
  • The applicant designates the formats in which the data will be provided to NACJD (e.g., SPSS, Stata, SAS).
  • To document the differences between the original data files and the final data set, the applicant agrees to document any recoding or modification of original variables as part of the data archiving package, focusing on any scales or recoded variables developed for the study and including notes on construction of the variables and the rationale for their use.
  • The applicant agrees to provide a codebook for the original, intermediate and final data files, including variable names, variable labels, value labels, recodes, distributions and appropriate value codes for missing information.
  • The applicant agrees to provide any other information needed during the archiving process and any other information that will enable other researchers to replicate and extend the study.
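
As an illustration of the codebook item above, the following is a minimal sketch of how codebook entries might be assembled from a data file. It is not an NIJ or NACJD tool: the function name `build_codebook`, the toy variables, and the missing-value code -9 are all assumptions made for this example.

```python
from collections import Counter

def build_codebook(rows, labels, value_labels, missing_codes):
    """Return one codebook entry per variable in `rows` (a list of dicts)."""
    codebook = []
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        missing = missing_codes.get(name, set())
        codebook.append({
            "name": name,
            "label": labels.get(name, ""),
            "value_labels": value_labels.get(name, {}),
            "missing_codes": sorted(missing),
            # Frequency distribution of valid values; missing codes are counted separately.
            "distribution": dict(Counter(v for v in values if v not in missing)),
            "n_missing": sum(1 for v in values if v in missing),
        })
    return codebook

# Hypothetical toy data; -9 is an assumed missing-value code.
rows = [
    {"sex": 1, "arrests": 0},
    {"sex": 2, "arrests": 3},
    {"sex": 1, "arrests": -9},
]
codebook = build_codebook(
    rows,
    labels={"sex": "Respondent sex", "arrests": "Prior arrests (count)"},
    value_labels={"sex": {1: "Male", 2: "Female"}},
    missing_codes={"arrests": {-9}},
)
for entry in codebook:
    print(entry)
```

Statistical packages such as SPSS, Stata and SAS can produce comparable output with their own tools; the sketch only shows which pieces of information each codebook entry should carry.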

Following are data archiving strategies from successful applicants. They have been revised only to remove information that would identify the applicants.

Example 1

This project will follow a comprehensive data archiving plan that should facilitate future research on Chicago neighborhoods. The principal investigator has experience in preparing large-scale data files for research institutions (e.g., Lewis Mumford Center for Comparative Urban and Regional Research, State University of New York at Albany; Spatial Structures in the Social Sciences, Brown University) involving the provision of raw data variables and processed variables as well as extensive formal documentation. The archived files will include data and documentation for variables from all processing stages of the homicide and census data. The Community Survey of the Project on Human Development in Chicago Neighborhoods carries confidentiality requirements and is therefore not part of this data archiving strategy, although it remains available through the National Archive of Criminal Justice Data (NACJD). The archived files will consist of one codebook and two main data files: one at the tract level and one aggregated to the level of the neighborhood cluster.

The archived homicide data will include all computed variables for various types of homicide with various circumstances for the years 1980–2000. The proposed project will only use total homicide counts; however, another project will compute other homicide variables and all computed variables will be available in the archived file. Because the data used to compute these variables are already available from the Chicago Homicide Project archived at NACJD, archiving the raw data in this file would be redundant. However, the raw data can be included if NIJ and NACJD prefer. The archived census data will include all raw variables that were drawn from the decennial censuses of 1980, 1990, and 2000 as well as the computed variables. Inclusion of both raw and computed variables will allow future researchers to compute their own variables differently from how current researchers compute variables for the proposed project. It is also very likely that some extracted and computed variables will not be used in the final analysis. However, these will also be included in the archived data set for the use of future researchers. These data will be very valuable to researchers interested in studying Chicago neighborhoods but unwilling or unable to invest the time it takes to extract data from the U.S. Census for multiple decades. Again, the homicide and census data will be merged and provided in two separate data files: one at the tract level and one at the level of the neighborhood cluster.
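
The relationship between the tract-level file and the neighborhood-cluster file can be sketched as a simple aggregation. This is only an illustration under stated assumptions: the tract identifiers, the tract-to-cluster crosswalk, the variable names and the function `aggregate_to_clusters` are hypothetical and are not taken from the applicant's actual files.

```python
from collections import defaultdict

def aggregate_to_clusters(tract_rows, tract_to_cluster, count_vars):
    """Sum count variables from tract-level rows up to neighborhood clusters."""
    totals = defaultdict(lambda: dict.fromkeys(count_vars, 0))
    for row in tract_rows:
        cluster = tract_to_cluster[row["tract"]]
        for var in count_vars:
            totals[cluster][var] += row[var]
    # One output row per cluster, ordered by cluster identifier.
    return [{"cluster": c, **sums} for c, sums in sorted(totals.items())]

# Hypothetical tract-level file and tract-to-cluster crosswalk.
tract_rows = [
    {"tract": "0101", "homicides": 2, "population": 4000},
    {"tract": "0102", "homicides": 1, "population": 3500},
    {"tract": "0201", "homicides": 0, "population": 5200},
]
tract_to_cluster = {"0101": 1, "0102": 1, "0201": 2}

cluster_rows = aggregate_to_clusters(
    tract_rows, tract_to_cluster, count_vars=["homicides", "population"]
)
print(cluster_rows)
```

Only counts are summed here. Rates and other derived measures would need to be recomputed from the cluster-level sums rather than averaged across tracts, and that choice belongs in the codebook.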

A comprehensive codebook will be provided. This will include background information on the goals and design of the proposed project as well as background information on all sources of data used to generate the archived file. A detailed description will be provided for each variable. The description for raw variables will include a clear indication of each source file, a description of the variable from the source file and the column locations from the source file. This will allow future researchers to determine the source of the raw variable with certainty. The description for computed variables will include the procedure used to compute the variable, with clear references to the component variables used in the computation. This will allow researchers to perfectly replicate the computation of variables by using the original raw variables, if desired. The purpose of this comprehensive documentation and data archiving strategy is to allow future researchers to “follow along” in the computation of every variable for the purpose of replication and expansion of the proposed study.

Example 2

Although an investigator can never fully accommodate potential misalignment between his/her measures and data structure and future studies by others, the quality of the archived information can certainly be assured. This entails maximizing the ability of future researchers to assess the suitability of the data for their purposes by providing complete and accurate files and documentation. To that end, several steps will be taken to ensure that data emerging from the proposed study will be of use to others:

  • The data files (and pertinent management documents described below) will be delivered to NIJ for archiving 90 days before the final project date (March 30, 2011).
  • The initial data set extracted from the overall Project on Human Development in Chicago Neighborhoods (PHDCN) files and final data set, including all variables used in the proposed study (i.e., recoded or newly created variables), will be a part of the data files delivered to NIJ for archiving. This is true for files from both the individual and the neighborhood cluster levels.
  • As PHDCN data encompass a wide array of measurement domains and waves, a document containing the original (a) wave, (b) instrument and (c) cohort for each variable in the data file will be included with materials delivered for archiving. This will ensure that individuals who wish to augment data from this study with those from the original PHDCN files will be in a good position to do so.
  • A document pertaining to any recoding or modification of original PHDCN variables will be included with the data archiving package delivered to NIJ. This will focus on any scales or recoded variables developed for the proposed study. Such information will include notes on variable construction and rationale for use. This will serve as documentation of the differences between original PHDCN data files and the final data set(s) used for analyses.
  • A codebook with full variable names, variable labels, value labels, and full missing-value codes will be included for both the original (i.e., extracted) and the final (i.e., analyzed) data files.
  • Each of these steps will be taken as the files are being developed — rather than at the end of the study period — to ensure that documentation is thorough and the files can be delivered to NIJ by the designated date.
  • The investigator will be happy to provide any additional information needed by NIJ or the Inter-University Consortium for Political and Social Research staff during the archiving process. Also, as noted in the dissemination plan, provisions will be made to share information relevant to the multivariate analyses with researchers who are interested.
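
One way to keep the recoding document current while the files are being developed is to log each recode at the moment it is applied. The following is a minimal sketch under stated assumptions: the `recode` helper, the variable names and the age grouping are hypothetical and do not come from the PHDCN files.

```python
import json

def recode(rows, log, new_var, source_var, mapping, rationale):
    """Add `new_var` to every row and append a description of the recode to `log`."""
    for row in rows:
        # Values absent from the mapping become None so gaps stay visible.
        row[new_var] = mapping.get(row[source_var])
    log.append({
        "new_variable": new_var,
        "source_variables": [source_var],
        "rule": mapping,
        "rationale": rationale,
    })

# Hypothetical extracted rows and a hypothetical recode.
rows = [{"age": 15}, {"age": 18}, {"age": 12}]
recode_log = []
recode(
    rows, recode_log,
    new_var="cohort_group", source_var="age",
    mapping={12: "younger", 15: "older", 18: "older"},
    rationale="Hypothetical two-group split used for illustration only.",
)

# The log can be written out and delivered with the archiving package.
print(json.dumps(recode_log, indent=2))
```

Because each log entry is created by the same call that changes the data, the documentation cannot drift away from what was actually done to the file.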

In summary, the following files will be included in the data archiving package:

  • Original data files as extracted from PHDCN (individual, community level).
  • Document with sources for data drawn from original PHDCN files (e.g., cohort, wave).
  • Document with notes on recoding or computation of new variables.
  • Full codebook for original and final data files.
  • Final data files as used for analysis (includes modifications to extracted files).

Example 3

The proposed project would integrate numerous secondary data sources, including several already included in the National Archive of Criminal Justice Data (NACJD), and others that would be wise additions to this archive for the comprehensive study of crime trends. All of the raw data used for the project, as well as the computer programs used to convert the raw data into the final state, metropolitan area, county and city data files, would be submitted to the National Institute of Justice (NIJ) for distribution to others. Consistent with one of the key project objectives (to contribute to a comprehensive data infrastructure for studying crime trends), it would be beneficial to create a separate subarchive within NACJD that houses the data components used in and produced by the proposed work and perhaps other data compilations of crime trends. This suggestion is not driven solely by a belief in the importance of sustained research on crime trends, but also by the reality that this area of inquiry is somewhat unique: a relevant comprehensive data infrastructure would draw from literally hundreds of electronic data files. Providing only the final data set(s) in a centralized location from a study such as the proposed project (even with details on the location of the original component databases and how the final files and measures were constructed) is only modestly helpful, as this still requires that a substantial amount of work be redone by others (e.g., downloading and setting up each of the component files and merging the files) to replicate the project results and potentially make important modifications. Currently, it appears that researchers who study crime trends rarely take advantage of the work done by others. A centralized data archive would provide a major boost to researchers wishing to replicate the analyses as well as those who update the data, analyze crime trends in the future and use the data to generate crime forecasts.

As noted, the data used for the proposed project would be acquired from various sources, including the Uniform Crime Reporting Program, the Supplemental Homicide Reports, police employee (LEOKA) data, the National Prisoner Statistics, the National Corrections Reporting Program, the National Judicial Reporting Program, the Annual Survey of Jails, the Jail Census, the National Institute on Alcohol Abuse and Alcoholism, the Immigration and Naturalization Service, the Department of Homeland Security, the Current Population Survey, the National Center for Health Statistics, the Bureau of Labor Statistics, the Bureau of Economic Analysis and the National Highway Traffic Safety Administration.

Much of the data to be compiled in the proposed work has yet to be incorporated in a central repository, which could stimulate a comprehensive and systematic research agenda on recent crime trends. To facilitate such a research agenda, all of the raw data, intermediate files, and final databases compiled for the proposed project would be prepared and submitted to NIJ and also would be available on several publicly accessible Web sites (e.g., the principal investigator’s campus Web page and a Web site maintained by Professor Richard Rosenfeld, www.crimetrends.com). An important dimension of this planned dissemination is that it would include, in centralized locations, all of the raw data, all programs used to integrate the various databases and create the measures developed in the study, and all of the final databases analyzed in the proposed project. The latter is critical for replication of the proposed work, but the former two elements are particularly vital when modifying the coding and measurement decisions applied in the study and for expanding the data in the future.

The raw data would be accessed and modified, using widely known programming code (e.g., SPSS, SAS, and Stata), and would be archived along with the other materials. The programming code would contain a detailed recording of the modifications made, the variables created, and the analyses conducted. The specifics would vary somewhat across files, as would the structure of the files, but a project data codebook would serve as an integrated document that provides details of the various files and provides explicit guidance on how the final databases have been generated and how other researchers can modify the procedures used.

The project codebook would consist of two parts. The first would describe, in detail, how each of the final databases (e.g., the state, metro area, county and city) was constructed, beginning with a listing of the files accessed and embedded Web site links to the initial raw data and data definition statements (or original systems files) as well as the programs used to modify these files, link them with other files (where relevant) and create the specific measures used in the proposed research. In many cases, the process would entail looking at a large number of files (often, one for each year of a given collection), but requiring only modest modifications and fairly simple programming. While assuming that all of the component files are housed within a centralized location in NACJD, this section of the project codebook would enable a relatively easy and straightforward replication of the databases and analyses generated in the project.
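
The "one file for each year" step described above might look roughly like the following sketch, which stacks yearly files into a single long file while keeping only the variables of interest. The applicant names SPSS, SAS and Stata as the intended tools; Python is used here purely for illustration, and the file contents, variable names and the function `stack_yearly_files` are hypothetical.

```python
import csv
import io

def stack_yearly_files(files_by_year, keep_vars):
    """Combine one file per year into a single list of records with a `year` column."""
    panel = []
    for year in sorted(files_by_year):
        reader = csv.DictReader(files_by_year[year])
        for row in reader:
            record = {"year": year}
            # Keep only the requested variables; yearly layouts may differ otherwise.
            record.update({var: row[var] for var in keep_vars})
            panel.append(record)
    return panel

# Hypothetical yearly extracts, held in memory so the sketch is self-contained.
yearly = {
    2001: io.StringIO("state,robberies\nMO,120\nIL,340\n"),
    2000: io.StringIO("state,robberies,extra\nMO,110,x\nIL,330,y\n"),
}
panel = stack_yearly_files(yearly, keep_vars=["state", "robberies"])
for record in panel:
    print(record)
```

Values are left as the strings read from each file. Any type conversion or recoding would be a further modification to record in the project codebook.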

A second part of the project codebook would summarize the major analyses done in the proposed project, providing computer programming code so that others can simply run the code to replicate the analysis and easily modify it as they see fit. Where relevant, the programming code will include links to the portions of code used to generate specific results shown in reported tables. Overall, the goal of the data archival strategy in the proposed project is to make the data infrastructure and the research as transparent as possible. Advice is welcome from NIJ and from personnel who coordinate NACJD holdings on how best to accomplish this goal.

Notes

[1] There are limited exceptions to this policy. See the specific requirements as detailed in each solicitation and the funding documentation for every award.

Date Created: January 15, 2010