General and Best Data Practices

 

General data management and efficiency best practices

  • Consider reviewing the Strongly Recommended References on our Other Data Resources page.
  • For large projects, keep a README file in the top-level directory with a project summary that includes who was involved, the relevant dates, and a listing of the directory structure and important files within the project folder.
  • Avoid creating unnecessary data sets: combine multiple data steps into a single step if possible.
  • Keep files zipped or compressed if you aren't using them.
  • Check for duplicate files when sharing a project folder with multiple users.
  • Do not keep duplicate copies of raw data in different software formats.
  • Avoid keeping unnecessary interim data sets.
  • Store common sub-expressions in variables rather than re-computing them.
  • Identify which portions of the program take the most time. In Stata, "set rmsg on" causes the run time to be displayed after each command; in MATLAB, use the "tic" and "toc" functions to measure elapsed time.
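
As a rough illustration of the last two tips, here is a minimal Python sketch (Python is used only as an example language; this page itself names Stata and MATLAB). The function names and the test data are hypothetical. The standard-library function time.perf_counter plays a role similar to MATLAB's "tic" and "toc".

```python
import time


def squared_deviations_recomputed(values):
    # Recomputes the same mean for every element (wasteful).
    return [(v - sum(values) / len(values)) ** 2 for v in values]


def squared_deviations_stored(values):
    # Stores the common sub-expression once and reuses it.
    mean = sum(values) / len(values)
    return [(v - mean) ** 2 for v in values]


def timed(func, *args):
    # Returns the function result together with the elapsed wall-clock time.
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed


values = [float(i) for i in range(2000)]  # illustrative data only
slow_result, slow_seconds = timed(squared_deviations_recomputed, values)
fast_result, fast_seconds = timed(squared_deviations_stored, values)

print(f"recomputed mean: {slow_seconds:.4f} s")
print(f"stored mean:     {fast_seconds:.4f} s")
```

Both functions return the same numbers; only the amount of repeated work differs. Actual timings depend on the machine, so time the real program rather than relying on these figures.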

 

Optimization and maximum likelihood (any language)

  • Supply analytic derivatives and the Hessian if possible, rather than relying on numerical approximations.
  • Supply good starting values (for example, if bootstrapping, use the parameter values from the original data set as starting values for the bootstrap samples).
  • If calculations don't depend on the parameters being estimated, move them outside the likelihood or objective function so they are done only once, and save the results in global variables.
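
The three tips above can be sketched together in a small Python example. Everything in it is an assumption made for illustration: the model (maximum likelihood for an exponential rate), the data values, the function names, and the choice of a plain Newton iteration. The parameter-independent quantities are kept in a closure here; global variables, as suggested above, would serve the same purpose.

```python
import math
import random


def make_objective(data):
    # These do not depend on the parameter, so compute them once,
    # outside the objective function.
    n = len(data)
    total = sum(data)

    def neg_loglik(rate):
        # Negative log-likelihood of an exponential model.
        return -(n * math.log(rate) - rate * total)

    def gradient(rate):
        # Analytic first derivative of neg_loglik.
        return total - n / rate

    def hessian(rate):
        # Analytic second derivative of neg_loglik.
        return n / rate ** 2

    return neg_loglik, gradient, hessian


def newton_minimize(gradient, hessian, start, tol=1e-10, max_iter=50):
    # Basic Newton iteration; returns the estimate and iterations used.
    x = start
    for iteration in range(1, max_iter + 1):
        step = gradient(x) / hessian(x)
        x -= step
        if abs(step) < tol:
            return x, iteration
    raise RuntimeError("Newton iteration did not converge")


data = [0.5, 1.5, 1.0, 2.0, 1.0]  # illustrative values only
neg_loglik, gradient, hessian = make_objective(data)
estimate, iterations = newton_minimize(gradient, hessian, start=0.5)

# Bootstrap: reuse the original estimate as the starting value.
rng = random.Random(0)
bootstrap_estimates = []
for _ in range(200):
    resample = [rng.choice(data) for _ in data]
    _, grad_b, hess_b = make_objective(resample)
    rate_b, _ = newton_minimize(grad_b, hess_b, start=estimate)
    bootstrap_estimates.append(rate_b)

print(f"estimate = {estimate:.6f} after {iterations} iterations")
```

For this model the estimate can be checked by hand, since the maximum likelihood rate is the sample size divided by the sum of the data. In a real problem the optimizer would normally come from a statistical package, but the same structure applies: precompute what you can, pass analytic derivatives, and start near the expected answer.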