General and Best Data Practices

In addition to the practices below, we strongly suggest you review the Recommended Reading on our Other Resources page.

General data management and efficiency best practices

  • Consider reviewing the Strongly Recommended References on our Other Data Resources page.
  • For large projects, keep a README file in the top-level directory with a project summary: who was involved, the relevant dates, and a listing of the directory structure and the important files in the project folder.
  • Avoid creating unnecessary data sets; combine multiple data steps into a single step where possible.
  • Keep files zipped or compressed if you aren't using them.
  • Check for duplicate files when sharing a project folder with multiple users.
  • Do not keep duplicate copies of raw data in different software formats.
  • Avoid keeping unnecessary interim data sets.
  • Store common sub-expressions in variables rather than re-computing them.
  • Identify which portions of the program are using the most time. In Stata, "set rmsg on" causes the run time to be displayed after each command; in MATLAB, use the "tic" and "toc" functions to compute elapsed time.
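The last two points can be illustrated with a minimal Python sketch. It is only an illustration, not a recommended implementation: the function names (slow_version, fast_version, timed) and the normal-density calculation are hypothetical, and time.perf_counter stands in for the role that "set rmsg on" or "tic"/"toc" play in Stata and MATLAB.

```python
import math
import time


def slow_version(xs, mu, sigma):
    # Re-computes the normalizing constant and the variance term for every element.
    return [
        math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
        for x in xs
    ]


def fast_version(xs, mu, sigma):
    # Common sub-expressions are stored in variables once, before the loop.
    norm = 1.0 / (sigma * math.sqrt(2 * math.pi))
    inv_two_var = 1.0 / (2 * sigma ** 2)
    return [norm * math.exp(-((x - mu) ** 2) * inv_two_var) for x in xs]


def timed(label, func, *args):
    # Report the wall-clock time of one call so the slow portion can be identified.
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result, elapsed


xs = [i / 1000 for i in range(200_000)]
slow, t_slow = timed("recomputed each time", slow_version, xs, 0.5, 2.0)
fast, t_fast = timed("stored in variables", fast_version, xs, 0.5, 2.0)
```

The two versions return the same values up to floating-point rounding. The measured times depend on the machine, so treat them as a guide to where the time goes rather than as a benchmark.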

 

Optimization and maximum likelihood (any language)

  • Supply the analytic gradient and Hessian if possible.
  • Supply good starting values (for example, if bootstrapping, use the parameter values from the original data set as starting values for the bootstrap samples).
  • If calculations don't depend on the parameters being estimated, move them outside the likelihood or objective function so they are done only once, and save the results in global variables.
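The three points above can be sketched together on a toy problem. The example below is a hypothetical illustration in Python, assuming an exponential model whose rate is estimated by Newton's method. This model has a closed-form estimate, so it is used only because the gradient and Hessian are easy to check. The names make_objective and newton_mle are invented for the sketch, and a closure plays the role of the global variables mentioned above.

```python
import random


def make_objective(sample):
    # These quantities do not depend on the parameter, so they are computed once
    # here rather than inside every likelihood evaluation.
    n = len(sample)
    total = sum(sample)

    def grad_and_hess(rate):
        # Exponential log-likelihood: n * log(rate) - rate * total.
        grad = n / rate - total      # analytic first derivative
        hess = -n / rate ** 2        # analytic second derivative
        return grad, hess

    return grad_and_hess


def newton_mle(sample, start, tol=1e-10, max_iter=50):
    # Newton's method using the analytic gradient and Hessian.
    grad_and_hess = make_objective(sample)
    rate = start
    for iteration in range(1, max_iter + 1):
        grad, hess = grad_and_hess(rate)
        step = -grad / hess
        rate += step
        if abs(step) < tol:
            return rate, iteration
    raise RuntimeError("Newton iterations did not converge")


random.seed(0)
data = [random.expovariate(2.0) for _ in range(500)]
rate_hat, _ = newton_mle(data, start=1.0)

cold_iters = 0
warm_iters = 0
for _ in range(200):
    resample = random.choices(data, k=len(data))
    # Cold start: a generic starting value.
    _, k_cold = newton_mle(resample, start=1.0)
    # Warm start: the estimate from the original data set.
    _, k_warm = newton_mle(resample, start=rate_hat)
    cold_iters += k_cold
    warm_iters += k_warm

print(f"estimate on original data: {rate_hat:.4f}")
print(f"total Newton iterations, cold start: {cold_iters}")
print(f"total Newton iterations, warm start: {warm_iters}")
```

In this sketch the bootstrap fits that start from the original estimate should need no more iterations than those that start from a generic value. In a realistic model each iteration is far more expensive, so the saving matters more.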