Best computing practices
See Wilson et al. (2014).
- Write programs for people, not computers
- Let the computer do the work (functions, scripts)
- Make incremental changes (use version control)
- Don’t repeat yourself (or others): no copy-paste!
- Plan for mistakes (add assertions to programs, code defensively)
- Optimize software only after it works correctly
- Document design and purpose, not mechanics
- Collaborate (GitHub pull requests/issues)
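Two of the practices above, "don't repeat yourself" and "plan for mistakes", can be sketched together. This is a minimal illustration in Python, not code from Wilson et al.; the function name `standardize` and the example columns are hypothetical.

```python
def standardize(values):
    """Return values shifted to mean 0 and scaled to unit standard deviation."""
    # Assertions document assumptions and fail early with a readable message.
    assert len(values) >= 2, "need at least two values to compute a standard deviation"
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    assert variance > 0, "all values are identical; cannot standardize"
    sd = variance ** 0.5
    return [(v - mean) / sd for v in values]


# One function applied to every column, instead of one pasted copy per column.
# The column names and numbers are made up for illustration.
columns = {
    "height": [150.0, 160.0, 170.0],
    "weight": [50.0, 60.0, 70.0],
}
standardized = {name: standardize(col) for name, col in columns.items()}
```

If the calculation ever needs to change, it changes in one place, and the assertions catch inputs the function was never meant to handle.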
Organization of projects
This section is inspired by Karl Broman’s notes.
- Put everything in a common directory. If using RStudio, for example, create a new project that will contain all the files corresponding to this project. You can link this project to a GitHub repository (see below).
- Separate raw from processed data; it is tempting to hand-edit data files: don’t!
- Separate code from data
- Don’t use absolute paths
- Use readme and Markdown (md) files to explain the structure of the folder and of the files within the folder/subfolders; create log files as a “Dear diary” with details of analyses. See this Markdown cheatsheet
- Use R Markdown (Rmd) files for data analyses and reports. See this R Markdown tutorial
- Slow down and think about file/folder organization
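As a rough sketch of the layout ideas above (one common directory, raw data kept apart from processed data, code kept apart from data, no absolute paths), the following Python helper builds every path relative to a single project root. The folder names `data/raw`, `data/processed`, `code` and `results` are assumptions chosen for illustration, not a layout prescribed by these notes.

```python
from pathlib import Path


def project_paths(root):
    """Build all project paths relative to one root directory."""
    root = Path(root)
    return {
        "raw": root / "data" / "raw",              # never edited by hand
        "processed": root / "data" / "processed",  # always regenerated by code
        "code": root / "code",
        "results": root / "results",
    }


def create_layout(root):
    """Create the folders and a starter README.md if they do not exist yet."""
    paths = project_paths(root)
    for folder in paths.values():
        folder.mkdir(parents=True, exist_ok=True)
    readme = Path(root) / "README.md"
    if not readme.exists():
        readme.write_text("# Project\n\nDescribe the folder structure here.\n")
    return paths
```

Calling `create_layout(".")` from the project directory keeps every path relative, so the project still works after it is moved or cloned to another machine.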
Convention on naming files (very important for globbing to narrow file listing):
- Avoid spaces, punctuation, accented characters, case sensitivity
- Deliberate use of delimiters
- Name contains info on content
- Put something numeric first (left pad numbers with zeros)
- Use the ISO 8601 standard for dates: YYYY-MM-DD
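The naming conventions above can be illustrated with a small Python helper. It is only a sketch: the function `make_filename`, the choice of `_` between fields and `-` within a field, and the example description are all hypothetical.

```python
import datetime
import fnmatch
import re


def make_filename(index, description, date, extension):
    """Build a name such as 01_fit-model_2024-03-05.csv."""
    # Lowercase, then collapse anything that is not a-z or 0-9 (spaces,
    # punctuation, accented characters) into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    # "_" separates fields, "-" separates words inside a field;
    # the index is left-padded with zeros, the date is ISO 8601.
    return f"{index:02d}_{slug}_{date.isoformat()}.{extension}"


day = datetime.date(2024, 3, 5)
names = [make_filename(i, "Fit model", day, "csv") for i in (1, 2, 10)]

# Zero padding makes alphabetical order agree with numeric order.
in_order = sorted(names) == names

# Consistent delimiters make glob patterns easy to write.
first_nine = fnmatch.filter(names, "0?_*.csv")
```

Without the zero padding, `10_...` would sort before `2_...`, and a pattern such as `0?_*.csv` could not select steps 1 to 9.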
Write clear code
This section is inspired by Karl Broman’s notes.
- First code that works, then efficiency
- Readable for humans; code format: indentation, white space, meaningful names
- Modular, reusable (no copy-paste of lines: functions)
- Write general code (not specific to data/situation at hand)
- No global variables ever!
- Comment code, but mostly the big picture: major sections and input/output, not minor details that can be understood from the code itself; “plan to spend 1/4 time commenting” (Karl Broman)
- Meaningful error messages; tests/checks for inputs; document assumptions on input
- Statistics: rows=individuals, columns=variables
- Machine learning: rows=variables, columns=individuals
- Code defensively; handle cases that “can’t happen”
- Slow down, breathe, don’t be in a hurry!
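Several of these points (general code, no global variables, checks on inputs, meaningful error messages, documented assumptions) fit in one short function. The sketch below is in Python and uses the statistics convention of rows as individuals and columns as variables; the name `column_means` is hypothetical.

```python
def column_means(table, column_names):
    """Return the mean of each column.

    Assumptions on input: `table` is a list of rows (individuals), each row a
    list of numbers with one value per entry in `column_names` (variables).
    """
    # Everything the function needs arrives as an argument: no global variables.
    if not table:
        raise ValueError("table is empty: expected at least one row")
    for i, row in enumerate(table):
        if len(row) != len(column_names):
            # Say what was found, what was expected, and where.
            raise ValueError(
                f"row {i} has {len(row)} values but "
                f"{len(column_names)} column names were given"
            )
    n_rows = len(table)
    return {
        name: sum(row[j] for row in table) / n_rows
        for j, name in enumerate(column_names)
    }
```

For example, `column_means([[1.0, 10.0], [3.0, 30.0]], ["a", "b"])` returns `{"a": 2.0, "b": 20.0}`, while a ragged table stops immediately with a message that names the offending row.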
Data organization in spreadsheets
(Broman and Woo, 2017)
- Be consistent
- Don’t use female/male and F/M in the same file
- Choose good names for things
- No spaces in column names (or anywhere)
- Avoid special characters
- Write dates as YYYY-MM-DD
- No empty cells
- Put just one thing in a cell
- Make it a rectangle
- Create a data dictionary
- No calculations in raw data files
- Don’t use font color or highlighting as data
- The file should be machine-readable, not human-readable
- Save data as text files
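Some of these rules can be checked automatically once the data are saved as text. The following Python sketch is an illustration only, not a tool from Broman and Woo; the function `check_table` and its `date_columns` parameter are hypothetical. It reports column names with spaces or special characters, rows that break the rectangle, empty cells, and dates that are not written as YYYY-MM-DD.

```python
import csv
import io
import re

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")
SAFE_NAME = re.compile(r"[A-Za-z0-9_]+")


def check_table(csv_text, date_columns=()):
    """Return a list of human-readable problems found in CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    problems = []
    for name in header:
        if not SAFE_NAME.fullmatch(name):
            problems.append(f"column name {name!r} contains spaces or special characters")
    # Line numbers count the header as line 1, as a text editor would.
    for line_number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(f"line {line_number}: {len(row)} cells, expected {len(header)}")
            continue
        for name, cell in zip(header, row):
            if cell.strip() == "":
                problems.append(f"line {line_number}: empty cell in column {name!r}")
            elif name in date_columns and not ISO_DATE.fullmatch(cell):
                problems.append(f"line {line_number}: date {cell!r} is not YYYY-MM-DD")
    return problems
```

A clean file returns an empty list. The check only tests the date format, not whether the date exists, and rules such as "create a data dictionary" or "no font color as data" still need a human reviewer.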