Properly Documenting Your Data

Documentation is somewhat frugal

In this module, users will learn about the different types of supplemental documentation that ensure data can be used properly in the future. Additionally, this module will introduce the concept of metadata schemas, which aid in indexing data so that they can be discovered. Discipline-specific examples will be provided.

  1. Reasons for documentation: What would someone who is unfamiliar with your data need in order to find, evaluate, understand, and reuse them? Your documentation should answer all of these questions, not only for others who may want to use your data, but also for yourself. You may revisit your own data 6 months, 1 year, or 5 years down the line, and proper documentation serves as a reminder to yourself. Additionally, good data documentation makes day-to-day collaboration easier, more efficient, and more accurate for the researchers with whom you are working.
  2. General best practices: Take thorough notes on everything you do. Record observations in the order in which they occur, record ID numbers for reagents and other solutions used in the lab, and break up your notes using headings. Keep a detailed lab or research notebook that is legible and in English (or the main language used by the PI and the lab) so that when you leave the lab, your successors can read and understand it. Additionally, make note of processes or protocols that failed to work, as well as those that succeeded.
  3. Study-level documentation vs. data-level documentation: This module covers two levels of documentation: study level and data level. Study-level documentation provides high-level information on the project design, data organization, collection methods, and data manipulations. Data-level documentation is known as metadata and is expressed through a variety of different schemas. Schemas may sometimes be discipline specific and aid in the indexing and discovery of data.
  4. Examples of discipline-agnostic study-level documentation: readme.txt

Research data and research notes will often live in two different locations, which can make the data difficult to interpret. A readme.txt file is a simple text file that accompanies a dataset and includes all of the information needed to understand the data. The file should contain the following information:

  1. Project name
  2. Project summary
  3. Previous work on the project and location of that information
  4. Funding information
  5. Primary contact information
  6. Your name and title (if you aren’t the primary contact)
  7. Other people working on the project
  8. Location of data and other supporting materials (including lab notebooks)
  9. Overall organization and naming conventions used for the data
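
As an illustration only, the readme.txt contents listed above could be generated from a small template so that no field is forgotten. The sketch below is a minimal example under stated assumptions: the key names (such as `project_name`), the function `build_readme`, and the sample values are hypothetical and not part of any standard.

```python
# Minimal sketch of a readme.txt generator.
# The labels follow the list in this module; the dictionary keys and the
# sample values are hypothetical choices made for this example.
README_FIELDS = [
    ("Project name", "project_name"),
    ("Project summary", "summary"),
    ("Previous work and where to find it", "previous_work"),
    ("Funding information", "funding"),
    ("Primary contact", "primary_contact"),
    ("Author name and title", "author"),
    ("Other project members", "team"),
    ("Location of data and supporting materials", "data_location"),
    ("Organization and naming conventions", "conventions"),
]


def build_readme(info):
    """Return readme text with one 'Label: value' line per field.

    Fields that are missing or empty are marked TODO so that gaps
    stay visible instead of being silently dropped.
    """
    lines = []
    for label, key in README_FIELDS:
        value = str(info.get(key, "")).strip() or "TODO"
        lines.append(f"{label}: {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values for demonstration only.
    example = {
        "project_name": "Example lake survey",
        "summary": "Weekly water temperature readings at two sites.",
        "primary_contact": "pi@example.org",
    }
    print(build_readme(example))
```

Writing the returned text to a file named readme.txt in the same folder as the dataset keeps the documentation and the data together.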

  1. Examples of discipline-agnostic study-level documentation: data dictionary. A data dictionary is a collection of the names, attributes, and definitions of the data elements used in your study. Including a data dictionary ensures that variables are used consistently across a group of researchers.
    1. Start by going through each variable in your dataset and recording what you know about it. This might include:
      1. Variable name
      2. Variable definition
      3. How the variable was measured
      4. Data units
      5. Data formats
      6. Minimum and maximum values for the given variable
      7. Coded values and their meaning
      8. Representation of null values
      9. Precision of measurements
      10. Any known issues with the data
      11. Relationships to other variables
      12. Other relevant details

(include an example on the slide)
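
One possible way to make a data dictionary concrete is to store it in a structured form and use it to check incoming records. The sketch below is a minimal example under stated assumptions: the variables `water_temp_c` and `site_code`, their ranges and codes, and the function `check_row` are all hypothetical and invented for illustration.

```python
# Minimal sketch of a data dictionary and a check that uses it.
# Every variable, range, and code below is a made-up example.
DATA_DICTIONARY = {
    "water_temp_c": {
        "definition": "Water temperature at 1 m depth",
        "method": "Handheld digital thermometer",
        "units": "degrees Celsius",
        "format": "decimal number",
        "min": 0.0,
        "max": 40.0,
        "precision": 0.1,
        "null_value": "NA",
    },
    "site_code": {
        "definition": "Sampling site",
        "format": "text code",
        "codes": {"N": "north shore", "S": "south shore"},
        "null_value": "NA",
    },
}


def check_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, value in row.items():
        entry = dictionary.get(name)
        if entry is None:
            problems.append(f"{name}: not documented in the data dictionary")
            continue
        if value == entry.get("null_value"):
            continue  # documented null representation, nothing else to check
        if "codes" in entry and value not in entry["codes"]:
            problems.append(f"{name}: undocumented code {value!r}")
        if "min" in entry or "max" in entry:
            try:
                number = float(value)
            except (TypeError, ValueError):
                problems.append(f"{name}: {value!r} is not a number")
                continue
            if "min" in entry and number < entry["min"]:
                problems.append(f"{name}: {number} is below the minimum")
            if "max" in entry and number > entry["max"]:
                problems.append(f"{name}: {number} is above the maximum")
    return problems


if __name__ == "__main__":
    print(check_row({"water_temp_c": "18.4", "site_code": "N"}))
    print(check_row({"water_temp_c": "55", "site_code": "X"}))
```

Keeping the dictionary in one structure means the same definitions serve both as human-readable documentation and as rules for catching undocumented variables, codes, or out-of-range values.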

  1. Metadata: Metadata is technically “data about data,” which is a broad, all-encompassing definition. Used here, however, it refers to a standard, highly structured form of digital documentation. Metadata facilitates the discovery, retrieval, understanding, and use of data in a standard manner. When you are working with a large amount of digital information, metadata is a better choice than general notes, and it is the preferred method when the documentation needs to be processed by a computer. Research notes, by contrast, are more flexible. Different metadata standards and schemas exist depending on your discipline (they are more often used in the social sciences).
  2. Metadata Basics
  3. Adopting a Schema
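
To show what "highly structured" can look like in practice, the sketch below builds a small metadata record and checks it for gaps. It is a minimal example under stated assumptions: the element names are borrowed from the Dublin Core element set, chosen here only as a familiar discipline-agnostic example; the list of required elements is a local choice for this sketch (Dublin Core itself does not mandate any element); and all values are placeholders.

```python
import json

# Elements treated as required in this sketch. This is an assumption made
# for the example, not a rule of the schema.
REQUIRED = ("title", "creator", "date", "identifier")

# Placeholder record; element names follow the Dublin Core element set.
record = {
    "title": "Example lake survey dataset",
    "creator": "Example Lab",
    "date": "2020-01-15",
    "identifier": "example-dataset-001",
    "description": "Weekly water temperature readings at two sites.",
    "format": "text/csv",
    "language": "en",
}


def missing_elements(record, required=REQUIRED):
    """Return the required element names that are absent or empty."""
    return [name for name in required if not record.get(name)]


def to_json(record):
    """Serialize the record in a stable, machine-readable form."""
    return json.dumps(record, indent=2, sort_keys=True)


if __name__ == "__main__":
    print("Missing:", missing_elements(record))
    print(to_json(record))
```

Because every record uses the same element names, a repository or script can index and search many datasets in the same way, which free-form research notes do not allow.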

Training series: 

Learning outcome: 

Users of this module will learn about the various types of documentation for data and understand why it is necessary for future use of the data set.