Will you be able to access your research data 20 years later?

by Chris Hoffman

I recently had the opportunity to ask myself this question when a colleague from Oxford visited my home in Berkeley. Peter Northover, a noted archaeometallurgist, had helped me with the metallographic analysis of a set of copper and bronze artifacts excavated from Bronze Age archaeological sites in Mallorca, Spain. As I prepared for his visit, I decided to leaf through my dissertation, something I had not done more than twice since filing in 1993. I was more than a little surprised to find myself wanting to reexamine the data. But could I even find the files? I knew I had backed up my dissertation and data. However, the zip drives and floppy disks in the back of my closet did not look very promising. In fact, I did not even have hardware at hand to read those media.

I suspected the files were somewhere on one of my current home computers. I tried to recall the history of the computers I had used since writing my dissertation on an Apple Macintosh SE. For a dozen or so years after graduating, I had used Windows, mainly because I was beginning a career working with data and databases. (At the time, Mac was not a good platform for that kind of work.) I had an old Windows computer out in the shed, but I was not optimistic that I would be able to get it running.

[Lesson in hindsight #1: Back up research data and files in an organized fashion. Leverage institutional and cloud resources that will persist beyond your local computer. Do not rely on external media.]

Could the files be somewhere on my current Mac laptop? I found a promising folder called "Dellbak", hopefully a backup of files from a Dell PC. Indeed, inside that folder was one called "Diss", and it contained a series of folders that housed the research data and the dissertation documents. The folder names made sense, but multiple episodes of copying from one computer to the next and across platforms had hopelessly jumbled the file names; one, for example, had become "!DAVIS19.3(B".

[Lesson in hindsight #2: Create a README file in your directories to document the organization of files. Use descriptive file names and folder names.]
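One way to start the kind of README described in this lesson is to generate a plain-text inventory of a folder and then annotate it by hand. The following is only a minimal sketch in Python; the function name `build_manifest` and the "TODO: describe" placeholder are illustrative assumptions, not part of any standard tool.

```python
from pathlib import Path


def build_manifest(root):
    """Return README-style lines listing every file under root with its size.

    The description column is left as a placeholder to be filled in by hand.
    """
    root = Path(root)
    lines = [f"Contents of {root.name}", ""]
    # Walk the whole tree and keep only regular files, in a stable order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        size = path.stat().st_size
        lines.append(f"{rel}\t{size} bytes\tTODO: describe")
    return lines
```

Writing the returned lines to a README.txt in the same folder gives a starting point; the value comes from replacing each placeholder with a sentence about what the file contains.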

What applications would open these files? Double-clicking did not open any of them. I knew I had used early versions of Microsoft Word and Excel and FileMaker Pro to create most of the files. However, some of the folder names reminded me that I had used other programs. I suspected I might be able to reconstruct some of the S files that had been used for some basic statistics, and I resolved to look into that later. I also had a series of scatterplots built in some program called Cricket. Those would almost certainly need to be reconstructed from the raw data.

[Lesson in hindsight #3: When possible, use standard software. Save output files into other formats such as text and PDF that are more likely to withstand the test of time.]

Most of the data files had been created in Excel or FileMaker Pro. Though double-clicking did not open the files, I was pleasantly surprised to learn that the latest version of Excel on my Mac laptop was able to open the files that had been created in the earlier version of Excel. I just had to discover which files were Excel files by trial and error, using the Open command in the File menu. Similarly, I discovered that the current version of Word on my laptop was able to open the corresponding files from my dissertation. The FileMaker Pro database would pose additional challenges. Without a version of the popular application on my current computer, I could not even attempt to open the old research database. Fortunately, I saw that I had exported the data from the database into a tab-delimited text file, and that file was perfectly compatible with my text editor and Excel. I guess I had done one thing right! So far, I estimate that I have been able to access nearly all of my original research data. As I compared those data to the tables and charts in the dissertation, however, I came across another problem. Although I might now have most of the raw data, it was not completely clear how some of the summarized results selected for my dissertation had been generated. The captions in the dissertation were only partly helpful.
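The trial-and-error step described above, working out which application might open a file whose name no longer gives any hint, can sometimes be shortened by looking at a file's first few bytes. The sketch below is only an illustration in Python, and the function names are hypothetical. It checks for a handful of well-known signatures; files written by early-1990s applications may predate these formats and simply come back as "unknown".

```python
# Signature of the OLE2 compound file container used by many older
# Microsoft Office documents. Very early Word and Excel files may not use it.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def guess_format(header: bytes) -> str:
    """Make a rough guess at a file's type from its leading bytes."""
    if header.startswith(OLE2_MAGIC):
        return "ole2 (possibly older Word or Excel)"
    if header.startswith(b"%PDF"):
        return "pdf"
    if header.startswith(b"PK\x03\x04"):
        return "zip (possibly .docx or .xlsx)"
    # Tab, newline, carriage return, or printable ASCII only: probably text.
    if header and all(b in (9, 10, 13) or 32 <= b < 127 for b in header):
        return "plain text"
    return "unknown"


def sniff_file(path, n=512):
    """Read the first n bytes of a file and guess its format."""
    with open(path, "rb") as f:
        return guess_format(f.read(n))
```

A guess of this kind only narrows the search; the file still has to be opened in a candidate application to confirm it.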

[Lesson in hindsight #4: Document how derivative data sets and summaries are produced. Use programs that generate code that can be rerun to produce important results.]
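As an illustration of this lesson, a summary table can be produced by a short script that reads the raw tab-delimited export and can be rerun at any time. This is a sketch under stated assumptions: the column names `site` and `tin_pct` and the sample values are invented for the example and do not come from the dissertation data.

```python
import csv
import io
from collections import defaultdict


def mean_by_group(tsv_file, group_col, value_col):
    """Return the mean of value_col for each distinct value of group_col.

    tsv_file is any open text file (or file-like object) holding
    tab-delimited data with a header row. Blank values are skipped.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    values = defaultdict(list)
    for row in reader:
        raw = (row[value_col] or "").strip()
        if raw:
            values[row[group_col]].append(float(raw))
    return {group: sum(v) / len(v) for group, v in values.items()}


# Invented sample data standing in for a tab-delimited export.
sample = io.StringIO("site\ttin_pct\nA\t8.0\nA\t10.0\nB\t2.5\n")
print(mean_by_group(sample, "site", "tin_pct"))  # {'A': 9.0, 'B': 2.5}
```

Keeping a script like this next to the raw data records exactly how each summarized figure was derived, which a caption alone rarely does.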

Now that I have relatively good access to my dissertation data, I can examine some of the questions that came up when I reviewed my dissertation with my Oxford collaborator. I was pleased and a little surprised to discover that my research data were still relevant. As I reviewed the more recent literature on prehistoric metallurgy, I saw that my data were still unique and could contribute to discussions about the early development of this technology in Mediterranean Europe and beyond. Researchers often believe that their old data must be surpassed by newer data, but this is not always the case. There is potential research value in those older data sets (even from dissertations!), but first you need to be able to access those data files and make sense of them. However, there is one more important step. In order to unlock the research potential of your data, you must also provide access so that others can find and reuse your data. That might not have been a major emphasis in 1993, but funding agencies and the research community in general now stress it.

[Lesson in hindsight #5: Publish your well-documented data in a discoverable repository or resource such as DASH.]

Through its consulting services and online RDM guide, UC Berkeley's Research Data Management program hopes to help campus researchers ensure the greatest potential of their research data for years to come.