Migrating half a million Hearst Museum images to Box

The Phoebe A. Hearst Museum of Anthropology (PAHMA) recently found itself with over half a million digital catalog card images that were in active use but needed to be duplicated in order to preserve them in a redundant, reliable archive. Copying a few hundred, or even a few thousand, files is a relatively straightforward task. Ensuring that 527,000 files are successfully copied in a reasonable period of time, without requiring constant attention, is trickier. Research IT worked with PAHMA’s Dr. Michael Black to identify a suitable repository and accomplish the migration, and in doing so identified patterns other researchers and organizations can use to accomplish similar tasks.

“Active Data Management” is the set of practices and resources available to researchers for handling data during the active phases of a research project. The Research Data Management (RDM) Program suggests a number of options and resources for active data management, from which Dr. Black’s team selected the cloud-hosted Box platform, supported by UC Berkeley’s IST. Box was the right choice for PAHMA because it supports multiple levels of access control, provides a simple web client for data access, and is available free of charge and with no limit on data storage for university staff.

A number of file transfer tools are available to Berkeley researchers, but migrating hundreds of thousands of files into Box presented challenges for the Hearst Museum team. It turns out that few tools are suited to transferring large numbers of files in a complex directory structure over the extended period of time necessary to move terabytes of data across the campus network and commercial internet. After considering the options, Research IT’s consultants recommended the command-line utility lftp as the best among a set of imperfect options, chiefly because its mirror feature enables transfer of multiple files in parallel, and because lftp is easily scripted. On the other hand, lftp lacks the robust validation features and error reporting of powerful data transfer platforms like Globus, and Globus does not currently support data transfer to Box. Two further complications arose: Box's recommended limits for an FTP transfer session fall short of the quantity of data the PAHMA team needed to migrate, and lftp can have issues with file or folder names that contain special characters like those found in the PAHMA data. To work around these limitations, the transfer was divided into sets of smaller transfers, each carefully validated.
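
The divide-and-validate approach described above can be illustrated with a minimal Python sketch. This is not the script the PAHMA team used. The function names, the per-batch limits, and the idea of validating by comparing file sizes against a remote listing are all assumptions made for illustration, and the article does not state Box's actual recommended session limits.

```python
def plan_batches(files, max_files, max_bytes):
    """Group (path, size) pairs into batches that respect both limits.

    max_files and max_bytes are hypothetical per-session limits; a single
    file larger than max_bytes still gets a batch of its own.
    """
    batches, current, current_bytes = [], [], 0
    for path, size in files:
        # Start a new batch when adding this file would exceed a limit.
        if current and (len(current) >= max_files
                        or current_bytes + size > max_bytes):
            batches.append(current)
            current, current_bytes = [], 0
        current.append((path, size))
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def failed_transfers(batch, remote_sizes):
    """Return paths whose size is missing or different on the remote side.

    remote_sizes is assumed to be a dict of path -> size built from a
    listing of the destination after the batch has been transferred.
    """
    return [path for path, size in batch if remote_sizes.get(path) != size]
```

Each batch would be handed to the transfer tool in turn, and any paths reported by `failed_transfers` retried before moving on. A size comparison only catches missing or truncated files; a checksum comparison would be stricter where the destination makes one available.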

The PAHMA data set was deduplicated to limit unnecessary transfers, after which approximately 527,000 files were transferred into a Box account owned by a UC Berkeley special purpose account (SPA). Some of the workflow performed in the course of this transfer was manually administered, but Research IT’s consulting team is exploring scripting options that can reduce the need for ongoing human intervention in a transfer of this size and complexity.
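
The article does not say how duplicates were identified in the PAHMA data set. One common approach, sketched here purely as an assumption, is to treat files with identical SHA-256 content hashes as duplicates and transfer only the first copy found. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_duplicates(root):
    """Walk root and separate unique files from byte-identical copies.

    Returns (unique, duplicates), where duplicates is a list of
    (duplicate_path, first_seen_path) pairs. Paths are visited in sorted
    order so the result is deterministic.
    """
    first_seen = {}
    duplicates = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = file_digest(path)
        if digest in first_seen:
            duplicates.append((path, first_seen[digest]))
        else:
            first_seen[digest] = path
    return sorted(first_seen.values()), duplicates
```

Hashing half a million images means reading every byte once, so on a large collection it may be worth grouping files by size first and hashing only those that share a size with another file.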

Researchers who are considering using Box as a repository for large numbers of files in complex folder structures should consider that the Box web client is not well suited for managing large file sets. Only a handful of files are listed on each page. Tagging, which is used to facilitate filtering, must be applied manually to each individual file.

Research IT can assist campus researchers and organizations with large or complex data transfer tasks by consulting to analyze the issues involved, then advising on best practices and patterns discovered in the course of studying or assisting with prior data transfer projects. Research IT continues to refine in-house scripts, and is actively monitoring development and availability of open-source and commercial tools that facilitate using Berkeley’s unlimited storage offering on Box as an active data repository.

Contact Research IT to request a consultation, or if you have questions about data management or the tools described in this article.


The work described in this article was supported by a CC*DNI CI Engineer grant generously funded by the National Science Foundation.