This question, a composite case based on recent consulting requests, was posed to a staff member at UC Berkeley's Information Security and Policy office.
Background: We have sensitive data stored (more or less) securely on CalShare, Box, etc. We now want to analyze it for our research. The compute environment for the analysis varies: it might be our laptop, a departmental server, a departmental cluster, the campus's Savio HPC cluster, Amazon Web Services, an XSEDE-provided computational center, and so on.
Q. What are best practices for ensuring that data remains secure during - and after - data analysis?
A. The answer to this question is "it depends."
For research data, it is important to adhere to the data use agreements covering any data you receive from external entities. Before placing the data on off-site systems such as AWS, make sure your data access agreement allows you to do so.
As far as best practices go, the Minimum Security Standards for Electronic Information (MSSEI) would be your best resource -- in particular, MSSEI control 15, Data Loss Prevention.
Of course, any research data classified as PL1 or PL2 would need to meet all the MSSEI requirements for those protection levels.
When you're moving the data around on clusters and departmental servers, use strong encryption whenever possible, both when storing and when transmitting the data. And a really important step: remove the data from these resources when your analysis is done. Too often, breaches of sensitive data occur because stale or archived data was left on a server or endpoint long after a project was completed.
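To make the cleanup step concrete, here is a minimal Python sketch (the filename is hypothetical; this is an illustration, not a campus-endorsed tool) that overwrites a temporary analysis extract with random bytes before deleting it. Note the caveat in the comments: on SSDs and journaling or copy-on-write filesystems, overwriting is best-effort only, and full-disk encryption with key destruction is the more reliable approach when retiring media.

```python
import os
import secrets

def overwrite_and_remove(path, passes=3):
    """Overwrite a file with random bytes, then delete it.

    Best-effort only: on SSDs and on journaling/copy-on-write
    filesystems, old blocks may survive the overwrite. Prefer
    full-disk encryption plus key destruction when retiring media.
    """
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(secrets.token_bytes(size))
            f.flush()
            os.fsync(f.fileno())  # push the overwrite to disk
    os.remove(path)

# Example: scrub a temporary extract once the analysis is done.
with open("extract.csv", "w") as f:
    f.write("id,value\nA17,42\n")

overwrite_and_remove("extract.csv")
print(os.path.exists("extract.csv"))  # False
```

The same habit applies to scratch directories on shared clusters: build the deletion step into your analysis workflow rather than leaving it for later.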
Lastly, I think an extremely important concept that applies to any work with sensitive data is to have "separation from high-risk activities". While we have a policy with this language for the use of administrative accounts (see MSSEI 10.2), I would suggest applying it to any work with sensitive data.
That is, avoid web browsing, email, Skype, etc. when working with sensitive data, and use a separate machine for such work whenever possible. We're seeing a lot of research environments move to Citrix or other virtualized environments for this reason. Endpoints pose significant risk (e.g., your system is compromised by a Flash 0-day exploit after you watched a video clip on your break, on the same machine where all your data is stored).
Q. The data we were provided for our research uses coding: the humans or other entities are identified via codes that ostensibly obscure their identities, so that they cannot be identified without a 'code key' associating those codes with the actual entities. To what extent does that lessen -- or not -- the degree to which the data needs to be protected?
A. The answer here, again: it depends. Our general guidance is that the code key for de-identified data should be treated like a credential. It should be stored securely (using encryption whenever possible) and separately from the de-identified data -- on a different device with strict access controls, or offline entirely.
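As a concrete illustration of treating the code key like a credential, here is a minimal Python sketch (the keyed-hash scheme and all names are illustrative assumptions, not a campus standard) in which identifiers are replaced by HMAC-derived codes. The secret key is the "code key": kept apart from the coded data set, it is the only thing that lets codes be regenerated and linked back to identifiers.

```python
import hashlib
import hmac
import secrets

def make_code(code_key: bytes, identifier: str) -> str:
    """Derive a stable pseudonymous code for an identifier.

    Without the code key, a holder of the coded data cannot
    reverse the codes (assuming the identifier space is not
    small enough to enumerate and check).
    """
    digest = hmac.new(code_key, identifier.encode(), hashlib.sha256)
    return digest.hexdigest()[:12]

# The code key: store it encrypted, on a separate system with
# strict access controls (or offline) -- never alongside the
# coded data set.
code_key = secrets.token_bytes(32)

identifiers = ["555-12-3456", "555-98-7654"]  # hypothetical records
coded = [make_code(code_key, i) for i in identifiers]

# Codes are stable (same input, same code) yet opaque without the key.
assert coded[0] == make_code(code_key, identifiers[0])
assert coded[0] != coded[1]
```

One design note: a keyed hash is preferable to a plain hash here, because a plain hash of a guessable identifier (such as an SSN) can be reversed by brute force; with a keyed scheme, re-identification requires the key itself.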
Certainly, if the de-identification is done sufficiently, then the extent to which the data needs to be protected is lessened. We leave it up to the researchers to decide what they need to do with the de-identified data, such as whether to move it to a laptop for analysis.
Keep in mind that data access/usage agreements with external entities vary quite a bit, and many limit what can be done even with de-identified data. Check those agreements thoroughly before deciding what to do with a de-identified data set -- they may be more restrictive than the campus data classification and minimum security standards would allow (e.g., you've properly de-identified PL1 data and want to put it on Box, but your data provider's usage agreement explicitly states the data must *not* be stored on the Internet or with an external service provider).
If you have questions about specific data sets, please let us know and we can try to get those answers for you. For instance, properly de-identified Social Security Numbers would not trigger a breach notification requirement.
Please contact RDM Consulting (firstname.lastname@example.org) to help you with your questions about sensitive and restricted data.