Finding Data for Your Next Project

Guide to Finding Data at Berkeley

Context

Usage context is critical.

  • Have a firm grasp of your research topic, the claim (or hypothesis) you’re trying to make and the type of evidence you need to support your claim.
  • Understand whether your analysis will be purely descriptive (summaries, averages, simple tables, graphs of the data at hand) or inferential (using statistical methods to make inferences broader than the data at hand).
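
To make the descriptive/inferential distinction concrete, here is a minimal Python sketch. The sample values are made up for illustration, the function names are hypothetical, and the interval assumes a simple random sample and a normal approximation, which may not suit your data.

```python
import math
import statistics

# Hypothetical sample: made-up values used only for illustration.
sample = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 12.0, 15.0]


def describe(values):
    """Descriptive: summarize only the data at hand."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


def mean_confidence_interval(values, z=1.96):
    """Inferential: a rough 95% interval for the wider population mean.

    Assumes a simple random sample and a normal approximation (z = 1.96).
    With a sample this small, a t-distribution would be more appropriate.
    """
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    return (mean - z * std_err, mean + z * std_err)


summary = describe(sample)
interval = mean_confidence_interval(sample)
print(summary)
print(interval)
```

The first function says something only about these eight observations. The second makes a claim about a larger population, so it depends on how the data were sampled.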

Dimensions

Clearly specify the dimensions of your analysis.

  • Examples of typical dimensions:
    • Unit of Analysis (e.g. individuals, households, companies, counties, states, nations)
    • Geography (e.g. all European countries)
    • Time Period (e.g. 1980-2006)
    • Frequency (e.g., annual, quarterly)
    • Restrictions (not always apparent; start planning as early as possible for IRB approval, application processes for restricted data such as Census research data, and secure storage needs)
  • Accept that modification of the ideal dimensions may be necessary. Examples include data that is a few years old, data that has been aggregated at a broader level, or proxy variables (e.g. per capita GDP as a measure of “standard of living”).
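
One way to pin down the dimensions listed above is to write them out explicitly and check candidate records against them. The sketch below is only an illustration: the `Dimensions` class, the field names, the placeholder country codes, and all numbers are hypothetical and not drawn from any real dataset. It also shows the per capita GDP proxy as simple arithmetic (GDP divided by population).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimensions:
    """Hypothetical container for the dimensions of an analysis."""
    unit: str
    geography: frozenset
    start_year: int
    end_year: int
    frequency: str


def matches(record, dims):
    """Return True if a record falls inside the geography and time period."""
    in_geography = record["country"] in dims.geography
    in_period = dims.start_year <= record["year"] <= dims.end_year
    return in_geography and in_period


def gdp_per_capita(record):
    """Proxy variable: GDP divided by population."""
    return record["gdp"] / record["population"]


# Made-up records with placeholder country codes "A", "B", "C".
records = [
    {"country": "A", "year": 1979, "gdp": 100.0, "population": 10.0},
    {"country": "A", "year": 1980, "gdp": 120.0, "population": 10.0},
    {"country": "B", "year": 1990, "gdp": 300.0, "population": 20.0},
    {"country": "C", "year": 1990, "gdp": 50.0, "population": 5.0},
]

dims = Dimensions(
    unit="nation",
    geography=frozenset({"A", "B"}),
    start_year=1980,
    end_year=2006,
    frequency="annual",
)

selected = [r for r in records if matches(r, dims)]
proxy = {(r["country"], r["year"]): gdp_per_capita(r) for r in selected}
print(proxy)
```

Writing the dimensions down this way makes it easier to see which ones you are relaxing when the ideal data are not available.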

Search

Search Strategy

  • Conduct a literature search to determine what datasets were used in past research on the same topic, but be aware of the limitations of this approach.
    • Citation of data is not as standardized as bibliographic citations and is often incomplete or ambiguous.
    • The datasets used in published research aren't always available for other researchers. Reasons include:
      • Legitimate issues like confidentiality (e.g. patient data) or copyright.
      • Intentional decisions not to publish data due to such factors as lack of incentive or the absence of a culture of sharing in a discipline.
    • Don't rule out personally contacting a researcher to inquire about the availability of their data.
  • Seek help from a campus librarian or consultant.
  • Search the web (or library databases) directly to find relevant data.
    • When searching, carefully consider who is likely to collect the type of data you want and how it was likely collected. Examples of who might collect data are academic researchers, government agencies, NGOs, IGOs, or think tanks. Typical collection methods include surveys, administrative records, lab experiments, or environmental sensors.
    • Try searching a specific research data repository.
    • Don’t neglect your library collection (and specialist librarians).
    • Novel data acquisition practices like “web-scraping” are becoming popular, but make sure you understand the copyright or licensing terms of the source before using them.
  • Develop a systematic approach to assessing discovered datasets. It should include:
    • Assessment of accompanying documentation (Is the documentation adequate…?)
    • Computing requirements for initial assessment of the data (Can you open the file?)
    • Identification of basic file characteristics like format, record layout, and size.
    • Usage restrictions and approval requirements.
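
As a rough illustration of the first-pass checks above (can you open the file, and what are its format, record layout, and size?), here is a minimal Python sketch using only the standard library. The `assess_file` helper and the sample file contents are hypothetical, and the delimiter guess is a simple heuristic that assumes a delimited text file with a header row. It does not replace reading the documentation.

```python
import csv
import os
import tempfile


def assess_file(path, candidates=(",", "\t", ";", "|")):
    """Report basic characteristics of a (presumed) delimited text file."""
    size_bytes = os.path.getsize(path)
    extension = os.path.splitext(path)[1].lower()

    # Read only the first line, so large files are cheap to inspect.
    with open(path, "r", encoding="utf-8", errors="replace", newline="") as handle:
        first_line = handle.readline().rstrip("\r\n")

    # Heuristic: pick the candidate delimiter that occurs most often.
    counts = {d: first_line.count(d) for d in candidates}
    delimiter = max(counts, key=counts.get)
    if counts[delimiter] == 0:
        delimiter = None
        fields = [first_line]
    else:
        fields = next(csv.reader([first_line], delimiter=delimiter))

    return {
        "size_bytes": size_bytes,
        "extension": extension,
        "delimiter": delimiter,
        "fields": fields,
    }


# Demonstration on a small temporary file with made-up contents.
content = "id,year,value\n1,1980,3.5\n2,1981,4.0\n"
with tempfile.NamedTemporaryFile(
    mode="w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write(content)
    tmp_path = tmp.name

report = assess_file(tmp_path)
os.remove(tmp_path)
print(report)
```

If the reported fields do not match the variables described in the documentation, that is an early sign that the file or its documentation needs a closer look.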

Campus Resources

There are a number of resources for finding data online and on campus. The list below is not exhaustive, but provides pointers to data repositories, guides, and campus resources where you can get help finding data.

DASH - Dash is a UC-based self-service tool where you can discover data sets that have been uploaded and shared by other researchers.

Hathi Trust Research Center - The Hathi Trust Research Center (HTRC) provides research access to the public domain text corpus of the HathiTrust Digital Library. The HTRC provides an infrastructure to search, collect, analyze, and visualize the full text of nearly 3 million public domain works and is intended for nonprofit and educational researchers.

Digital Public Library of America - The Digital Public Library of America (DPLA) maintains an open API to encourage use of data contained in the DPLA platform of close to 12 million items (and growing) which range from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science.

UCB Libraries Data Lab - The Library Data Lab offers consultations on research involving numeric data, including finding and recommending data sources and advising on technical data issues such as file format conversion, web scraping, and basic statistical software use.

D-Lab Data Resources - The D-Lab helps Berkeley faculty, staff, and graduate students move forward with world-class research in data intensive social science. UC Data, which is now part of D-Lab, provides access to a broad range of computerized social science data to faculty, staff, and students at UC Berkeley, and helps researchers understand the content and context of social science data, including geography, weighting, complex designs, and missing data.

GeoData@UC Berkeley - GeoData@UC Berkeley is the UC Berkeley Libraries' geoportal, where users can search, preview, display, map, and download geospatial data in a variety of formats including shapefiles, KML, and raster data formats. It is part of the OpenGeoportal project.

Library data collections - The Library provides access to extensive databases and electronic resources on various subjects. Check out the subject guides in your discipline for information on resources available to you.

Earth Sciences Library - The Earth Sciences and Map Library supports the teaching, research, and learning needs of the Department of Earth and Planetary Science, Department of Geography, and Seismological Laboratory. They can help you locate data on topics in physical geography and the geosciences including structural geology, tectonics, oceanography, seismology, geochemistry, glaciology, geophysics, atmospheric science, planetary science, geomorphology, climatology, and cartography.

Geospatial Innovation Facility Data Resources - The Geospatial Innovation Facility (GIF) at UC Berkeley's College of Natural Resources provides leadership and training across a broad array of integrated mapping technologies, including analysis and visualization of spatial data, application development, state-of-the-art geospatial and web technologies, and opportunities for researchers to learn how they can use spatial data.

Haas School Business Library - The Thomas J. Long Business Library is a hub for business information at UC Berkeley. They source the highest quality resources -- academic and professional, print and online -- for conducting business research. A reference librarian is available to help you find the data you need.