Datasets for Building a Data Analysis Portfolio

I recently had the pleasure of attending the 2017 Association of Public Data Users (APDU) Conference.

My favorite part of the conference was talking to people who work with federal data on a daily basis. Overall, I found people to be passionate about their work and eager to share information about it.

I know many of my readers are looking for interesting datasets to use in their portfolios, so I decided to publish a list of some of the most interesting datasets I learned about.

IRS Statistics

One of the most enjoyable conversations I had was with Kevin Pierce, an economist with the IRS. When I first learned where Kevin worked, I wanted to run away. However, we wound up having a fascinating conversation about IRS data.

Kevin works in the IRS’s Statistics of Income (SOI) program. As far as I can tell, SOI data is the highest-quality data on US income that’s available. This is simply because it comes straight from the tax returns that Americans are legally required to file accurately every year.

I was surprised to learn that the SOI publishes a great deal of this data. As an example, here is their page dedicated to data from Form 1040. They also aggregate this data by state, county and ZIP code, so it is possible to map the data.

Kevin works on the IRS’s migration reports. Because the IRS knows each filer’s address and income each year, they can analyze migration and the financial impact it has.
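
To give a feel for the kind of analysis this makes possible, here is a minimal Python sketch that computes net migration and net income flow per state from a few origin/destination records. The record layout, the function name and all of the numbers are made up for illustration; this is not the actual SOI file format, so treat it as a starting point rather than a recipe.

```python
from collections import defaultdict

# Hypothetical records: (origin_state, destination_state, returns, total_income).
# All values are invented for illustration; real SOI files use their own layout.
flows = [
    ("CA", "TX", 100, 6_000_000),
    ("TX", "CA", 60, 4_200_000),
    ("NY", "FL", 80, 7_200_000),
    ("FL", "NY", 30, 1_500_000),
]

def net_migration(records):
    """Return {state: (net_returns, net_income)}, computed as inflow minus outflow."""
    net = defaultdict(lambda: [0, 0])
    for origin, dest, returns, income in records:
        # The destination state gains the movers and their income...
        net[dest][0] += returns
        net[dest][1] += income
        # ...and the origin state loses them.
        net[origin][0] -= returns
        net[origin][1] -= income
    return {state: tuple(values) for state, values in net.items()}

result = net_migration(flows)
for state in sorted(result):
    returns, income = result[state]
    print(f"{state}: net returns {returns:+d}, net income {income:+,d}")
```

A table like this, joined to state or county boundaries, is the sort of thing that could be mapped directly.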

Any portfolio that focuses on income data from the IRS is sure to get a lot of attention!

Vital Statistics

Charles Rothwell, the Director of the National Center for Health Statistics, appeared on a panel titled “Federal Statistical Agency Leadership”. Charles is a gifted public speaker and I really enjoyed his presentation.

Charles works with “Vital Statistics”, which involves counting births and deaths. Normally I would shy away from a dataset like this. But as Charles pointed out, this data is necessary if you want to understand the opioid epidemic that the US is currently facing.

A portfolio that focused on using this dataset to explore the opioid epidemic would be fascinating to read.

Labor Statistics

Michael Dalton, a research economist at the Bureau of Labor Statistics (BLS), spoke on a panel about the Role of Commercial Firms in Public Data. I found his case studies to be very interesting, and after his talk we chatted for a bit. I asked him which BLS statistics he thought would be good for a data analysis student who is interested in employment data. He had several recommendations:

These statistics will tell you the types of jobs that people in the US have, as well as the amount that people in those occupations earn. If I had the time, I’d love to analyze the growth in the number of tech workers in the Bay Area over time.

Of course, BLS also releases statistics on unemployment. (Note that I have already packaged up some of that data in the rUnemploymentData package (1, 2).)

Michael also recommended checking out IPUMS, which is a resource I also heard about at the ACS Data Users Conference earlier this year.

Energy Data

I also had the pleasure of meeting Chip Berry of the Energy Information Administration (EIA). Chip manages the Residential Energy Consumption Survey (RECS). I was not previously aware of the EIA, and it turns out that they have a ton of interesting data. For example, they have real-time information about energy supply and demand nationwide. They also know the location of each and every energy production facility in the US.

As I write this, much of Florida is still without power due to Hurricane Irma. If you were interested in researching this (or any other energy-related topic), this data would be a great place to start.

Closing Thoughts

In my experience, the more specialized a portfolio is, the easier it is for the portfolio to get traction. Each of the datasets I link to above could easily form the cornerstone of a successful data-related portfolio.