Until someone bigger and better does it, I’m making available a periodically-updated file of historic US vaccination statistics based on data from the CDC:
This is all running on my personal desktop PC right now, and I maintain the code in my spare time, so I make no promises, especially if the CDC changes the format of this undocumented data source.
The CDC website displays vaccine statistics here:
You’ll notice that page only displays current data. I have a crawler script that hits the back-end data source for that page several times per day.
The CDC data provides a run_id field, which I assume represents reporting runs from the CDC’s databases. I’m building up a collection of historic snapshots by saving a new snapshot every time the script notices that the run_id date of the first line item has changed. I started collection on January 2, 2021, but some early data has been lost.
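The change check can be sketched roughly like this. This is a minimal illustration, not the actual script; the key names `vaccination_data` and `runid` are assumptions, since the source is undocumented:

```python
def snapshot_changed(new_data, last_saved):
    """Return True when the first line item's run_id differs from the
    previously saved snapshot. Field names are assumptions: the CDC
    back-end source is undocumented and may change without notice."""
    if not last_saved:
        # Nothing saved yet, so treat the download as a new snapshot.
        return True
    new_first = new_data["vaccination_data"][0]
    old_first = last_saved["vaccination_data"][0]
    return new_first.get("runid") != old_first.get("runid")
```

When the check returns True, the script writes the download to disk as a new dated snapshot; otherwise it discards it.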
After downloading the latest snapshot, the script loads all the saved snapshots and combines them into a single data set, removes duplicates (records having the same Location and Date), and normalizes the date representation (don’t ask). Other than that, this is the data I scraped from the CDC as they presented it. If something is missing, it’s because I didn’t get it.
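The combine step amounts to a dedup on (Location, Date) plus a date cleanup. Here’s a hedged sketch of how that could look; the specific date formats handled, and the choice to keep the first record seen for each key, are assumptions for illustration:

```python
from datetime import datetime


def normalize_date(raw):
    """Coerce the date formats seen in snapshots to ISO 8601.
    The list of formats here is an assumption, not exhaustive."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%Y/%m/%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw  # leave unrecognized values untouched


def combine_snapshots(snapshots):
    """Merge all saved snapshots into one data set, keeping the first
    record seen for each (Location, Date) pair."""
    seen = set()
    combined = []
    for snap in snapshots:
        for record in snap:
            record = dict(record, Date=normalize_date(record["Date"]))
            key = (record["Location"], record["Date"])
            if key not in seen:
                seen.add(key)
                combined.append(record)
    return combined
```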
After processing, the data is formatted as JSON and CSV and uploaded to the cloud for sharing.
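The export step is straightforward with the standard library. A sketch, assuming the combined data is a list of flat dicts (file names and column ordering here are illustrative, not the real ones):

```python
import csv
import json


def export(records, json_path, csv_path):
    """Write the combined records as JSON and CSV. Sorting the union of
    keys gives a stable CSV header even if some records lack fields."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    fieldnames = sorted({k for r in records for k in r})
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)
```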
This is an undocumented back-end data source, so there’s no official field reference; most of the fields, however, appear to be self-explanatory.
2021-01-02: I started collecting data.
2021-01-07: Breakout by manufacturer added.
2021-01-12: Breakout by dose (1 or 2) added.
2021-01-13: Manufacturer and dose 2 partially removed, sometimes.
2021-02-10: Noticed that my back-end CDC data source had started reusing run numbers, which caused some old data to be lost. I switched to checking the date of the first data item to detect a file change. I’ll see how that works.
- Added .csv version of the data.
- Content-Type now set correctly.