The media and various election analysis organizations have called this election for Biden, but some Trump supporters have been alleging fraud based on irregularities in the vote counting process. I don’t know a lot about the post-election tabulation process, but I know a thing or two about raw data.
I spent about 10 years of my life working for a company that provided outsourcing services for employee benefits administration. You know how every year you have the opportunity to change your medical insurance, dental plan, life insurance, and so on? For a fee, this company would handle all that on behalf of moderately large companies, so their Human Resources department wouldn’t have to deal with all the details.
In the early years, part of my job was to provide software and SQL scripts to support the process of onboarding new clients who had been using some other benefits enrollment system. The clients would generate export files or reports or whatever they could get from the system they were currently using, and we would have to upload that into our system. We also had to upload coverage and pricing information from insurance vendors, so the system could present employees with accurate pricing for their benefit choices.
It was a mess. No matter how strenuously the client insisted their data was clean, our import process would inevitably kick out records that had problems, and we would send the customer a variety of exception reports explaining what needed to be fixed.
- “Why do these six employees have spouses of the same sex?” (This was before same-sex marriage was legal and recognized by insurance plans.)
- “Fourteen people have spousal medical coverage for a dependent coded as a child. For each one, which is wrong? The spousal coverage level or the child relationship?”
- “These employees are in Group A and the Premier coverage level, but the insurance pricing file doesn’t include Premier coverage for Group A employees. Do we downgrade their coverage, or do you need to get us a new insurance pricing file?”
- “This person has two children with the same name and birthdate. Can we assume they are duplicates or do you want to check with the employee?”
And so on.
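To make the idea concrete, here is a minimal sketch, in Python, of the kind of consistency checks an import process might run. The record layout and field names here are invented for illustration (the real system was built on SQL scripts, not this code):

```python
from collections import Counter

def find_exceptions(records):
    """Flag inconsistencies of the kind a benefits import might kick out.

    Each record is a dict describing one dependent's enrollment:
    employee_id, name, birthdate, relationship, coverage.
    """
    exceptions = []

    # Spousal coverage attached to a dependent coded as a child:
    # one of the two fields is wrong, but only the client knows which.
    for r in records:
        if r["coverage"] == "spousal" and r["relationship"] == "child":
            exceptions.append(("coverage/relationship mismatch", r))

    # Two dependents of one employee with the same name and birthdate:
    # probably duplicates, but worth confirming with the employee.
    key_counts = Counter(
        (r["employee_id"], r["name"], r["birthdate"]) for r in records
    )
    for r in records:
        if key_counts[(r["employee_id"], r["name"], r["birthdate"])] > 1:
            exceptions.append(("possible duplicate dependent", r))

    return exceptions

# Invented sample data exercising both checks.
records = [
    {"employee_id": 1, "name": "Sam", "birthdate": "2010-04-01",
     "relationship": "child", "coverage": "spousal"},
    {"employee_id": 2, "name": "Alex", "birthdate": "2012-07-15",
     "relationship": "child", "coverage": "child"},
    {"employee_id": 2, "name": "Alex", "birthdate": "2012-07-15",
     "relationship": "child", "coverage": "child"},
]

for reason, record in find_exceptions(records):
    print(reason, "- employee", record["employee_id"])
```

The important design point is that none of these checks can be resolved automatically; each one produces a question that has to go back to the client on an exception report.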
Keep in mind that this wasn’t raw, hand-entered data. It had been exported from another computerized benefits administration record-keeping system, which presumably had data integrity checks of its own. And still there were almost always exceptions, where people had inadvertently snuck bad data into the system.
More recently, I’ve been processing data about the Covid-19 pandemic to generate charts that I post to Twitter.
I pull data from four different sources: The U.S. Census Bureau, the Covid Tracking Project, the Center for Systems Science and Engineering (CSSE) at Johns Hopkins, and the Illinois Department of Public Health (IDPH). Each one is formatted differently, and the quality varies. Census data is immaculately curated, but the format is weird by modern standards. The IDPH data is in a more modern format, but there are oddities. The Covid Tracking and CSSE data is assembled, often by hand, from a variety of sources, so a quantity such as number of tests can have a different meaning depending on its source — e.g. do confirmation tests count as another test or are they bundled into the first one?
Almost all the data follows a weekly cycle of highs and lows, presumably as data entry falls behind on the weekends and then catches up during the week, so I usually smooth it with a 7-day rolling average. Sometimes values change abruptly — perhaps because lost data is finally entered, or when the definition of the data changes (e.g. reporting a new type of test). There are days when the counts are the same as the day before, because no one entered new data. These are often followed by days when everything changes twice as much, as the report catches up to reality.
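The smoothing itself is simple. Here is a minimal sketch of a trailing 7-day average over a daily count; the handling of the first six days (averaging whatever data exists so far) is my choice for this sketch, not necessarily what any particular dashboard does:

```python
def rolling_average(values, window=7):
    """Trailing rolling mean over a list of daily counts.

    The first window-1 points average over the shorter run of
    data available so far, so the output has the same length
    as the input.
    """
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# Invented daily counts with a weekend dip followed by a Monday catch-up.
daily = [100, 110, 105, 120, 115, 40, 0, 190, 112, 108]
smoothed = rolling_average(daily)
print([round(v, 1) for v in smoothed])
```

The weekend zero and the Monday spike roughly cancel inside any 7-day window, which is exactly why the weekly reporting cycle mostly disappears from the smoothed series.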
Data about humans, created by humans, processed by humans, and reported by humans, is something of a mess.
Data scientists love to show off fancy models, cool machine learning applications, and elegant visualizations. But an awful lot of the real work of data science is cleaning the data. For example, the rt.live project, which calculates real-time estimates of Rt for Covid-19 in every state, makes dozens of corrections to the raw data before feeding it to the model. Data science textbooks have chapters on data preparation, and every major data science toolkit includes facilities for cleaning up data.
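One common shape such corrections take is a hand-maintained override table applied before the data reaches the model. This is a hedged sketch in that spirit only; the states, dates, and counts below are invented, and the real rt.live corrections live in that project's own source code and look nothing like this:

```python
# Hypothetical correction table: (state, date) -> corrected daily count.
# In practice each entry would carry a note explaining the fix, e.g.
# "backlog of old cases dumped on this date, redistributed manually".
CORRECTIONS = {
    ("IL", "2020-09-04"): 2128,  # invented value for illustration
}

def apply_corrections(rows):
    """Replace known-bad daily counts before modeling or charting."""
    return [
        {**row,
         "cases": CORRECTIONS.get((row["state"], row["date"]), row["cases"])}
        for row in rows
    ]

# Invented raw data with an implausible one-day spike.
raw = [
    {"state": "IL", "date": "2020-09-03", "cases": 1360},
    {"state": "IL", "date": "2020-09-04", "cases": 5594},
]
cleaned = apply_corrections(raw)
```

Keeping the overrides in a visible table, rather than silently editing the source files, preserves an audit trail: anyone can see exactly which raw values were judged wrong and what they were replaced with.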
My point is that elections are just another example of data collection — created by humans, processed by humans, and reported by humans — and there’s going to be a mess. Ballots will be mis-read and mis-recorded. Some ballots will be lost, sometimes in ones and twos, and sometimes by the boxload. It’s a messy process, and things will have gone wrong.
I’m not trying to excuse fraud here, and an accurate count should be the goal. Errors should be corrected. Misplaced ballots should be found and counted. Illegal votes should be discarded. But not every human error is a sign of fraud.