500 million lines of code. That’s how big the source code is for HealthCare.gov, according to this article at the New York Times. That number has since been repeated in a CNN editorial by Julianne Pepitone.
That can’t be right.
I know for a fact you can build a healthcare enrollment site in under a million lines of code. HealthCare.gov does more than just handle enrollment, but I don’t believe it does 500 times more.
According to this other New York Times article, principal software development did not begin until early this year. It doesn’t seem plausible that developers could have written 500 million lines since then.
Windows 8 is rumored to be somewhere between 30 and 80 million lines of code, and Microsoft was developing it for over 20 years. Or take a look at this handy graph by web developer Alex Marchant. The codebase for the Debian 5 release with all packages is only about 325 million lines of code, and it includes not only an entire Linux operating system distribution, but also a collection of over 17,000 open source software packages. Could HealthCare.gov really be bigger than that gigantic collection of software?
There’s also the issue of cost. Depending which stories you believe, the HealthCare.gov site cost between $400 million and $600 million to build. At 500 million lines of code, that would work out to between $0.80 and $1.20 per line. That’s far too cheap. Lines of code is a pretty squishy metric, but production code should cost somewhere around $10 to $20 per line. If HealthCare.gov is really 500 million lines of source code, it should have cost billions.
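As a quick sanity check, the per-line arithmetic above can be sketched in a few lines of Python. The function names are made up for illustration. The dollar figures are the reported cost range and the rough $10 to $20 per line rule of thumb from the paragraph above, not measured data.

```python
# Rough cost-per-line check using the figures quoted in the text.
CLAIMED_LINES = 500_000_000  # the disputed line count


def cost_per_line(total_cost_dollars, lines=CLAIMED_LINES):
    """Dollars spent per line if the whole budget went into that many lines."""
    return total_cost_dollars / lines


def expected_cost(dollars_per_line, lines=CLAIMED_LINES):
    """Total cost implied by a given per-line rate."""
    return dollars_per_line * lines


# Reported cost range: $400 million to $600 million.
low = cost_per_line(400_000_000)
high = cost_per_line(600_000_000)
print(f"Implied cost per line: ${low:.2f} to ${high:.2f}")

# Rule-of-thumb range for production code: $10 to $20 per line.
print(f"Expected total cost: ${expected_cost(10):,} to ${expected_cost(20):,}")
```

At the rule-of-thumb rates, 500 million lines would cost several billion dollars, which is roughly ten times the reported budget or more.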
I tried a few online COCOMO calculators to see if I could come up with a ballpark cost estimate for a 500 million line program. One of them didn’t allow me to enter a number that large, and the other two both recommended a team of 11,500 people working for 52 years. At a salary of $75,000 per year (neglecting all overhead) that would cost around $45 billion, which is around 100 times the reported cost of HealthCare.gov.
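For readers who want to reproduce an estimate like this without an online calculator, here is a minimal sketch of the Basic COCOMO formulas. The choice of the "semi-detached" project mode is an assumption (the calculators mentioned above are not identified, and their settings are unknown), and the $75,000 salary is the figure from the paragraph above.

```python
# Basic COCOMO, semi-detached mode (the mode is an assumption):
#   effort (person-months) = 3.0 * KLOC ** 1.12
#   schedule (months)      = 2.5 * effort ** 0.35
A, B = 3.0, 1.12   # effort coefficients, semi-detached mode
C, D = 2.5, 0.35   # schedule coefficients, semi-detached mode


def basic_cocomo(kloc):
    """Return (effort in person-months, schedule in months, average staff)."""
    effort_pm = A * kloc ** B
    schedule_months = C * effort_pm ** D
    staff = effort_pm / schedule_months
    return effort_pm, schedule_months, staff


# 500 million lines = 500,000 KLOC (thousands of lines of code).
effort_pm, months, staff = basic_cocomo(500_000)
years = months / 12

# Salary-only cost at $75,000 per person-year, ignoring overhead.
salary_cost = (effort_pm / 12) * 75_000

print(f"Average staff: {staff:,.0f} people")
print(f"Schedule: {years:.1f} years")
print(f"Salary cost: ${salary_cost / 1e9:.1f} billion")
```

Under these assumptions the result lands close to the figures quoted above: about 11,500 people, about 52 years, and about $45 billion in salaries. COCOMO was never calibrated for projects anywhere near this size, so the output is best read as an order-of-magnitude illustration.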
So where did that number come from? The Times just attributes it to “one specialist,” who is otherwise unidentified, without explaining how he arrived at the figure. I don’t know anything more, but I can suggest a few possibilities:
(1) Someone just pulled the number out of thin air. It’s large, it’s impressive, but it’s totally meaningless.
(2) The number could include generated code. Many software development tools take specifications from a programmer and generate code. It could be that large parts of the system are written in a much more compact high-level specification language that is fed into a code generator to create all 500 million lines of code.
(3) Related to (2), the number could include static HTML web content as code. Sometimes, for performance reasons, pages that could be the result of a database query are actually generated in advance and served statically. For example, an insurance pricing system might break down the plan structure into 4 different plan levels across 10 different age bands in 75 geographic regions. Rather than generate each page on the fly as needed, the developers might generate all 3000 possible pricing pages in advance, so they can be served more quickly. If each page is 2000 lines of HTML, that alone could count as 6 million lines of code. Do this a few more times, and it wouldn’t be too hard to get to 500 million lines of “code.”
(Either of the preceding two possibilities would be misleading, because the true system complexity — and therefore development effort — is related to the size of the input to the code generator, not the size of its output.)
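The pre-generated pricing page example in possibility (3) comes down to simple multiplication, sketched below. The plan levels, age bands, regions, and lines per page are the hypothetical numbers from that example, not facts about HealthCare.gov.

```python
# Hypothetical pricing-page breakdown from the example in the text.
PLAN_LEVELS = 4
AGE_BANDS = 10
REGIONS = 75
LINES_PER_PAGE = 2000        # lines of HTML per generated page
CLAIMED_LINES = 500_000_000  # the disputed total

# One static page per combination of plan level, age band, and region.
pages = PLAN_LEVELS * AGE_BANDS * REGIONS
generated_lines = pages * LINES_PER_PAGE

# How many page sets of this size would add up to the claimed total
# (ceiling division on integers).
sets_needed = -(-CLAIMED_LINES // generated_lines)

print(f"Static pages: {pages:,}")
print(f"Generated lines of HTML: {generated_lines:,}")
print(f"Page sets of this size needed to reach 500 million: {sets_needed}")
```

A single page set like this accounts for about 6 million lines, so reaching 500 million by this route alone would take dozens of comparable sets, or far fewer sets with more pages or longer pages.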
(4) The line count could include a lot of duplication for some reason, perhaps due to poor factoring as part of a damn-the-maintenance push to get something out now. For example, maybe each state website has to be customized. The smart and maintainable way to do it is to have all the websites share common code except for the (say) 5% that has to be customized. Thus if the base website is 10 million lines of code, then there would be 9.5 million lines of common code, and about 17 million lines of custom code (a half-million lines for each of the 33 states plus D.C. on HealthCare.gov) for a total of about 25 million lines of code.
However, finding the 5% of the code that has to be changed and designing the remaining 95% so it can be shared across all sites is relatively hard work. (It’s a common software engineering process, but it still takes time and effort.) It might be faster to build one website from 10 million lines of code and then fork off and customize 34 copies — one for each state that uses the federal website — for an apparent total of 350 million lines of code, even though only 25 million lines required effort to develop. (But all of it will require effort to maintain.) Again, it wouldn’t be hard to get to 500 million lines this way.
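The difference between the shared-code design and the fork-and-customize shortcut can be sketched as follows. The 10 million line base, the 5% customization share, and the 34 sites are the illustrative figures from possibility (4), not real measurements. The text rounds its totals, so the exact results differ slightly.

```python
# Illustrative figures from the text, not real measurements.
BASE_LINES = 10_000_000  # size of one complete site
CUSTOM_PERCENT = 5       # share of the code that differs per site
SITES = 34               # sites served by the federal exchange

custom_per_site = BASE_LINES * CUSTOM_PERCENT // 100
common_lines = BASE_LINES - custom_per_site

# Shared design: one copy of the common code plus per-site custom code.
shared_total = common_lines + SITES * custom_per_site

# Forked design: every site carries its own full copy of the code.
forked_total = SITES * BASE_LINES

print(f"Custom lines per site: {custom_per_site:,}")
print(f"Shared design total:   {shared_total:,}")
print(f"Forked design total:   {forked_total:,}")
print(f"Apparent inflation:    {forked_total / shared_total:.1f}x")
```

With these inputs the forked codebase looks more than ten times larger than the shared one, even though the amount of code anyone actually had to write is about the same.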
(5) The line count could represent all the cooperating systems behind HealthCare.gov, including pre-existing ones. One of the most complicated aspects of HealthCare.gov is that it has to interact with a lot of other data sources, including (I’ve heard) the Internal Revenue Service, Homeland Security, the Social Security Administration, the Health and Human Services Department, the Treasury Department, the Department of Justice, and all the insurance carriers.
Although these external data repositories undoubtedly required some work to interact with HealthCare.gov, for the most part the software systems already exist. So perhaps someone was discussing the complexity of all the interactions, and they were asked to estimate the size of the entire interacting system — HealthCare.gov and every system it talks to. 500 million lines might be a reasonable guess for that.
The first three of these explanations are misleading at least, and at worst they are a manipulative attempt to explain the disaster. The fourth and fifth possibilities could be misleading, but are likely the result of miscommunication rather than an attempt to mislead. Of course, there’s always another possibility: (6) I could be wrong, and the system really could have 500 million lines of code. Awful, awful code.
Carl H says
I think the following graphic illustrates your point nicely:
http://www.informationisbeautiful.net/visualizations/million-lines-of-code/
I believe scenario #5 is most likely and, as you say, the result of a miscommunication rather than an attempt to mislead.
Long time reader, first time commenter. I very much enjoy your postings and have learnt a lot from them, thanks.
Mark Draughn says
That’s a terrific illustration of the silliness of the 500 million line number. It’s just so much larger than those other gigantic software systems. It has to be either pulled out of thin air or the combination of many other systems. They say there are 50 billion lines of COBOL that run the world. I’d be willing to bet that 500 million of them are involved in one way or another with HealthCare.gov.
satex says
The New York Times article says 5 million lines of code, FIVE. Not 500. What caught my eye here and caused me to actually read the article is that 500 million lines of code would mean that ‘the government’ would have decided there would be national health care back during Reagan’s administration and every programmer on the planet would be aware of its coming implementation. Now granted, I’m sure that work on the code started long before Obama came along, as did the several hundred pages of legislation, but 500 million lines of code is an absurd number. It’s 5 million.
Mark Draughn says
The Times article says 5 million lines of code may have to be rewritten, but later it includes an estimate from an unnamed source of 500 million total lines.
You make a good point about when such a project would have to be started, which is another argument for option (5), that 500 million lines of code is all the cooperating systems combined, not just the new ACA code. Some of those systems probably have code bases that were started back in the 1960s.