Lena H. Sun and Scott Wilson in the Washington Post have a pretty good article about the HealthCare.gov mess that tells us more about what’s going wrong. Let’s start with the specification for the load:
CGI built the shopping and enrollment applications to accommodate 60,000 users at the same time. U.S. Chief Technology Officer Todd Park has said that the government expected HealthCare.gov to draw 50,000 to 60,000 simultaneous users but that the site was overwhelmed by up to five times as many users in the first week.
That sounds reasonable, and it sounds like the sort of meaningless transient problem I was talking about on day 1. This is day 23, however, and the system is still failing, and was apparently doomed to do so, as I speculated earlier. Sun and Wilson’s article gives us a better idea of how that happened, starting with stress testing under a synthetic load:
Days before the launch of President Obama’s online health insurance marketplace, government officials and contractors tested a key part of the Web site to see whether it could handle tens of thousands of consumers at the same time. It crashed after a simulation in which just a few hundred people tried to log on simultaneously.
Despite the failed test, federal health officials plowed ahead.
If it never passed its stress tests, then it’s not surprising what happened when it opened for business on October 1:
When the Web site went live Oct. 1, it locked up shortly after midnight as about 2,000 users attempted to complete the first step, according to two people familiar with the project.
That’s way below the design spec of 60,000 users and potentially a sign of serious problems. On the first day, it could still be a glitch — misconfigured load sharing, caching disabled, some performance option turned off by accident — although the stress tests, had they passed, should have caught most of that. That there are still performance problems is a sign of deeper issues.
The president’s remarks reflected rising anxiety within his administration over the widening problems with the online enrollment process. “There’s no excuse for the problems,” he added, “and they are being fixed.”
To me, this sounds like a fairly routine software disaster. That’s not exactly an excuse, but this is hardly an unprecedented event. Nor was it unpredictable, given the combination of scope and deadline. And indeed, it was predicted:
The Centers for Medicare and Medicaid Services (CMS), the federal agency in charge of running the health insurance exchange in 36 states, invited about 10 insurers to give advice and help test the Web site.
About a month before the exchange opened, this testing group urged agency officials not to launch it nationwide because it was still riddled with problems, according to an insurance IT executive who was close to the rollout.
As a software engineer, probably the thing I find most damning is this:
Some key testing of the system did not take place until the week before launch, according to this person. As late as Sept. 26, there had been no tests to determine whether a consumer could complete the process from beginning to end: create an account, determine eligibility for federal subsidies and sign up for a health insurance plan, according to two sources familiar with the project.
This was the core use case for the system, the primary path for users. And no one had tested it until five days before the site opened. This is another sign of exactly the sort of late integration problem I was speculating about in my previous post. And in case you’ve ever wondered, this is exactly why so many software products still ship late: it’s better to slip the delivery date than to deliver a broken system.
But with the date set by Congress (or the implementing regulations), there wasn’t much they could do:
People working on the project knew that Oct. 1 was set in stone as a launch date. “We named it the tyranny of the October 1 date,” said a person close to the project.
They do seem to be working their way through the list of problems:
Initial problems centered on account registration, a function that takes place early in the process and was in part a responsibility of contractor QSSI. While that function has improved, it is not fixed, according to the person close to the project.
QSSI said that a critical component that involves identity management is “successfully handling current volumes,” said Matt Stearns, a spokesman for UnitedHealth Group, the parent company. He said the “entire federal marketplace” was overwhelmed by consumer interest at launch.
Of course, now that lots of users are making it past the first bottleneck, they are crowding into the next one:
Additional problems are now showing up in the shopping and enrollment parts of the process, applications that are largely the responsibility of CGI, the person said. Those issues would have shown up earlier if testing had been done sooner, the person said.
Yeah. That sounds about right.
This part is a little disturbing:
Obama said government officials are “doing everything we can possibly do” to repair the site, including 24-hour work from “some of the best IT talent in the country.”
(Actually, it’s a little disturbing that WaPo would put a link to one of their own stories into a quote from the President. He certainly didn’t say the link. And how do they know he was talking about the same thing their story was talking about?)
So is this:
“We are working around the clock to identify issues with the site, diagnose them and fix them,” said Joanne Peters, a spokeswoman for Health and Human Services.
If they’re talking about routine 24/7 operations staff, then they’re being deceptive. But if they’re talking about working the software development staff for long hours, then they’re pushing their people beyond the limits and are probably suffering productivity losses by now.
According to another story by Amy Goldstein, the government is bringing in extra staff to help:
The Obama administration said Sunday that it has enlisted additional computer experts from across the government and from private companies to help rewrite computer code and make other improvements to the online health insurance marketplace, which has been plagued by technical defects that have stymied many consumers since it opened nearly three weeks ago.
The additional staff may not be terribly helpful because of Brooks’s Law, a well-known observation in software development usually expressed as, “Adding manpower to a late software project makes it later.” This counterintuitive result has several causes.
First of all, many tasks can only be subdivided so far — nine women cannot make a baby in one month — so it may not be easy to find tasks for the extra staff to do.
Also, it takes time for software engineers to ramp up on a project (learning what all the existing pieces are, learning the procedural steps of the job, integrating into the teams), and teaching them these things soaks up the time of other members of the team, which is why growing a software team always slows it down at first.
Finally, any time you make a team larger, it increases the amount of overhead, producing diminishing returns to team size, which reduces the benefits of adding extra staff. Optimizing the teams may mean breaking up large teams into smaller ones, which will add to the ramp-up time as the teams adjust to their new roles. Eventually, the increased staff should make the team more productive, but it could take months.
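One common way to illustrate the overhead point is to count the possible pairwise communication channels on a team, which grows as n(n-1)/2. The sketch below is only a rough illustration of that arithmetic, not a model of the actual HealthCare.gov teams; the function name and the team sizes are made up for the example.

```python
def communication_paths(team_size: int) -> int:
    """Count the distinct pairs of people on a team of the given size."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    # Each of the n people can pair with n - 1 others; halve to avoid double counting.
    return team_size * (team_size - 1) // 2


# Hypothetical team sizes, chosen only to show how quickly the count grows.
for size in (5, 10, 20, 40):
    print(f"{size:>3} people: {communication_paths(size):>4} possible pairwise channels")
```

Doubling a team from 10 to 20 people roughly quadruples the number of possible channels (45 to 190), which is one reason splitting a large team into smaller ones can reduce overhead.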
It’s possible that the new teams are being brought in for some very specific, targeted purposes, or to take on looming projects in the near future to keep the current development staff focused on the existing problems, but nobody’s saying:
Even now, administration officials are declining to disclose many details about the debugging effort. They will not say how many experts — whom they describe as “the best and the brightest” — are on the team, when the team began its work or how soon the site’s flaws might be corrected.
Uhm, no offense intended, but “the best and the brightest” probably already have other work to do. Or else they should have been brought in six months ago. Not that it would have helped with this project, which appears to have been doomed for some time. I’m sure the “best and the brightest” would have told them that the project was going to blow its schedule, scope, and budget.
Actually, “the best and the brightest” are probably no smarter than people already on the team. Calling them that is just rhetorical fluff. So I can’t help thinking it must really suck to be on the HealthCare.gov development team. They would have seen this train wreck coming for months. And we know they tried to warn officials about the problems. But now that disaster has struck, government officials are effectively calling them “mediocre and stupid.”