In Software Engineering, Sometimes Failure Is the Only Option

On day one of the HealthCare.gov roll-out, I explained that first-day glitches in a large production website are meaningless. With only a few specialized exceptions (and some lucky ones), things always go wrong on the first day. It’s a normal part of the shakedown process, and not necessarily a reason to get upset. However, just because first-day glitches are normal, that doesn’t mean there aren’t also real problems with the site. Only time will allow us outsiders to tell if early problems are roll-out issues or evidence of more serious defects.

It’s been two weeks, and now we know. HealthCare.gov is a software disaster, and insiders are starting to talk:

“These are not glitches,” said an insurance executive who has participated in many conference calls on the federal exchange. Like many people interviewed for this article, the executive spoke on the condition of anonymity, saying he did not wish to alienate the federal officials with whom he works. “The extent of the problems is pretty enormous. At the end of our calls, people say, ‘It’s awful, just awful.’ ”

That’s from an excellent short article in the New York Times by Robert Pear, Sharon LaFraniere, and Ian Austen that points to some of the reasons for the failure.

Most of the coverage of the problems at HealthCare.gov has repeated the point that the government had three years to get the website working. That sounds like a long time, and it is. I’ve been writing software for over 25 years, including about 8 years working on government contract projects and another 9 years helping to build healthcare enrollment websites. I think that three years is plenty of time for a large-enough team to write the software, build the website, load the data, test the system, and launch our national healthcare enrollment system.

The problem is that they didn’t really have three years.

Deadline after deadline was missed. The biggest contractor, CGI Federal, was awarded its $94 million contract in December 2011.

That leaves a little less than two years. And it’s only the beginning of the problem, because the Affordable Care Act is not a software requirements document. Much of it just established the regulatory process that would spell out the details of all aspects of the healthcare exchanges, including the software requirements. That took time. A lot of time, and maybe for some questionable reasons:

To avoid giving ammunition to Republicans opposed to the project, the administration put off issuing several major rules until after last November’s elections. The Republican-controlled House blocked funds. More than 30 states refused to set up their own exchanges, requiring the federal government to vastly expand its project in unexpected ways.

The result of all this was a very late start:

But the government was so slow in issuing specifications that the firm did not start writing software code until this spring, according to people familiar with the process. As late as the last week of September, officials were still changing features of the Web site, HealthCare.gov, and debating whether consumers should be required to register and create password-protected accounts before they could shop for health plans.

This explains a lot. Changing features during the last week before going live isn’t unusual on any website, but changing major operating rules, like whether you have to log in, is a sign of serious problems. (Unless they were aware early on that the rules might change and implemented it both ways.)

In addition to the late start and shifting requirements — problems common to many software engineering projects — large projects like these have their own distinctive problems. Historically, the most common killer of large software projects is integration, and the healthcare exchanges appear to have foundered there:

One highly unusual decision, reached early in the project, proved critical: the Medicare and Medicaid agency assumed the role of project quarterback, responsible for making sure each separately designed database and piece of software worked with the others, instead of assigning that task to a lead contractor.
[…]
While some branches of the military have large software engineering departments capable of acting as the so-called system integrator, often on medium-size weapons projects, the rest of the federal government typically does not, said Stan Soloway, the president and chief executive of the Professional Services Council, which represents 350 government contractors. CGI officials have publicly said that while their company created the system’s overall software framework, the Medicare and Medicaid agency was responsible for integrating and testing all the combined components.

These problems should have been obvious to project managers. And they were:

Confidential progress reports from the Health and Human Services Department show that senior officials repeatedly expressed doubts that the computer systems for the federal exchange would be ready on time, blaming delayed regulations, a lack of resources and other factors.
[…]
By early this year, people inside and outside the federal bureaucracy were raising red flags. “We foresee a train wreck,” an insurance executive working on information technology said in a February interview. “We don’t have the I.T. specifications. The level of angst in health plans is growing by leaps and bounds. The political people in the administration do not understand how far behind they are.”
The Government Accountability Office, an investigative arm of Congress, warned in June that many challenges had to be overcome before the Oct. 1 rollout.
“So much testing of the new system was so far behind schedule, I was not confident it would work well,” Richard S. Foster, who retired in January as chief actuary of the Medicare program, said in an interview last week.

The response from higher officials just kills me:

But [the chief website architect’s] superiors at the Department of Health and Human Services told him, in effect, that failure was not an option, according to people who have spoken with him.

Sorry, no. Software engineering just doesn’t work that way. No amount of willpower, positive thinking, or self-confidence will make a failing software project into a success. Neither will threats of unemployment. In fact, decades of experience show that once a project is well into the development phase, it’s almost impossible to turn it around if it is late and over budget.

In the early days of software engineering, we used to call this the software crisis. As computers got more powerful, and more able to communicate with each other, it became possible to run much larger software systems on them than ever before. But as the software projects got larger, more and more of them started to fail by falling behind schedule, going over budget, being riddled with defects, or all three. As I mentioned earlier, the integration phase was a big problem. Quite often what would happen is that the teams developing various parts of the software system would make good progress, but when they tried to integrate all the components into a working system, it would fall apart for reasons that were complex and hard to fix.

An even bigger problem arose if the requirements weren’t stable. The project teams would develop requirements documents, they would be approved by the customer (which might be internal), the developers would start coding the system, and then someone would discover that the requirements were incomplete, or wrong, or the customer would decide to change them. This was a huge problem: Changing requirements in the requirements document was easy, but changing requirements after coding had started was time-consuming and expensive.

A lot of effort in the early years was devoted to developing methodologies for capturing requirements and checking them for completeness and correctness, and also for developing thorough design documents that specified all the system interactions, so that software components would integrate smoothly. A few extremely well-run teams managed to do this very well (e.g., the software team for NASA’s Space Shuttle), and the industry was focused on the idea of improving requirements and design processes as a way out of the software crisis.

This entire software development process, from requirements to design to coding to integration to testing to deployment, was referred to as the waterfall model (because diagrams of it show each phase flowing down into the next). But if you look at all the major successful websites (Amazon, Facebook, Twitter, LinkedIn), it’s likely that none of them were developed this way.

What happened was that the dominant methodology of software development evolved into something called agile development. Software engineers decided to accept that unstable requirements are an inevitable part of the process and to eliminate the big integration step at the end. They start by building a very small piece of software that works. It is integrated, tested, and deployed (at least on a limited basis). Users get to see it and play with it right away, which allows them to give feedback, which is used to plan the next iteration of development. Each iteration of the development cycle takes somewhere between a week and a month. And at the end of each iteration, the development team has a full product that is integrated, tested, and ready to deploy.

Initially, the product is only shown to internal customers — development managers, product managers, company executives. As the developers keep iterating, it slowly acquires new functionality, piece by piece. Changing requirements are no longer a big problem, since reviewing and adjusting them is a built-in step at the end of each iteration, as developers plan for the next iteration. Bad ideas can be discovered and discarded early, and good ideas can be recognized and developed further.

At some point, the product is deemed good enough, and the software is released to the public. Often the initial release has reduced functionality and is released to a limited user group. As the product evolves through iteration after iteration and acquires new functionality, it gets released to larger and larger groups of users until eventually everyone can use it. This slow release method allows the development team to test their ideas in the real world, and it also reduces the stress of suddenly scaling up the system to full size.
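To make the slow release idea concrete, here is a minimal sketch of one common way to gate a feature by percentage. It is an illustration under my own assumptions, not a description of how any site mentioned here actually works: the function names `rollout_bucket` and `is_enabled`, the `user-N` IDs, and the stage percentages are all hypothetical. Each user ID is hashed into a stable bucket from 0 to 99, so a user admitted at an early stage stays admitted as the percentage grows.

```python
import hashlib


def rollout_bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets) (hypothetical helper)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then reduce to a bucket.
    return int.from_bytes(digest[:8], "big") % buckets


def is_enabled(user_id: str, percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    return rollout_bucket(user_id) < percent


# Hypothetical user population and release stages, for illustration only.
users = [f"user-{n}" for n in range(1000)]
stages = [1, 10, 50, 100]

for percent in stages:
    admitted = sum(is_enabled(u, percent) for u in users)
    print(f"{percent:>3}% stage: {admitted} of {len(users)} users admitted")
```

Because the bucket depends only on the user ID, raising the percentage only ever adds users. Nobody who already has access loses it between stages, which is what lets the team scale up gradually and fix problems quietly along the way.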

Unfortunately, when your product’s functionality and release date are both defined by act of Congress (and its regulatory agents), the iterative method doesn’t help much. Neither do the politics:

Nor was rolling out the system in stages or on a smaller scale, as companies like Google typically do so that problems can more easily and quietly be fixed. Former government officials say the White House, which was calling the shots, feared that any backtracking would further embolden Republican critics who were trying to repeal the health care law.

Critics have been eager to paint the disastrous HealthCare.gov roll-out as a failure of ObamaCare, but the problems are really the result of the government procurement process and of the inability of legislatively defined software projects to benefit from modern design processes, especially in a political environment in which any major change in requirements would require the approval of a divided Congress.
