EPIC SOFTWARE FAILURES David Stotts UNC Computer Science

Bugs vs. Poor Design
Some failures come from hackers exploiting design flaws
Some failures come from logic flaws
Some failures come from deliberate decisions made by programmers (Y2K)
Some might not be accidental failures

Other sources of failure
Some failures come from code that accurately implements incorrect or inconsistent specs
Some failures come from code that incorrectly implements correct specs
Some failures are allowed by (not caused by) incomplete testing (through incompetence or by plan)
Some failures come from mixing old code with new: old code mismatched with code written from new specs

Mariner I Venus Probe July 22, 1962
Collaboration among JPL, NASA, USAF
Bug in guidance system software causes rocket to diverge from intended path on launch
Range safety officer destroys it over the Atlantic
Investigation finds that a formula written in pencil on paper was incorrectly transcribed into computer code: a wrong punctuation character in a single line of code
Caused computer to miscalculate the rocket's trajectory
Exact error is unclear; different stories are around
Best account seems to be a missing overbar in an equation when it was coded
The overbar became "the most expensive hyphen in history" for public consumption
Details http://en.wikipedia.org/wiki/Mariner_1

What is the failure mode then?
Code and requirements/specs don't agree
Coder to blame? Or requirements writer? Specs on a napkin?
Note that this event predates the days when specs/requirements were laid out explicitly as a step in development

Second launch
Two days after that launch, the backup probe and booster (Atlas vehicle 179D) were rolled out to LC-12. Mariner 2 nearly met the same fate as its predecessor when one of the Atlas's verniers moved to maximum stop shortly before booster engine cutoff. This caused a rapid roll of the launch vehicle that quickly approached one revolution per second. With the structural integrity of the booster in jeopardy, the range safety officer prepared to issue the destruct command. Almost as soon as it started, the rolling motion stopped and the launch proceeded uneventfully. The incident was traced to a loose wire in the guidance computer which was pushed back into place by the centrifugal force of the roll.

Trans-Siberian Gas Pipeline 1982
Not clear this is an accidental error: alleged deliberate sabotage
Not clear it's even true: a tale of CIA, KGB, Canada
CIA got word of the Soviets stealing Canadian control systems for a huge trans-Siberian gas pipeline
CIA put a bug in the software before it was stolen
Bug causes pumps to overwork and pressure controls to fail: super overpressure and boom
Reportedly visible from space, biggest non-nuclear explosion ever
Details http://en.wikipedia.org/wiki/Siberian_pipeline_sabotage

What is the failure mode then?
Code worked as spec'd, requirements were correct
Code was deliberately deceptive
Appeared as a normal controller, while behavior was different from what monitors showed
And now Stuxnet?

Stuxnet virus June 2010
Specifically attacks Siemens Step7 software running under Windows
Code for centrifuge control (uranium enrichment)
Ruined 1/5 of Iran's centrifuges
Created overspeed in the rotors while sending normal readings to the monitors

What is the failure mode?
Another example of a deliberately deceptive program

Therac-25 Medical Accelerator 1985-1987
Radiation therapy device malfunctions, delivers lethal doses at several facilities
The 25 was an improved version of an older model
It could deliver beta particles (electron beam) or X-rays
Older model's mechanical fail-safes were replaced with software fail-safes
It was possible to set the electron beam to X-ray strength before the metal target was in place

Researchers who investigated the accidents found the following institutional causes:
AECL did not have the code independently reviewed.
AECL did not consider the design of the software during its reliability modeling.
The system documentation did not adequately explain error codes.
AECL personnel initially did not believe complaints.

Investigators also found several engineering issues:
No hardware interlocks to prevent the electron beam from operating in its high-energy mode without the target in place.
Reused software from older models that had hardware interlocks and were therefore not as vulnerable to the software defects.
Hardware provided no way for the software to verify that sensors were working correctly.
The equipment control task did not properly synchronize with the operator interface task, so that race conditions occurred if the operator changed the setup too quickly. This was missed during testing, since it took some practice before operators were able to work quickly enough for the problem to occur.
Software set a flag variable by incrementing it. Occasional arithmetic overflow caused the software to bypass safety checks.

What is the failure mode?
Old code, new specs
Failure to code review
Failure to test thoroughly
Lack of mechanical safety (no software will be perfect)

Morris Worm Nov 2, 1988
Early malware, infected 10% of the computers on the ARPANET (no real internet at the time; mostly universities and govt. labs, some companies)
Cleanup costs at each site ranged from $200 to $53,000 (in 1988 dollars)
99-line program written to infiltrate DEC VAX and Sun 3 machines and do nothing but make copies of itself
Robert Morris was a 1st-year grad student at Cornell
Experiment gone wrong: no malicious code in it; Morris's goal was to show that sendmail could be used to propagate a worm (demo a software flaw)
Exploited a gets() buffer overflow in the Unix finger daemon, a debug feature in sendmail, and other weaknesses
gets() does not check its input string to see if it fits the allocated char array, so send too much input and you get into memory areas beyond the array
Bogged machines down to near stoppage simply servicing the worm's behavior of copying itself
Worm error: it asked if a machine already had the worm, and made no copy if yes. However, on every 7th yes it copied anyway (to prevent admins killing the worm by just saying yes). 1 in 7 was not slow enough. Not nearly.
Multiple software errors: gets(), Morris's 1-in-7 copy rule, failure to test before release
Effective due to homogeneity of the '80s net
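
The gets() problem described above can be sketched without writing unsafe C. The following is a minimal simulation in Python, not the worm's code: memory is modeled as one flat 16-byte array in which an 8-byte input buffer sits directly in front of other program state. The layout, the sizes, and the names `make_memory`, `unchecked_read`, and `bounded_read` are all assumptions made for illustration.

```python
BUFFER_SIZE = 8  # hypothetical size of the input buffer


def make_memory():
    """Flat memory: an 8-byte input buffer followed by 8 bytes of other state."""
    return bytearray(b"\x00" * BUFFER_SIZE + b"RETADDR!")


def unchecked_read(memory, data):
    """Behaves like gets(): copies every input byte and ignores the buffer size."""
    for i, byte in enumerate(data):
        memory[i] = byte


def bounded_read(memory, data, size=BUFFER_SIZE):
    """Behaves like fgets(buf, size, ...): keeps at most size - 1 bytes plus a terminator."""
    kept = data[: size - 1]
    for i, byte in enumerate(kept):
        memory[i] = byte
    memory[len(kept)] = 0


attack = b"A" * 12  # four bytes more than the buffer can hold

overflowed = make_memory()
unchecked_read(overflowed, attack)   # spills into the neighboring state

protected = make_memory()
bounded_read(protected, attack)      # neighboring state is untouched
```

After the unchecked read, the first four bytes of the neighboring state have been replaced by attacker-supplied input; a real exploit chooses those bytes so that the program jumps into code of the attacker's choosing.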
Details here http://www.eweek.com/c/a/Security/Who-Let-The-Worms-Out/

What is the failure mode?
Failure was intentional, exploiting OS design flaw
No testing in isolation for intentional behavior
Poor estimation of parameters (speed)

Kerberos Random Number Gen 1988-1996
Network authentication protocol, used in OS and shared file system manager
Secret-key encryption; a random number generator was used to create user tickets
Random generator was predictable, so a user could be spoofed and files accessed
Found at Purdue by Gene Spafford's students
Similar to the issue with the Netscape random number flaw in SSL, 1994
Details http://tech.mit.edu/V116/N11/kerberos.11n.html

What is the failure mode?
Dependency on unreliable random number generator
Random number generator did not correctly implement its specs (not random enough)

AT&T Network Outage January 15, 1990
A bug in a new release of the software that controls AT&T's #4ESS long distance switches causes these mammoth computers to crash when they receive a specific message from one of their neighboring machines
The problem message? One that the neighbors send out when they recover from a crash. Oops.
The first domino: a switch in New York crashes and reboots, causing its neighboring switches to crash, then their neighbors' neighbors, and so on.
114 switches fail, crashing and rebooting every 6 seconds
60,000 customers with no service for 9 hours
The fix: reload previous software version
Details http://www.phworld.org/history/attcrash.htm

What is the failure mode?
Insufficient testing of new software before release: bug in message handling was not found
Kismet: the huge crash cascade was pure bad luck, in that the crash recovery message was the bad one
Perhaps testing can try to anticipate risks like this in the code

Patriot Missile Failure Feb. 1991
American Patriot missile battery in Dhahran, Saudi Arabia, failed to track and intercept an incoming Iraqi Scud missile
It hit a barracks: 28 killed, over 100 wounded
Tracking error came from two different versions of an internal clocking routine; one was from the older missile system (from assembly code), the other was an updated routine for newer, faster missiles.
The mismatch created a time skew large enough to put the Scud more than half a kilometer off its computed location
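
The size of that skew can be reproduced with a little arithmetic. The sketch below follows the commonly cited account of the incident rather than anything stated on the slides: the system counted time in tenths of a second, and the constant 0.1 was chopped to fit a 24-bit fixed-point register. The choice of 23 fraction bits (which reproduces the published per-tick error of about 0.000000095), the 100 hours of uptime, and the Scud speed of 1676 m/s are assumptions taken from that account.

```python
from fractions import Fraction

FRACTION_BITS = 23            # assumed number of bits kept after the binary point
SCUD_SPEED_M_PER_S = 1676.0   # assumed Scud speed from the commonly cited account


def chopped_tenth(bits=FRACTION_BITS):
    """Return 0.1 truncated (not rounded) to `bits` binary fraction digits."""
    scale = 2 ** bits
    return Fraction(int(Fraction(1, 10) * scale), scale)


def clock_error_seconds(hours_up, bits=FRACTION_BITS):
    """Accumulated clock error after counting tenths of a second for `hours_up` hours."""
    ticks = hours_up * 3600 * 10
    per_tick_error = Fraction(1, 10) - chopped_tenth(bits)
    return float(ticks * per_tick_error)


def tracking_error_meters(hours_up, speed=SCUD_SPEED_M_PER_S):
    """Distance the target travels during the accumulated clock error."""
    return clock_error_seconds(hours_up) * speed


skew = clock_error_seconds(100)       # roughly a third of a second
miss = tracking_error_meters(100)     # several hundred meters
```

The per-tick error is tiny, which is why the problem only showed up after the battery had been running for days without a restart.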
http://www.ima.umn.edu/~arnold/disasters/patriot.html

What is the failure mode?
Mismatch of new specs and old specs

Intel Pentium FP Divide 1993
Design error causes Intel's highly promoted Pentium chip to make mistakes when dividing floating-point numbers that occur within a specific range.
Example: dividing 4195835.0/3145727.0 yields 1.33374 instead of 1.33382, an error of 0.006 percent
Caused by a few missing values in a lookup table
http://www.intel.com/support/processors/pentium/sb/CS-013007.htm
Discovered by Prof. Thomas Nicely, Lynchburg College in VA.
http://www.trnicely.net/pentbug/bugmail1.html
Affects few users, but it becomes a public relations nightmare.
3 to 5 million defective chips in circulation
Intel offers to replace chips only for customers who can prove they need high accuracy!!
Relents and replaces the chip for any customer
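
The 0.006 percent figure quoted for the division example can be checked directly. The snippet below is a small sketch, assuming it runs on hardware with a correct divider: it computes the relative error of the flawed quotient given in the text and shows a divide-then-multiply-back self-check of the kind that was widely used to spot affected chips. The names `relative_error` and `division_looks_sound`, and the tolerance, are illustrative choices.

```python
X, Y = 4195835.0, 3145727.0
FLAWED_QUOTIENT = 1.33374   # value quoted in the text for the defective chip


def relative_error(observed, expected):
    """Relative difference between an observed and an expected value."""
    return abs(observed - expected) / abs(expected)


def division_looks_sound(x=X, y=Y, tolerance=1e-9):
    """Divide, multiply back, and compare the result with the original operand."""
    return relative_error((x / y) * y, x) <= tolerance


correct = X / Y                                     # about 1.33382
flaw_size = relative_error(FLAWED_QUOTIENT, correct)  # about 6e-5, i.e. 0.006 percent
flawed_chip_would_pass = relative_error(FLAWED_QUOTIENT * Y, X) <= 1e-9
```

A check like this only covers one operand pair; the defect affected a narrow range of operands, which is part of why it escaped notice for so long.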
Cost: $475 million
Intel makes some recalled defective chips into keychains (oh yeah, they have $475 million to recoup)
http://pinterest.com/chipsetc/cpu-computer-chip-keychains/
http://www.chipsetc.com/intel-keychains-page-3.html
For fun http://www.maa.org/mathland/mathland_5_12.html

What is the failure mode?
Missing data values in lookup table: software? Or hardware? Firmware
Testing? Can testing be made exhaustive, at least on lookup table values?

The Ping of Death 1995/1996
Denial of service (DoS) attack caused by an attacker deliberately sending an IP packet larger than the 65,535 bytes allowed by the IP protocol
Normal ping packet is 32 to 84 bytes
Long packets are broken up and sent as smaller fragments, then reassembled; no check was done to make sure they add up to <= 65,535
Affected Unix, Linux, Mac, Windows, printers, and routers running early TCP/IP implementations
OS didn't know how to handle a too-long packet and would freeze, crash, or do unpredictable stuff
Yet another buffer overflow problem -- a classic in C due to the nature of arrays (buffers) in the language
Newer versions of TCP/IP are fixed, but some net firewalls still deny pings because of this
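
The missing reassembly check is easy to state in code. The sketch below models each fragment as a pair (offset in 8-byte units, payload length), which is how IPv4 expresses fragment positions, and computes where the fragment would end in the reassembled packet. The function names and the sample fragments are hypothetical; a real stack does this inside its reassembly routine.

```python
MAX_IP_PACKET = 65_535   # limit stated in the text
FRAGMENT_UNIT = 8        # IPv4 fragment offsets count 8-byte units


def fragment_end(offset_units, payload_len):
    """Byte position just past the end of a fragment in the reassembled packet."""
    return offset_units * FRAGMENT_UNIT + payload_len


def safe_to_reassemble(fragments):
    """The check early stacks skipped: no fragment may end past the limit."""
    return all(fragment_end(off, n) <= MAX_IP_PACKET for off, n in fragments)


def overflow_bytes(fragments):
    """Bytes an unchecked reassembly would write past a 65,535-byte buffer."""
    end = max(fragment_end(off, n) for off, n in fragments)
    return max(0, end - MAX_IP_PACKET)


normal_ping = [(0, 84)]
# Last fragment starts near the top of the offset range and carries 1000 bytes.
oversized_ping = [(0, 1480), (8189, 1000)]

spill = overflow_bytes(oversized_ping)
```

Each fragment on its own is a perfectly legal size, which is why the oversized total only becomes visible at reassembly time.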

What is the failure mode?
Deliberate exploit of design flaw in OS
Flaw is fixed, but there is no way to force old code to be eliminated

Ariane 5 Flight 501 June 4, 1996
European (ESA) rocket reuses code from Ariane 4
The 5's faster engines trigger a bug in an arithmetic routine in the flight computer
Convert 64-bit FP to 16-bit signed int
The 5 generates larger 64-bit values, causing overflow in the 16-bit int (a variable representing horizontal bias)
Controller traps due to overflow and crashes during launch, so backup computer is called into play
Backup had already crashed: it was running the same software (redundancy was in case of hardware failure)
Boom 37 seconds into flight (unmanned), losing $370 million rocket and scientific Cluster payload
Video http://www.youtube.com/watch?v=gp_D8r-2hwk
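
The conversion at the heart of this failure can be sketched in a few lines. The example below is not the flight code; it shows three ways a 64-bit float can be squeezed into a 16-bit signed integer: a silent wraparound, a checked conversion that raises an error (comparable to the trap described above), and a saturating conversion, which is one possible way of handling the out-of-range case. The function names and the sample value 40000.0 are assumptions for illustration.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def wrap_to_int16(value):
    """Unchecked conversion: truncate, then keep only the low 16 bits."""
    low = int(value) & 0xFFFF
    return low - 0x10000 if low > INT16_MAX else low


def checked_to_int16(value):
    """Conversion that refuses out-of-range values by raising an error."""
    truncated = int(value)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed int")
    return truncated


def saturating_to_int16(value):
    """Conversion that clamps out-of-range values to the nearest limit."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


reading = 40000.0                      # hypothetical value, larger than Ariane 4 ever produced
wrapped = wrap_to_int16(reading)       # nonsense negative number
clamped = saturating_to_int16(reading)

try:
    checked_to_int16(reading)
    trapped = False
except OverflowError:
    trapped = True                     # an unhandled trap like this stops the program
```

Raising an error is only safer than wrapping if something is prepared to handle it; here both the primary and the backup computer ran the same code and stopped the same way.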
From Pentium chip error link http://www.maa.org/mathland/mathland_5_12.html
Interestingly, the launch failure of the Ariane 5 rocket, which exploded 37 seconds after liftoff on June 4, 1996, occurred because of a software error that resulted from converting a 64-bit floating point number to a 16-bit integer. The value of the floating point number happened to be larger than could be represented by a 16-bit integer. The overflow wasn't handled properly, and in response, the computer cleared its memory. The memory dump was interpreted by the rocket as instructions to its rocket nozzles, and an explosion resulted.

What is the failure mode?
Old code not compatible with new specs, yet new system mixed old and new code

Mars Pathfinder Shutdown Aug 1997
Onboard computer would just shut down and reboot; happened many times early on in the mission
Traced to a timing problem in the multitasking operating system
Processes have priorities, and very occasionally a medium-priority task could start in the tiny time slice when a high-priority task was being interrupted for a low-priority task
The medium task would prevent the low from running, and the high was waiting for the low to finish and give it data
Software handled this by noticing a stuck task and rebooting
Parallel system on Earth was used to find the bug, patches sent up to Mars
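
The scheduling problem described above is usually called priority inversion, and it can be reproduced with a toy scheduler. The sketch below is a simplified single-CPU simulation, not the spacecraft's software: three tasks, one shared lock held by the low-priority task, and a watchdog that fires if the high-priority task takes too long. The tick counts, the watchdog limit, and all names are assumptions. The `priority_inheritance` switch models the commonly reported fix, in which the lock holder temporarily runs at the priority of the task waiting on it.

```python
WATCHDOG_TICKS = 6   # hypothetical deadline for the high-priority task


def ticks_until_high_finishes(priority_inheritance, medium_work=5):
    """Run a toy scheduler and return the tick at which the high task completes."""
    tasks = {
        "low": {"prio": 1, "work": 2, "needs_lock": True},
        "medium": {"prio": 2, "work": medium_work, "needs_lock": False},
        "high": {"prio": 3, "work": 1, "needs_lock": True},
    }
    lock_owner = "low"   # low took the lock just before the other tasks woke up

    def effective_priority(name):
        # With inheritance, the lock holder borrows the waiting task's priority.
        if priority_inheritance and name == lock_owner:
            return tasks["high"]["prio"]
        return tasks[name]["prio"]

    def runnable(name):
        task = tasks[name]
        blocked = task["needs_lock"] and lock_owner not in (None, name)
        return task["work"] > 0 and not blocked

    tick = 0
    while tasks["high"]["work"] > 0:
        current = max((n for n in tasks if runnable(n)), key=effective_priority)
        tasks[current]["work"] -= 1
        tick += 1
        if current == lock_owner and tasks[current]["work"] == 0:
            lock_owner = None   # lock released when its holder finishes
    return tick


def watchdog_fires(priority_inheritance):
    """True if the high task misses the deadline, which would trigger a reboot."""
    return ticks_until_high_finishes(priority_inheritance) > WATCHDOG_TICKS


without_fix = ticks_until_high_finishes(priority_inheritance=False)
with_fix = ticks_until_high_finishes(priority_inheritance=True)
```

Without inheritance the medium task runs to completion while the high task waits on a lock it can never get; with inheritance the low task finishes first and releases the lock almost immediately.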
No huge dollar loss, but each reboot caused a day's delay in data transmission
https://www.rapitasystems.com/blog/what-really-happened-to-the-software-on-the-mars-pathfinder-spacecraft

What is the failure mode?
Inconsistent / incorrect specs, correct code perhaps
Perhaps incorrect multitasking design allows race conditions, or the multitasking OS does
Human error: Engineers later confessed that system resets had occurred during pre-flight tests. They put these down to a hardware glitch and returned to focusing on the mission-critical landing software

Natl Cancer Inst., Panama City November 2000
Radiation planning software by Multidata Systems International
Technicians draw on screen the locations of blocks that shield healthy tissue from radiation
Software allows 4 blocks; Panama docs usually want to use 5
They discover they can trick the system by drawing all 5 blocks as one large block with a hole in the center
Problem: draw the hole in one direction, get the correct gamma ray dosage calculated
Draw the hole in the other direction, get 2x the proper gamma ray dose
8 patients die, 20 badly injured
Physicians, who were legally required to double-check the computer's calculations by hand, are indicted for murder.
Details http://www.baselinemag.com/c/a/Projects-Processes/We-Did-Nothing-Wrong/

What is the failure mode?
Undocumented feature, purposeful or not, from the user's POV
But this was really a code error, not well tested
In code with a larger user base these are found and fixed (open source, e.g.)

Mars Climate Orbiter September 23, 1999
Remember in physics class, when the instructor was so adamant about units?
You gave an exam answer of 2.5 and they would write in red all over it "2.5 what?? Weeks? Puppies? Jelly Donuts?"
Turns out it can really matter
Ground-based software produces output in pound-seconds (imperial) instead of the newton-seconds (metric) specified in the NASA contract with Lockheed
Flight controller on Orbiter written to take thrust instructions in metric units; ground software generating those instructions sends them in imperial units
"What we have heyah is failure to communicate"
Wrong thrust causes wrong orbit; craft ends up too low, contacts the atmosphere, and burns
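
One way to keep a bare number like 2.5 from crossing an interface is to make the unit part of the type. The sketch below is an illustration of that idea, not NASA's or Lockheed's code: an `Impulse` value always stores newton-seconds internally, must be built through a constructor that names its unit, and is the only thing the receiving function accepts. The class, the function names, and the sample value are hypothetical.

```python
from dataclasses import dataclass

LBF_S_TO_N_S = 4.4482216152605   # newton-seconds per pound-force second


@dataclass(frozen=True)
class Impulse:
    """An impulse that always stores its magnitude in newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value):
        return cls(float(value))

    @classmethod
    def from_pound_seconds(cls, value):
        return cls(float(value) * LBF_S_TO_N_S)


def flight_software_accepts(impulse):
    """Receiving side: rejects bare numbers so the unit can never be guessed."""
    if not isinstance(impulse, Impulse):
        raise TypeError("expected an Impulse, not a bare number")
    return impulse.newton_seconds


ground_output = 2.5   # hypothetical figure produced in pound-seconds

# What happened in effect: the number was read as if it were already metric.
misread = Impulse.from_newton_seconds(ground_output)
# What the contract required: convert before sending.
intended = Impulse.from_pound_seconds(ground_output)

shortfall_factor = intended.newton_seconds / misread.newton_seconds
```

Reading pound-seconds as newton-seconds understates every value by a factor of about 4.45, a consistent error that accumulates over months of small trajectory corrections.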
The discrepancy between calculated and measured position, resulting in the discrepancy between desired and actual orbit, had been noticed earlier by at least two navigators, who were ignored. A meeting of trajectory software engineers, trajectory software operators (navigators), propulsion engineers, and managers was convened to consider the possibility of executing Trajectory Correction Maneuver-5 (in the schedule). Attendees of the meeting recall an agreement to conduct TCM-5, but it was ultimately not done.

What is the failure mode?
Inconsistent specs, code correctly matches the specs

LAX Grounded 2007
A single faulty piece of embedded software, on a network card, sends out faulty data on the United States Customs and Border Protection network, bringing the entire USCBP system to a halt (a bit like the Morris worm)
Nobody can be authorized to leave or enter the U.S. through Los Angeles International Airport for over eight hours.
Over 17,000 passengers stranded for the duration of the outage
DHS report

What is the failure mode?
Network card failure: firmware?
Aging systems at end of replacement lifetime

World War III 1983
Soviet early warning satellites picked up sunlight reflections off cloud tops and mistakenly interpreted them as missile launches in the United States.
Software was in place to filter out false missile detections of this very nature, but a bug in the software let the alerts through anyway.
The Soviet system instantly sent priority messages up saying that the United States had launched five ICBMs.
Protocol in such an event was to respond decisively, launching the entire Soviet nuclear arsenal before any US missile detonations could disable USSR response capability.
The duty officer for the system, Lt. Col. Stanislav Petrov, intercepted the messages and flagged them as faulty, stopping the Soviet response
Petrov said he had "a funny feeling in my gut" about the attack: if the U.S. was really attacking, they would launch more than five missiles.

What is the failure mode?
Who knows: bad spec?
More likely code that was incomplete: false positives too easy to generate

Y2K of course um 2000
Dates in code written in the early days used 2 digits to represent the year
E.g., 10 / 05 / 73
The 19 in 1973 was assumed in the code itself.
What happens when a program finds out it is 00 and thinks it is 1900?
All sorts of dire predictions: nuclear plants to fail, government records to become false (payments stop), manufacturing plants to shut down, food distribution failure, energy delivery failure, etc.
General societal chaos; preppers
Made a lot of consulting cash for retired COBOL programmers
Nothing happened. So not a failure, perhaps

What is the failure mode?
Incomplete specs, assumptions that become invalid
Poor estimation of product lifetime
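
The two-digit year assumption is small enough to show in full. The sketch below contrasts the original "always add 1900" expansion with date windowing, one of the remediation techniques widely used at the time, in which two-digit years below a pivot are treated as 20xx. The pivot value of 50 and the function names are assumptions chosen for illustration.

```python
def naive_year(two_digits):
    """Original assumption: every two-digit year belongs to the 1900s."""
    return 1900 + two_digits


def windowed_year(two_digits, pivot=50):
    """Windowing: years below the pivot are read as 20xx, the rest as 19xx."""
    century = 2000 if two_digits < pivot else 1900
    return century + two_digits


def age_in(current_2d, birth_2d, expand):
    """Age computed from two-digit years using the given expansion rule."""
    return expand(current_2d) - expand(birth_2d)


# Someone born in '73, with the clock reading '00.
broken_age = age_in(0, 73, naive_year)
fixed_age = age_in(0, 73, windowed_year)
```

Windowing only moves the problem: with a pivot of 50, the same code misreads dates again once real years reach 2050.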
Very expensive non-failure

Healthcare.gov Oct. 2013
Website to integrate data and functions from users, insurance companies, and government databases
30-some different versions for different states
Meant to be a high-traffic site, but handles only a few dozen users on the first day
No testing to speak of done
$600 million cost of development
Front end developed by a startup company
Back end developed by CGI (a Canadian company that had developed a large system for Canada that was estimated to cost $2 million and ended up costing $2 billion)
Coordination? Integration testing failure
Function testing end to end? No.
Stress testing? No.

What is the failure mode?
No testing to speak of
Two different companies working on major chunks, not interacting

Future cars? and beyond
Cars these days are computer networks: wireless networks
Can be hacked from outside and hijacked
External control of throttle, brakes, fuel mix, ignition, lights, etc.
Software bugs that allow this takeover?
Software system security will be a huge problem in society for time to come

IOT and security
Internet-enabled devices are so common, and so vulnerable, that hackers broke into a casino through its fish tank.
The tank had internet-connected sensors measuring its temperature and cleanliness.
Hackers got into the tank's sensors, then to the computer used to control them, and from there to other parts of the casino's network.
The intruders were able to copy 10 gigabytes of data to somewhere in Finland.

Devices made to share
Whether you know it or not
Our fish tanks, smart televisions, home thermostats, Fitbits, smartphones constantly gather information about us and our environment.
Programmers make sure they are data-sharing devices: your info sent all over
And Alexa: she's listening
Do you mind?

Roomba robotic vacuum cleaner
Since 2015, high-end models have created maps of their users' homes, to navigate through them more efficiently
Local data? No. IOT device
Reuters and Gizmodo reported that iRobot planned to share those maps of the layouts of people's private homes with its commercial partners

Others
The Mirai botnet (2016) took control of smart home devices, like security cameras and home routers, all over the world
Turned them into zombie machines to do DDoS attacks to take down popular websites like GitHub, Twitter, Reddit, Netflix, Airbnb
OK eye rester with unlimited data