2008-04-17 12:55:45

Examples of engineering failures

A study of the Therac-25 reveals the following design difficulties:

  • Software can fail
  • Testing can be difficult, but is indispensable
  • Reuse of software can be dangerous
  • Safety and usability are conflicting design criteria

In this case the mistake of reusing and trusting old software was made. This is typically a mistake made in .

The Therac-25

A computer-controlled machine, called the Therac-25, severely overdosed six people between 1985 and 1987. Three of them died. These accidents are among the worst in the history of medical accelerators.

This text will first discuss the background of the machine. In a following section some relevant accidents are described. Next, the text will focus on some software issues related to the accidents. In a final section some engineering lessons that can be learned from the accidents will be discussed.

Background

The Therac-25 was a radiation therapy machine produced by Atomic Energy of Canada Limited (AECL) and CGR MeV. It followed the Therac-6, capable of producing X-rays only, and the Therac-20, a dual-mode (X-rays or electrons) accelerator. Both machines were based on older CGR machines that already had histories of clinical use. The Therac-6 and Therac-20 had limited software functionality: the computer, a PDP-11, added convenience to the existing hardware. Hardware safety features from even older machines were retained.

In the mid-1970s a new "double pass" concept for electron acceleration was developed. A double pass accelerator folds the long physical mechanism that is required to accelerate electrons. This concept requires less space and is more economical to produce. It was used in the development of the Therac-25, a dual-mode linear accelerator.

To produce the two therapeutic modes (electron mode and photon/X-ray mode), a turntable rotated equipment into the beam. Correct operation of the Therac-25 was dependent on the position of the turntable.

When operating in direct electron mode, a low-powered electron beam was emitted directly from the machine, then spread to a safe concentration using scanning magnets. These scanning magnets were mounted on the turntable and rotated into position by the computer.

When operating in photon mode, the machine was designed to rotate four components into the path of the electron beam:

  • A target, which converted the electron beam into X-rays
  • A flattening filter, which spread the beam out over a larger area to produce a uniform treatment field
  • A set of movable blocks (also called a collimator), which shaped the X-ray beam
  • An X-ray ion chamber, which measured the strength of the beam.

A very high input dose was necessary to get an acceptable treatment dose out of the flattener. If the flattener was not in the correct position, the patient received a very high output dose. This is a grave hazard in dual-mode machines.

Instead of using the traditional electromechanical interlocks to ensure safety at the start of a treatment, the Therac-25 relied on software to check the turntable position. It was decided to use the computer's abilities to control and monitor the hardware. The software was not flawless, though: six people received massive overdoses, as explained in more detail in the section Accidents.

When the software detected an error the machine could shut down in two ways:

  • Treatment suspend: which required a machine reset to restart
  • Treatment pause: which required a single key command to restart the machine. The operator merely had to push the P-button to "proceed" and resume treatment, without having to reenter the treatment data. After five pauses the machine automatically suspended treatment.
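The two shutdown modes can be sketched as a small state model. This is only an illustration of the behaviour described above, not AECL's code: the class `TreatmentSession`, its method names, and the constant `MAX_PAUSES` are hypothetical.

```python
from dataclasses import dataclass

MAX_PAUSES = 5  # from the text: the fifth pause becomes a suspend


@dataclass
class TreatmentSession:
    """Hypothetical model of treatment pause and treatment suspend."""
    pauses: int = 0
    suspended: bool = False

    def on_error(self) -> str:
        """Register a detected error and report the resulting state."""
        if self.suspended:
            return "suspend"
        self.pauses += 1
        if self.pauses >= MAX_PAUSES:
            self.suspended = True
            return "suspend"
        return "pause"

    def proceed(self) -> bool:
        """Operator presses P: possible while paused, never after a suspend."""
        return not self.suspended and self.pauses > 0

    def reset(self) -> None:
        """A machine reset clears the suspend and the pause counter."""
        self.pauses = 0
        self.suspended = False
```

Note that nothing in this model asks why the machine paused: the operator can resume four times in a row without any explanation of the error, which is the usability shortcut that figured in the accidents.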

Accidents

The shutdown features described above figured in several of the accidents. For example, in Ontario in July 1985, the Therac-25 shut down after five seconds. The computer indicated a treatment pause and reported that no dose had been given, so the operator made a second attempt by pressing the P-button. This process was repeated four times, since the display read NO DOSE each time. After the fifth pause the machine went into treatment suspend. In reality the patient had been given a large overdose. After the treatment the patient complained of a burning sensation in the treatment area. It was later estimated that the patient had received between 13000 and 17000 rads. A normal single therapeutic dose is about 200 rads, and a dose of 1000 rads can be fatal if delivered to the whole body.

After investigating this accident, AECL found some weaknesses and mechanical problems but could not reproduce the malfunction that had occurred. AECL then redesigned some mechanisms and altered the software to tackle these problems. After these improvements AECL claimed that "analysis of the hazard rate of the new solution indicates an improvement over the old system by at least 5 orders of magnitude". The hazard analysis, however, did not seem to include computer failure, and more accidents followed.

Another instructive accident happened at the East Texas Cancer Center (ETCC) in March 1986. The operator had a lot of experience with the machine and could therefore enter prescription data quickly. She wanted to type "e" (for electron mode), but hit "x" (for X-ray) by mistake. To correct this, she used the "up-key" and quickly changed the letter. The other treatment parameters remained unchanged. To start treatment she hit the "B-key" (beam on). After a moment the machine shut down and showed the error message "MALFUNCTION 54", which was neither mentioned nor explained in the machine's manual. The machine went into treatment pause, which indicated a low-priority problem. The machine showed an underdose, so the operator hit the "P-button" to proceed with the treatment. Again the machine shut down with a MALFUNCTION 54 error.

The patient complained that he had felt something like an electric shock and was examined immediately. The physician, however, suspected nothing serious. In reality the patient had received an immense overdose: the actual dose was later estimated at 16500 to 25000 rads. Five months later, the patient died from complications of the overdose.

Three weeks later a similar accident occurred. The same operator noticed an error in the mode and used the "up-key" to correct it. Again the machine showed a MALFUNCTION 54 error. This patient died from the overdose three weeks after the accident.

The ETCC physicist immediately took the machine out of service after this second accident and investigated the error on his own. The operator, who remembered what she had done, worked with him. With much effort they were able to reproduce the MALFUNCTION 54 error. The key factor in reproducing it was the speed of data entry: the error occurred only if the data were entered quickly.

The same computer bug was present in the Therac-20. But because the Therac-20 had independent hardware protective circuits, this problem was just a nuisance.
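The protective effect of an independent hardware circuit can be sketched as follows. This is a simplified illustration, not a description of the actual Therac-20 circuits: the function `beam_permitted`, the mode names, and the position names are hypothetical.

```python
# Hypothetical mapping from treatment mode to the equipment that must
# physically sit in the beam path (names are illustrative only).
REQUIRED_POSITION = {"xray": "target", "electron": "magnets"}


def beam_permitted(software_mode: str, sensed_position: str, software_ok: bool) -> bool:
    """Allow the beam only if software approves AND an independent sensor agrees.

    `sensed_position` stands for a reading that does not pass through the
    software's own bookkeeping, so a software bug cannot falsify it.
    """
    interlock_ok = REQUIRED_POSITION.get(software_mode) == sensed_position
    return software_ok and interlock_ok
```

With such a check, a software state that believes everything is in order while the turntable is in the wrong position only blocks the treatment, for example `beam_permitted("xray", "magnets", True)` is `False`. The bug then becomes a nuisance rather than a hazard.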

Software

The two basic mistakes involved in the accidents are:

  • poor software engineering practices
  • building a machine that relies on software for safe operation

Only a small part of the software is studied here, yet it demonstrates the overall design flaws. First the software design is described, followed by some specific errors believed to be involved in the accidents at the East Texas Cancer Center (ETCC).

General architecture - A real-time operating system was written especially for the Therac-25 and ran on a 32K PDP-11/23. Four major components can be distinguished, as listed below.

  • Stored data
    • Calibration parameters
    • Patient treatment data
  • Scheduler
    • Controls sequencing of all non-interrupt events
    • Coordinates all concurrent processes
  • Set of critical and non-critical tasks
    • Critical tasks:
      • Treatment monitor (Treat) directs and monitors patient setup and treatment via 8 operating phases. These are called as subroutines, depending on the value of the control variable Tphase
      • Others
    • Non-critical tasks:
      • Checksum (scheduled to run periodically)
      • Calibration
  • Interrupt services
    • Clock interrupt
    • Power up
    • Others


The way multitasking is implemented makes race conditions possible, for the following reasons:

  • Concurrent access to shared memory is allowed
  • Aside from data stored in shared variables, there is no real synchronization
  • The "test" and "set" operations on shared variables are not indivisible: they are two separate operations, so another task can run between them.

These race conditions played an important role in the accidents.
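A non-atomic test-and-set can be illustrated with a deterministic toy model. Generators stand in for tasks and each `yield` marks a point where a scheduler could switch tasks. The names `task` and `run` and the shared `busy` flag are hypothetical and do not correspond to Therac-25 variables.

```python
def task(name, shared, log):
    """A task that tests a shared flag and then sets it in a separate step."""
    # Step 1: "test" the shared flag.
    free = not shared["busy"]
    yield  # the scheduler may switch to another task here
    # Step 2: "set" the flag, based on a test result that may now be stale.
    if free:
        shared["busy"] = True
        log.append(name)


def run(interleaved):
    """Run two tasks and return the names of those that claimed the flag."""
    shared = {"busy": False}
    log = []
    a = task("A", shared, log)
    b = task("B", shared, log)
    # Either each task finishes before the other starts, or the tasks are
    # switched between their test step and their set step.
    order = [a, b, a, b] if interleaved else [a, a, b, b]
    for t in order:
        next(t, None)  # advance one step; None absorbs the end of a task
    return log
```

Without interleaving only task A claims the flag. With a task switch between test and set, both tasks believe the flag was free and both proceed, which is the kind of outcome that a lock or an indivisible test-and-set instruction is meant to prevent.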

Figure 1: Tasks and subroutines responsible for the ETCC accidents (inspired by Nancy Leveson)

Software bugs for ETCC accidents - As illustrated in figure 1, a shared variable (Data Entry Complete) indicates that data entry is complete and is used for communication between the keyboard handler task and the subroutine Datent (data entry). When this variable is set, Datent changes the value of Tphase (see the pseudocode below). Datent then exits to Treat, which reschedules itself.

The keyboard handler places an encoded result of the mode and energy level into a 2-byte shared variable (MEOS). The low-order byte is used by the task, Hand, to set the turntable in the correct position. Datent uses the high-order byte to set several operating parameters.

The data entry process forces the operator to enter the mode and energy level, which the operator can edit later. But if the Data Entry Complete flag is set before the operator changes the data in MEOS, Datent will not detect the changes. The turntable, however, is set in accordance with the low-order byte of MEOS by the task Hand, and its position can therefore be inconsistent with the parameters derived from the high-order byte.
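The inconsistency between the two bytes can be sketched as below. The byte values and the packing layout are assumptions made for illustration, since the text does not give the real MEOS encoding. The helper names `pack_meos`, `high_byte`, `low_byte`, and `simulate_edit` are hypothetical.

```python
# Hypothetical mode codes; the real MEOS values are not given in the text.
XRAY, ELECTRON = 0x01, 0x02


def pack_meos(mode: int) -> int:
    """Pack a mode code into both bytes of a 16-bit word (assumed layout)."""
    return (mode << 8) | mode


def high_byte(meos: int) -> int:
    """Byte used by Datent to select operating parameters."""
    return (meos >> 8) & 0xFF


def low_byte(meos: int) -> int:
    """Byte used by Hand to position the turntable."""
    return meos & 0xFF


def simulate_edit():
    """Datent reads before the operator's edit, Hand reads after it."""
    meos = pack_meos(XRAY)            # operator first types "x"
    datent_mode = high_byte(meos)     # Datent takes its parameters now
    meos = pack_meos(ELECTRON)        # operator corrects the entry to "e"
    turntable_mode = low_byte(meos)   # Hand positions the turntable later
    return datent_mode, turntable_mode
```

In this sketch `simulate_edit()` returns X-ray parameters together with a turntable positioned for electron mode. Each reader saw a valid value, but the two values come from different moments in time.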

When the subroutine Datent is entered, it first checks whether the mode and energy level in MEOS are set. If so, the high-order byte is used to calculate a table index and fetch all the parameters. After all parameters are set, the subroutine Magnet is called to set the bending magnets. The following pseudocode shows the relevant parts of the software (taken from the paper of Nancy Leveson, p. 27):

Datent

if mode/energy specified then
  begin calculate table index
    repeat
      fetch parameter
      output parameter
      point to next parameter
    until all parameters set
    call Magnet
    if mode/energy changed then return
  end
if data entry is complete then set Tphase to 3
if data entry is not complete then
  if reset command entered then set Tphase to 0
return

Magnet

Set bending magnet flag
repeat
  Set next magnet
  call Ptime
  if mode/energy has changed then exit
until all magnets are set
return

Ptime

repeat
  if bending magnet flag is set then
    if editing taking place then
      if mode/energy has changed then exit
until hysteresis delay has expired
Clear bending magnet flag
return

It takes about 8 seconds to set the magnets. The subroutine Ptime introduces this delay. Because several magnets have to be set, Ptime is entered several times. To indicate that the bending magnets are being set, a flag is set when Magnet is entered. This flag is cleared at the end of Ptime. When the operator submits an editing request, the keyboard handler sets a shared variable that Ptime checks. If edits are present, Ptime exits to Magnet, which then exits to Datent. However, this shared variable is only checked while the bending magnet flag is set, and that flag is cleared at the end of the first execution of Ptime. Any edit made after the first pass through Ptime therefore goes unnoticed. In the ETCC accidents the edits were made within the 8 seconds but after the first pass through Ptime, so Datent never detected the changes.
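The effect of clearing the flag too early can be reproduced in a small simulation. This is a sketch under stated assumptions, not the original assembly code: time is modelled as integer ticks, and the parameters `n_magnets`, `ticks_per_delay`, and `clear_in_ptime` are hypothetical. Setting `clear_in_ptime=False` shows one plausible correction, in which the flag is cleared only after all magnets are set.

```python
def edit_detected(edit_at_tick, n_magnets=4, ticks_per_delay=2, clear_in_ptime=True):
    """Return True if an operator edit arriving at `edit_at_tick` is noticed."""
    state = {"flag": True, "tick": 0}  # bending magnet flag set on entry to Magnet

    def ptime():
        """Wait out one delay; report an edit only while the flag is set."""
        for _ in range(ticks_per_delay):
            edited = state["tick"] >= edit_at_tick
            state["tick"] += 1
            if state["flag"] and edited:
                return True            # edit seen: exit early
        if clear_in_ptime:
            state["flag"] = False      # the flaw: cleared after the first call
        return False

    for _ in range(n_magnets):         # Magnet: one Ptime call per magnet
        if ptime():
            return True
    state["flag"] = False              # alternative placement: clear at the end
    return False
```

With the defaults, an edit at tick 1 falls inside the first Ptime call and is detected, while an edit at tick 3 falls inside the second call and is missed. With `clear_in_ptime=False` the edit at tick 3 is detected as well.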

Software engineering lessons

Some lessons can be learned from these accidents.

  • Make sure that checks are (also) done by hardware
    Designers who are not specialized in software often believe that software cannot fail. This is not true! Overreliance on computer functions can lead to dangerous situations, so make sure that life-critical checks are also done by hardware. In some safety-critical systems (e.g., aircraft or nuclear power plants) the critical software parts are written as three independently developed implementations that run in parallel, and their outcomes are compared and voted upon before being used in the system.
  • Beware of inadequate software engineering practices. Therefore:
    • Designs should be kept simple.
    • Coding practices should be critically evaluated and discussed by all team members.
    • Error-detecting code should be built into the software from the beginning. The extra development time will be recovered in later testing and deployment.
    • The software should be subjected to extensive testing. Always begin with defining the tests for all parts in the software, before writing any code. Also plan tests of the complete system, since integration errors will show up, even with completely correct components, because the design could contain errors in the interaction specifications between the integrated components.
  • Be careful reusing software
    Very often designers assume that reusing software will increase safety because the software has already been screened and tested extensively. That assumption only holds under the original conditions: at a minimum, check the conditions under which the reused software was tested and compare them with the conditions of the new system. Rewriting the entire software is nevertheless an option only in exceptional cases, because every new development introduces its own errors. In general it is better to improve existing software than to rewrite everything from scratch.
  • Weigh the pros and cons of safety and usability
    Making a machine easy to use, may conflict with safety demands. Compare this situation, for example, with the safety hood on a : it ensures safety but interferes with the usability.
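The voting scheme mentioned in the first lesson can be sketched as a simple majority voter over the results of independently developed implementations. The function name `vote` and the example values are hypothetical, and real systems typically add timing checks and tolerance bands for numeric results.

```python
from collections import Counter


def vote(results):
    """Return the value produced by a strict majority of implementations.

    Raises RuntimeError when no value has a strict majority, so the caller
    can fall back to a safe state instead of using an unverified result.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among redundant implementations")
    return value
```

For example, `vote([200, 200, 17000])` returns `200`: a single faulty implementation is outvoted. If all three disagree, the voter refuses to pick a value.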

References

  • "" Nancy Leveson, University of Washington
  • "" Vasudevan Srinivasan, Gary Halada, and JQ, Department of Materials Science and Engineering, State University of New York at Stony Brook