2008-04-17 12:55:45
A study of the Therac-25 reveals the following design difficulties:
In this case the mistake of reusing and trusting old software was made. This is typically a mistake made in .
A computer-controlled machine, called the Therac-25, severely overdosed six people between 1985 and 1987. Three of them died. These accidents are among the worst in the history of medical accelerators.
This text will first discuss the background of the machine. In a following section some relevant accidents are described. Next, the text will focus on some software issues related to the accidents. In a final section some engineering lessons that can be learned from the accidents will be discussed.
The Therac-25 was a radiation therapy machine produced by Atomic Energy of Canada Limited (AECL) and CGR MeV. It succeeded the Therac-6, which could produce X-rays only, and the Therac-20, a dual-mode (X-rays or electrons) accelerator. Both earlier machines were based on older CGR machines that already had histories of clinical use. The Therac-6 and Therac-20 had limited software functionality: the computer, a PDP-11, merely added convenience to the existing hardware. Hardware safety features from even older machines were retained.
In the mid-1970s a new "double pass" concept for electron acceleration was developed. A double pass accelerator folds the long physical mechanism that is required to accelerate electrons. This design requires less space and is more economical to produce. It was used in the development of the Therac-25, a dual-mode linear accelerator.
To produce the two therapeutic modes (electron mode and photon/X-ray mode), a turntable rotated equipment into the beam. Correct operation of the Therac-25 was dependent on the position of the turntable.
When operating in direct electron mode, a low-powered electron beam was emitted directly from the machine and then spread to a safe concentration using scanning magnets. These scanning magnets were mounted on the turntable and rotated into position by the computer.
When operating in photon mode, the machine was designed to rotate four components into the path of the electron beam:
A very high input dose was necessary in order to get an acceptable treatment dose out of the flattener. If the flattener was not in the correct position, the patient received a high output dose. This is a grave hazard in any dual-mode machine.
Instead of using the traditional electromechanical interlocks to ensure safety at the start of a treatment, the Therac-25 left it to the software to check the turntable position. It was decided to use the computer's abilities to control and monitor the hardware. The software was not flawless, though: six people received massive overdoses, as described in the accident accounts below.
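The kind of check that was left to the software can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Mode`, `Turntable`, `REQUIRED_POSITION` and `beam_permitted` are hypothetical and do not come from the Therac-25 code. The sketch assumes only what the text states, namely that electron mode needs the scanning magnets in the beam path and X-ray mode needs the flattener.

```python
from enum import Enum


class Mode(Enum):
    ELECTRON = "electron"
    XRAY = "xray"


class Turntable(Enum):
    # Hypothetical labels for the equipment rotated into the beam path.
    SCANNING_MAGNETS = "scanning magnets"
    FLATTENER = "flattener"


# Which turntable position each mode requires (as described in the text).
REQUIRED_POSITION = {
    Mode.ELECTRON: Turntable.SCANNING_MAGNETS,
    Mode.XRAY: Turntable.FLATTENER,
}


def beam_permitted(mode: Mode, sensed_position: Turntable) -> bool:
    """Allow the beam only if the sensed position matches the selected mode.

    `sensed_position` is assumed to be read from the hardware at the moment
    the beam is requested, not taken from what the software believes it
    commanded earlier.
    """
    return REQUIRED_POSITION[mode] is sensed_position


print(beam_permitted(Mode.XRAY, Turntable.FLATTENER))         # True
print(beam_permitted(Mode.XRAY, Turntable.SCANNING_MAGNETS))  # False
```

A check like this is only as trustworthy as the code path that runs it, which is why earlier machines backed it with independent hardware interlocks.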
When the software detected an error the machine could shut down in two ways:
The shutdown features described above figured in several of the accidents. For example, in July 1985 in Ontario, the Therac-25 shut down after five seconds. The computer indicated a treatment pause and reported that no dose had been given, so the operator made a second attempt at treatment by pressing the P-button. This process was repeated four times, since the display read NO DOSE each time. After a fifth pause the machine went into treatment suspend. In reality a large overdose had been given to the patient. After the treatment the patient complained of a burning sensation in the treatment area. It was estimated that the patient had received between 13000 and 17000 rads. Normal single therapeutic doses are about 200 rads. Doses of 1000 rads can be fatal if delivered to the whole body.
After investigating this accident, AECL found some weaknesses and mechanical problems but could not reproduce the malfunction that had occurred. AECL then redesigned some mechanisms and altered the software to tackle these problems. After these improvements AECL claimed that "analysis of the hazard rate of the new solution indicates an improvement over the old system by at least 5 orders of magnitude". The hazard analysis, however, did not seem to include computer failure, and more accidents followed.
Another instructive accident happened at the East Texas Cancer Center (ETCC) in March 1986. The operator had a lot of experience with the machine and could therefore enter prescription data quickly. She wanted to type "e" (for electron mode), but pressed the "x" key (for X-ray) by mistake. To correct this, she used the "up-key" and quickly changed the letter. The other treatment parameters remained unchanged. To start treatment she hit the "B-key" (beam on). After a moment the machine shut down and showed the error message "MALFUNCTION 54", which was neither explained nor mentioned in the machine's manual. The machine went into treatment pause, which indicated a problem of low priority. The machine showed an underdose, so the operator hit the "P-button" to proceed with treatment. Again the machine shut down with a MALFUNCTION 54 error.
The patient complained that he had felt something like an electric shock and was immediately examined. The physician, however, suspected nothing serious. In reality the patient had received an immense overdose: doses of 16500 to 25000 rads were estimated afterwards. Five months later, the patient died from complications of the overdose.
Three weeks later a similar accident occurred. The same operator noticed an error in the mode and used the "up-key" to correct it. Again the machine showed a MALFUNCTION 54 error. This patient died from the overdose three weeks after the accident.
The ETCC physicist immediately took the machine out of service after this second accident and investigated the error on his own. The operator, who remembered what she had done, worked with him. With much effort they were able to reproduce the MALFUNCTION 54 error. The key factor in reproducing this error was the speed of data entry: if the data were entered quickly, the error occurred.
The same computer bug was present in the Therac-20. But because the Therac-20 had independent hardware protective circuits, this problem was just a nuisance.
The two basic mistakes involved in the accidents are:
Only a small part of the software is studied here, yet it is enough to demonstrate the overall design flaws. First the software design is described, and then some specific errors believed to be involved in the accidents at the East Texas Cancer Center (ETCC) are examined.
General architecture - A real-time executive was written especially for the Therac-25 and ran on a 32K PDP-11/23. Four major components can be distinguished, as listed in the following table.
| Stored data | Critical and non-critical tasks | | |
|---|---|---|---|
| | Critical tasks: | | |
| | Non-critical tasks: | | |
Race conditions are possible as a result of the way multitasking was implemented, for the following reasons:
These race conditions played an important role in the accidents.
Software bugs for ETCC accidents - As illustrated in figure 1, a shared variable (Data Entry Complete) signals that data entry is complete and is used to communicate between the keyboard handler task and the subroutine Datent (data entry). When this variable is set, the value of Tphase is changed by Datent (see the pseudocode below). Datent then exits to Treat, which reschedules itself.
The keyboard handler places an encoded result of the mode and energy level into a 2-byte shared variable (MEOS). The low-order byte is used by the task, Hand, to set the turntable in the correct position. Datent uses the high-order byte to set several operating parameters.
The data entry process forces the operator to enter the mode and energy level. The operator can edit the mode and energy level later. But if the Data Entry Complete flag is set before the operator changes the data in MEOS, Datent will not detect the changes. The turntable, however, is set in accordance with the low-order byte of MEOS by the task Hand and can therefore be inconsistent with the information in the high-order byte.
When the subroutine Datent is entered, it first checks whether the mode and energy level in MEOS are set. If so, the high-order byte is used to calculate a table index and fetch all the parameters. After all parameters are set, the subroutine Magnet is called to set the bending magnets. The following pseudocode shows the relevant parts of the software (taken from the paper of Nancy Leveson, p. 27):
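The inconsistency described above can be sketched in Python. The encoding used here is purely hypothetical: the text only says that MEOS is a 2-byte variable whose low-order byte drives the turntable (task Hand) and whose high-order byte drives the operating parameters (Datent). The mode codes `XRAY` and `ELECTRON`, the helper names, and the simplification that both bytes carry the same code are assumptions made for illustration.

```python
# Hypothetical mode codes; the real Therac-25 encoding is not given in the text.
XRAY, ELECTRON = 0x01, 0x02


def pack_meos(mode: int) -> int:
    """Store the same (hypothetical) mode code in both bytes of a 16-bit word."""
    return ((mode & 0xFF) << 8) | (mode & 0xFF)


def low_byte(meos: int) -> int:
    return meos & 0xFF


def high_byte(meos: int) -> int:
    return (meos >> 8) & 0xFF


def simulate_edit_after_completion():
    """Return (mode used for parameters, mode used for turntable)."""
    meos = pack_meos(XRAY)            # operator first types "x"
    datent_mode = high_byte(meos)     # Datent reads the high byte and sets parameters
    data_entry_complete = True        # flag is set; Datent will not look again
    meos = pack_meos(ELECTRON)        # operator quickly corrects the entry to "e"
    if not data_entry_complete:       # never true here, so the edit goes unseen
        datent_mode = high_byte(meos)
    turntable_mode = low_byte(meos)   # Hand keeps following the low byte
    return datent_mode, turntable_mode


params, turntable = simulate_edit_after_completion()
print(params == XRAY, turntable == ELECTRON)  # True True
```

The result is a machine whose beam parameters were chosen for one mode while the turntable sits in the position for the other, which is the hazardous combination the text describes.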
Datent

    if mode/energy specified then
    begin
        calculate table index
        repeat
            fetch parameter
            output parameter
            point to next parameter
        until all parameters set
        call Magnet
        if mode/energy changed then return
    end
    if data entry is complete then set Tphase to 3
    if data entry is not complete then
        if reset command entered then set Tphase to 0
    return

Magnet

    Set bending magnet flag
    repeat
        Set next magnet
        call Ptime
        if mode/energy has changed then exit
    until all magnets are set
    return

Ptime

    repeat
        if bending magnet flag is set then
            if editing taking place then
                if mode/energy has changed then exit
    until hysteresis delay has expired
    Clear bending magnet flag
    return
It takes about 8 seconds to set the magnets. The subroutine Ptime is used to introduce this delay. Ptime is entered several times, because several magnets have to be set. To indicate that the bending magnets are being set, a flag is set when Magnet is entered. This flag is cleared at the end of Ptime. When an editing request is submitted, the keyboard handler sets a shared variable which is checked by Ptime. If edits are present, Ptime exits to Magnet, which then exits to Datent. However, this shared variable is only checked while the bending magnet flag is set, and the flag is cleared after the first execution of Ptime. So any edits performed after the first pass through Ptime will not be noticed! In the ETCC accidents the edits were made within the 8 seconds but after that first pass, so Datent never detected the changes.
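The effect of clearing the flag too early can be reproduced with a small simulation. This is a sketch under simplifying assumptions, not the original assembly code: time is counted in abstract ticks, the function `run_magnet` and its parameters are hypothetical, and the `fixed=True` variant (clearing the flag only after all magnets are set) is one plausible correction, not necessarily the change AECL made.

```python
def run_magnet(edit_at_tick, magnets=3, ticks_per_delay=4, fixed=False):
    """Return True if an operator edit arriving at `edit_at_tick` is detected.

    `edit_at_tick=None` means the operator makes no edit at all.
    """
    # Magnet sets the bending magnet flag on entry.
    state = {"flag": True, "tick": 0}

    def ptime():
        # Busy-wait for one hysteresis delay, polling for edits.
        for _ in range(ticks_per_delay):
            edit_pending = (edit_at_tick is not None
                            and state["tick"] >= edit_at_tick)
            state["tick"] += 1
            # Edits are only looked at while the flag is set.
            if state["flag"] and edit_pending:
                return True
        if not fixed:
            # The flaw: the flag is cleared at the end of the FIRST pass.
            state["flag"] = False
        return False

    for _ in range(magnets):
        if ptime():
            return True          # exit early so Datent can re-read the data
    state["flag"] = False        # corrected placement: clear after all magnets
    return False


print(run_magnet(edit_at_tick=1))              # True: edit during first pass
print(run_magnet(edit_at_tick=6))              # False: edit during second pass is missed
print(run_magnet(edit_at_tick=6, fixed=True))  # True: detected once the flag stays set
```

With the default settings the first pass covers ticks 0 to 3, so an edit at tick 6 falls into the second pass, where the flawed version no longer checks for edits.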
Some lessons can be learned from these accidents.