(Repost) Failure modes and prevention - bilbo0214 - ChinaUnix blog

Introduction

Technological failure modes in embedded systems can be divided into two main groups: hardware failure modes and software failure modes. The toughest failures to prevent, however, are those caused by subtle interactions between hardware and software. Some examples of software failure modes are:

Invalid memory access: the computer's memory is smaller than the programmer expected, so during operation of the embedded system one of its programs accesses the wrong parts of memory.
Pointer errors: these are common in programming languages in which the human programmer is responsible for making sure that every pointer points to the right memory location at all times.
Resource leaks, in which programming errors lead to the computer losing control over some of the hardware resources; memory leaks are the simplest form of resource leak.
Race conditions, in which the specific relative timing of events in different components of the system leads to unexpected behaviour. Such race conditions are often hard to detect by testing alone.
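
The race-condition failure mode in the list above can be illustrated with a small deterministic simulation. This is only a sketch under stated assumptions: the two "tasks" are plain Python generators, and `increment` and `run` are hypothetical helpers, not a real RTOS or threading API. Each task performs a non-atomic read-modify-write on a shared counter, and whether an update is lost depends only on the order in which the steps are interleaved.

```python
def increment(shared):
    """One task: a read-modify-write on shared['count'], split into two steps."""
    local = shared["count"]      # step 1: read the shared value
    yield                        # a context switch may happen here
    shared["count"] = local + 1  # step 2: write back the stale value + 1


def run(schedule):
    """Run two increment tasks; `schedule` lists which task takes each step."""
    shared = {"count": 0}
    tasks = [increment(shared), increment(shared)]
    for task_id in schedule:
        next(tasks[task_id], None)  # advance that task by one step
    return shared["count"]


sequential = run([0, 0, 1, 1])   # task 0 finishes before task 1 starts
interleaved = run([0, 1, 0, 1])  # both tasks read before either writes
print(sequential, interleaved)   # 2 1
```

Only one of the two schedules loses an update, which is why a test that happens to exercise the "good" ordering will not reveal the fault.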

Some examples of hardware failure modes:

Electrical failure: short-circuiting, too high voltage/current
Mechanical failure: jamming of a valve
Temperature effects: deformation of components
Material failure: corrosion

It is important to note that these examples are only consequences, not causes! Examples of software failure causes are:

Too small memory
Noise
Shared interfaces with other systems

Examples of hardware failure causes:

Badly calibrated sensors
Choosing the wrong dimensions
Manufacturing/assembly process deficiencies

To detect failures during the design process it is important to perform different tests on the system (especially on the software). But tests are expensive, and they must provide the correct information: the value of test results depends on the quality of the test, so it is not always easy to come up with an appropriate one. Such testing is called dynamic analysis in the software world. An example of dynamic analysis on hardware is vibration and stress analysis.

These days engineers have developed static analysis for software, which is test-free: no specific tests need to be developed, because the software can be checked for flaws without executing the program. Comparable techniques that examine a design without operating the physical system can be considered examples of ‘static analysis’ on hardware.

There are a number of ways to reduce the chance that failures occur, but some failures need to be treated more urgently than others. First, one should look at the frequency with which a system fails; this is called the failure rate of the system. Ideally systems do not fail, but if a failure is very rare it is often not necessary to take steps against it.

Another aspect of a failure mode is its severity. An electrical appliance that short-circuits can be life-threatening, whereas the jamming of a valve in a vending machine is far less dangerous.

Due to the increasing capabilities and functionality of embedded systems, it is difficult to prevent, or sometimes even to detect, failure modes. One way to ensure reliability is extensive testing, as mentioned above, together with related analysis techniques. One of the problems with these techniques is that they are only used in the late stages of development. It is therefore better to design (!) quality and reliability in during the early stages of development.

Despite all the effort an engineer can put into designing a system that doesn’t fail, failures will always occur. For example, an average cell phone these days contains as much as 2 million lines of software code, and it is very likely that a fault has been introduced in at least one of those lines. Systems are also getting ever more complex: that same cell phone is expected to have as much as 10 million lines of code in 10 years. It is therefore better to make a design more robust. When the system detects that something has gone wrong, it can signal this and go into a safe mode until the user takes appropriate action. Take again the jamming of a valve in the vending machine: the machine can light all its LEDs to signal that something is wrong and cease providing soda until it is repaired.
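
The detect, signal and stop behaviour of the vending machine described above can be sketched as a tiny state machine. This is a minimal illustration, not a real controller: the class `VendingMachine`, its methods and the `Mode.SAFE` fallback state are names assumed here for the example.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    SAFE = "safe"  # assumed name for the fallback state


class VendingMachine:
    """Toy controller that falls back to a safe state on a detected fault."""

    def __init__(self):
        self.mode = Mode.NORMAL
        self.leds_on = False

    def report_valve_jam(self):
        # Fault detected: signal it to the user and stop dispensing.
        self.mode = Mode.SAFE
        self.leds_on = True

    def dispense(self):
        # Soda is only provided while the machine is operating normally.
        return self.mode is Mode.NORMAL

    def repair(self):
        # Only an explicit repair action restores normal operation.
        self.mode = Mode.NORMAL
        self.leds_on = False


machine = VendingMachine()
ok_before = machine.dispense()
machine.report_valve_jam()
ok_during = machine.dispense()
machine.repair()
ok_after = machine.dispense()
print(ok_before, ok_during, ok_after)  # True False True
```

The design choice worth noting is that the machine never leaves the fallback state on its own; it waits for the user, as the text requires.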

Failures are also to be expected when different separate systems have to work together, for instance the different robots in RoboCup. Another example of such a complex system is the group of robots built by professor James McLurkin of MIT, which have to perform the Star Wars theme tune together although every robot can only play some of the notes. They have to cooperate in order to play the entire theme correctly.

This all stresses how important it is to rule out failures in the design process. Fortunately, engineers have developed procedures to do so systematically.

Failure prevention

Safety factors

Safety factors are often used to ensure that a design will work, and to protect it against failures. But large safety factors don’t always give rise to a reliable system. Often they lead to overdesigned systems, which are more expensive and can take longer to manufacture and assemble.

Failure mode and effects analysis

In order to reduce (or, better, prevent) the chance of failure of a system, engineers have developed a technique called “Failure Mode and Effects Analysis” (FMEA). This is a tool for identifying potential or actual points of failure in a system, product or manufacturing/assembly operation, and for choosing the proper corrective action during design. FMEA provides an analytical approach for determining which risks are of greatest concern, and therefore where action is needed to prevent a problem before it arises. The development of these specifications helps ensure that a system will meet the defined requirements.

It is also possible to identify critical or important design/process characteristics that require special controls to prevent or detect failure modes. A crucial step is anticipating what might go wrong with a product. While anticipating every failure mode is not possible, a development team should formulate as extensive a list of potential failure modes as possible. FMEA starts at the beginning of a design, and is maintained and adapted throughout the entire design process. This way it is possible to design out failures, and the FMEA also contains important information for use in future system improvements.

Using FMEA when designing

The process for conducting an FMEA is straightforward. It is developed in 3 main phases, in each of which appropriate actions need to be defined. But before starting an FMEA, it is important to do some pre-work to make sure that robustness and past history are included in the analysis. It is important to consider both intentional and unintentional uses! Unintentional uses are a form of hostile environment.

Step 1: Severity

Determine all failure modes based on the functional requirements, together with their effects. Examples of failure modes are electrical short-circuiting, corrosion and deformation. It is important to note that a failure mode in one component can lead to a failure mode in another component. Next, the ultimate effect of each failure mode needs to be considered. A failure effect is defined as the result of a failure mode on the function of the system as perceived by the user. It is therefore convenient to write these effects down in terms of what the user might see or experience. Examples of failure effects are degraded performance, noise or even injury to a user.

Each effect is given a severity number (SEV) from 1 (no danger) to 10 (critical). These numbers help an engineer to prioritize. If the severity of an effect is 9 or 10, actions are considered to change the design by eliminating the failure mode, if possible, or by protecting the user from the effect.

Step 2: Occurrence

In this step it is necessary to look at the cause of a failure and how often it occurs. Examples of causes are erroneous algorithms, excessive voltage or improper operating conditions. Each failure mode is given a probability number (OCCUR), again from 1 to 10. Actions need to be determined if the occurrence is high (meaning >4 for non-safety failure modes, and >1 when the severity number from Step 1 is 9 or 10).
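
The occurrence thresholds above can be written down as a small decision function. This is a sketch that encodes only the rule stated in the text; the function name `needs_action` and the range check are assumptions added for illustration.

```python
def needs_action(sev, occur):
    """Apply the Step 2 occurrence thresholds to one failure mode."""
    for name, value in (("sev", sev), ("occur", occur)):
        if not 1 <= value <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {value}")
    safety_related = sev >= 9               # severity 9 or 10 from Step 1
    threshold = 1 if safety_related else 4  # >1 for safety, >4 otherwise
    return occur > threshold


print(needs_action(sev=10, occur=2))  # True: safety-related, occurrence above 1
print(needs_action(sev=5, occur=4))   # False: non-safety, occurrence not above 4
```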

Step 3: Detection

When appropriate actions have been determined, it is necessary to test their effectiveness. A design verification is also needed, and the proper inspection methods need to be chosen. Each combination from the previous 2 steps receives a detection number (DETEC). This number represents the ability of planned tests and inspections to remove defects or detect failure modes.

After these 3 basic steps, Risk Priority Numbers (RPN) are calculated.

Risk Priority Numbers

RPNs do not play an important part in the choice of an action against failure modes. They serve more as threshold values in the evaluation of these actions.

After ranking the severity, occurrence and detectability, the RPN can easily be calculated by multiplying these 3 numbers:

$RPN = SEV \times OCCUR \times DETEC$

This has to be done for the entire process and/or design. Once this is done, it is easy to determine the areas of greatest concern. The failure modes that have the highest RPN should be given the highest priority for corrective action. This means it is not always the failure modes with the highest SEV numbers that should be treated first: there can also be less severe failures which occur more often and are less detectable.
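
The ranking described above can be sketched in a few lines. The failure modes and their scores below are invented purely for illustration, and the sketch assumes the common convention that a higher DETEC number means the failure is harder to detect.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    sev: int    # severity, 1-10
    occur: int  # occurrence, 1-10
    detec: int  # detection, 1-10 (assumed: higher = harder to detect)

    @property
    def rpn(self):
        # Risk Priority Number: product of the three rankings.
        return self.sev * self.occur * self.detec


# Hypothetical example data, not taken from a real analysis.
modes = [
    FailureMode("short circuit", sev=10, occur=2, detec=2),
    FailureMode("valve jams", sev=4, occur=7, detec=6),
    FailureMode("sensor drift", sev=6, occur=5, detec=3),
]

ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for m in ranked:
    print(f"{m.name}: RPN = {m.rpn}")
```

With these numbers the short circuit has the highest SEV but the lowest RPN (40), while the frequent and hard-to-detect valve jam comes first (168).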

After these values are allocated, recommended actions with targets, responsibility and dates of implementation are noted. These actions can include specific inspection, testing or quality procedures, redesign (such as selection of new components), adding more redundancy and limiting environmental stresses or operating range. Once the actions have been implemented in the design/process, the new RPN should be checked, to confirm the improvements. These tests are often put in graphs, for easy visualisation. Whenever a design or a process changes, an FMEA should be updated.

A few logical but important thoughts come to mind:

Try to eliminate the failure mode (some failures are more preventable than others)
Minimize the severity of the failure
Reduce the occurrence of the failure mode
Improve the detection (!!!)

Anticipatory Failure Determination

Like FMEA, Anticipatory Failure Determination (AFD) has the objective of identifying and preventing possible failures. The approach of AFD, however, is just the inverse of that of FMEA. Rather than searching for causes of failure modes, AFD asks developers to view the failure of interest as an intended consequence and to look for ways to make sure that this failure always happens reliably.

AFD is more suited to complex failure analysis than FMEA. FMEA relies on the identification of failures and their causes based on application experience or the personal experience of others. The problem with this approach, however, is “the denial phenomenon”: if one tries to consider what can go wrong with a functioning system, there is a tendency to resist thinking about unpleasant possibilities that might occur, unless they have actually been experienced before. By reversing the problem, AFD overcomes this “denial phenomenon” and opens up creative insights into the analysis of failures.

AFD-process

Step 1: Formulation or inversion of the problem

Instead of thinking about possible causes for a failure, an engineer should think about how to make that failure happen, and under which conditions. First these conditions need to be identified. After that one should think about the scenario that gives rise to the failure and try to localize it.

Step 2: Search for solutions or methods to produce the failure

The thought process is now shifted to finding the mechanism or means to produce the examined failure. Function analysis can be useful to identify a series of functions or actions involved in the failure scenario.

Step 3: Verify that resources are available to cause the failures

There are seven potential categories of resources: substances, field effects, space available, time, object structure, system functions and other data on the system. For each of the potential solutions to cause a failure, it is necessary to check whether the required resources are available to support this solution.
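
The resource check in Step 3 amounts to a subset test: a way of producing the failure is only plausible if everything it needs is actually present in the system. The following is a minimal sketch; the function `feasible_scenarios`, the vending machine resources and the two scenarios are hypothetical examples, with each resource tagged by one of the categories listed above.

```python
def feasible_scenarios(scenarios, available):
    """Keep only the failure scenarios whose required resources all exist."""
    return sorted(
        name
        for name, required in scenarios.items()
        if required <= available  # set subset test
    )


# Hypothetical resources present in the system, as (category, resource).
available = {
    ("substances", "spilled syrup"),
    ("time", "long idle periods"),
    ("field effects", "compressor vibration"),
}

# Hypothetical ways to "produce" a jammed valve, with what each one needs.
scenarios = {
    "valve glued shut": {
        ("substances", "spilled syrup"),
        ("time", "long idle periods"),
    },
    "valve frozen": {("field effects", "sub-zero temperature")},
}

print(feasible_scenarios(scenarios, available))  # ['valve glued shut']
```

Scenarios that survive this check are the ones worth designing against; the others lack the means to occur in this system.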

Fault tree Analysis

Fault Tree Analysis (FTA) is a third form of failure analysis, in which an undesired state of a system is analyzed using Boolean logic to combine a series of lower-level events.
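
A fault tree can be sketched as nested functions, assuming the lower-level events are combined with AND and OR gates as is usual in fault trees. The helper names (`basic_event`, `and_gate`, `or_gate`) and the vending machine tree itself are hypothetical and only serve to illustrate the idea.

```python
def basic_event(name):
    """Leaf of the tree: true when the named event has occurred."""
    return lambda occurred: name in occurred


def and_gate(*children):
    """Output event occurs only if all input events occur."""
    return lambda occurred: all(child(occurred) for child in children)


def or_gate(*children):
    """Output event occurs if at least one input event occurs."""
    return lambda occurred: any(child(occurred) for child in children)


# Hypothetical top event: "machine silently fails to deliver soda".
top_event = or_gate(
    basic_event("power failure"),
    and_gate(basic_event("valve jammed"), basic_event("jam sensor broken")),
)

print(top_event({"valve jammed"}))                       # False
print(top_event({"valve jammed", "jam sensor broken"}))  # True
```

Evaluating the tree for different sets of basic events shows which combinations are sufficient to cause the undesired state.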

Conclusion

All these approaches can be used together in the systematic analysis of failures. The study of failures is an important aspect of designing an embedded control system, as it saves time and money and helps with eventual future modifications of the system.
