User:Amber-project.eu/Sandbox

From Wikipedia, the free encyclopedia

In computing, fault injection is an empirical technique for the assessment and verification of fault-handling mechanisms. It allows system designers and researchers to study how computer systems react and behave in the presence of faults, and is widely considered important for the development of dependable computer systems.[1][2]

Computer systems are often affected by faults and contain numerous mechanisms to handle them. Fault injection is a way of testing such mechanisms, by introducing artificial faults and errors (intending to mimic real faults and errors as closely as possible) in order to activate fault-handling components.

Uses

Fault injection is used in many contexts and can serve different purposes, such as:

  • Assess the effectiveness, i.e., fault coverage, of software- and hardware-implemented fault-handling mechanisms;
  • Study error propagation and error latency in order to guide the design of fault-handling mechanisms;
  • Test the correctness of fault-handling mechanisms;
  • Measure the time it takes for a system to detect or to recover from errors;
  • Test the correctness of fault-handling protocols in distributed systems;
  • Verify failure mode assumptions for components or subsystems.
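As a rough illustration of the first use, fault coverage is commonly estimated as the fraction of injected faults that the fault-handling mechanism dealt with correctly. The sketch below assumes each experiment is recorded as a boolean outcome. The function name `estimate_coverage` and the normal-approximation confidence interval are illustrative choices, not taken from any particular tool.

```python
import math

def estimate_coverage(outcomes, z=1.96):
    """Estimate fault coverage from a list of booleans (True = fault handled).

    Returns (point_estimate, lower, upper) using a normal-approximation
    interval; z=1.96 corresponds to roughly 95% confidence.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("at least one experiment is required")
    c = sum(outcomes) / n
    half_width = z * math.sqrt(c * (1 - c) / n)
    return c, max(0.0, c - half_width), min(1.0, c + half_width)

# Hypothetical campaign: 1000 injections, 950 of them handled.
coverage, low, high = estimate_coverage([True] * 950 + [False] * 50)
```

The interval narrows as the number of injected faults grows, which is one reason campaigns typically consist of many experiments.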

Techniques

Fault injection can in principle be carried out in two ways: faults can be injected either in a real system or in a model of a system. A real system is a physical computer system, which may either be a prototype or a commercial product. System models for fault injection experiments can be built using two basic techniques: software simulation and hardware emulation. Hence, there are three main approaches to fault injection:

  • Injection of faults into real systems,
  • Software simulation-based fault injection, and
  • Hardware emulation-based fault injection.

There are several techniques that can be used for injection of faults into real systems. These can be divided into three main categories: hardware-implemented fault injection, software-implemented fault injection, and radiation-based fault injection.

Software simulation-based fault injection can be performed in simulators operating at different levels of abstraction, such as device level, gate level, functional block level, instruction set architecture (ISA) level, and system level.

Hardware emulation-based fault injection uses models of hardware circuits implemented in large field-programmable gate array (FPGA) circuits. These models can provide a highly detailed, almost perfect hardware representation of the system that is being verified or assessed.

Properties

Fault injection techniques can be compared and characterized on the basis of several different properties. The following properties are applicable to all types of fault injection techniques:

  • Controllability – ability to control the injection of faults in time and space.
  • Observability – ability to observe and record the effects of an injected fault.
  • Repeatability – ability to repeat a fault injection experiment and obtain the same result.
  • Reproducibility – ability to reproduce the results of a fault injection campaign.
  • Reachability – ability to reach possible fault locations inside an integrated circuit, or within a program.
  • Fault representativeness – how accurately the faultload represents real faults.
  • Workload representativeness – how accurately the workload represents real system usage.
  • System representativeness – how accurately the target system represents the real system.

Hardware-implemented fault injection

Hardware-implemented fault injection includes three techniques: pin-level fault injection, power supply disturbances, and test port-based fault injection.

In pin-level fault injection, faults are injected via probes connected to electrical contacts of integrated circuits or discrete hardware components. This method was already used in the 1950s for generating fault dictionaries for system diagnosis. Many experiments and studies using pin-level fault injection were carried out during the 1980s and early 1990s. Several pin-level fault injection tools were developed at that time, for example, MESSALINE[3] and RIFLE[4]. A key feature of these tools was that they supported fully automated fault injection campaigns. The increasing level of integration of electronic circuits has rendered the pin-level technique obsolete as a general method for evaluating fault-handling mechanisms in computer systems. The method is, however, still valid for assessment of systems where faults in electrical connectors pose a major problem, such as automotive and industrial embedded systems.

Power supply disturbances (PSDs) are rarely used for fault injection because of low repeatability. They have been used mainly as a complement to other fault injection techniques in the assessment of error detection mechanisms for small microprocessors.[5][6][7] The impact of PSDs is usually much more severe than the impact of other commonly used injection techniques, e.g., those that inject single bit-flips, since PSDs tend to affect many bits and thereby a larger part of the system state. Interestingly, some error detection mechanisms show lower fault coverage for PSDs than for single bit-flip errors.

Test port-based fault injection encompasses techniques that use test ports to inject faults in microprocessors. Many modern microprocessors are equipped with built-in debugging and testing features, which can be accessed through special I/O-ports, known as test access ports (TAPs), or just test ports. Test ports are defined by standards such as the IEEE-ISTO 5001-2003 (Nexus) standard[8] for real-time debugging, the IEEE 1149.1 Standard Test Access Port and Boundary-Scan Architecture (JTAG)[9], and the Background Debug Mode (BDM) facility. Nexus and JTAG are standardized solutions used by several semiconductor manufacturers, while BDM is a proprietary solution for debugging developed by Freescale. Tools for test port-based fault injection are usually implemented on top of an existing commercial microprocessor debug tool, since such tools contain all functions and drivers that are needed to access a test port.

The types of faults that can be injected via a test port depend on the debugging and testing features supported by the target microprocessor. Normally, faults can be injected in all registers in the instruction set architecture (ISA) of the microprocessor. BDM and Nexus also allow injection of faults in main memory. Test ports could also be used to access hardware structures in the microarchitecture that are invisible to the programmer. However, information on how to access such hardware structures is usually not disclosed by manufacturers of microprocessors. Tools that support test port-based fault injection include GOOFI[10] and INERTE[11]. GOOFI supports both JTAG-based and Nexus-based fault injection, while INERTE is specifically designed for Nexus-based fault injection. There is also an environment for BDM-based fault injection.[12]

Injecting a fault via a test port involves four major steps: i) setting a breakpoint via the test port and waiting for the program to reach the breakpoint, ii) reading the value of the target location (a register or memory word) via the test port, iii) manipulating this value and then writing the new, faulty value back to the target location, and iv) resuming the program execution via a command sent to the test port. The time overhead for injecting a fault depends on the speed of the test port. JTAG and BDM are low-speed ports, whereas Nexus ports can be of four different classes with different speeds. The simplest Nexus port (Class 1) is a JTAG port, which uses serial communication and therefore only needs 4 pins. Ports compliant with Nexus Class 2, 3 or 4 use separate input and output ports, known as auxiliary ports. These are parallel ports that use several pins for data transfer. The actual number of data pins is not fixed by the Nexus standard, but for Class 3 and 4 ports the standard recommends 4 to 16 data pins for the auxiliary output port and 1 to 4 data pins for the auxiliary input port.
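The four injection steps can be sketched against a simulated port. The class `SimulatedTestPort` and its methods are hypothetical stand-ins invented for this illustration; they do not correspond to any real JTAG, Nexus or BDM debugger interface, and a real tool would issue the equivalent commands through a commercial debug tool.

```python
class SimulatedTestPort:
    """Hypothetical stand-in for a debug port; not a real JTAG/Nexus/BDM API."""

    def __init__(self, registers):
        self.registers = dict(registers)
        self.breakpoint = None
        self.running = True

    def set_breakpoint(self, address):
        self.breakpoint = address

    def wait_for_halt(self):
        # A real port would block until the target reaches the breakpoint.
        self.running = False

    def read(self, location):
        return self.registers[location]

    def write(self, location, value):
        self.registers[location] = value

    def resume(self):
        self.running = True


def inject_bit_flip(port, breakpoint_address, location, bit, width=32):
    """Perform the four steps: break, read, corrupt and write back, resume."""
    port.set_breakpoint(breakpoint_address)                 # step i
    port.wait_for_halt()
    original = port.read(location)                          # step ii
    faulty = (original ^ (1 << bit)) & ((1 << width) - 1)   # step iii
    port.write(location, faulty)
    port.resume()                                           # step iv
    return original, faulty


port = SimulatedTestPort({"r3": 0x000000FF})
before, after = inject_bit_flip(port, breakpoint_address=0x4000, location="r3", bit=8)
```

Each step corresponds to at least one transaction on the port, so the port speed directly bounds how quickly a campaign can proceed.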

The main advantage of test port-based fault injection is that faults can be injected internally in microprocessors without making any alterations to the system’s hardware or software. Compared to software-implemented fault injection, it provides better or equal capabilities of emulating real hardware faults. Finally, advanced Nexus ports (Class 3 and 4) provide outstanding possibilities for data collection and observing the impact of injected faults within a microprocessor. Existing tools have not fully exploited these possibilities. Hence, microprocessors with high-speed Nexus ports constitute interesting targets for the development of new fault injection tools, which potentially can achieve much better observability than existing tools do.

Software-implemented fault injection of hardware faults

Software-implemented fault injection encompasses techniques that inject faults through software executed on the target system.

There are basically two approaches that we can use to emulate hardware faults by software: run-time injection and pre run-time injection. In run-time injection, faults are injected while the target system executes a workload. This requires a mechanism that i) stops the execution of the workload, ii) invokes a fault injection routine, and iii) restarts the workload. Thus, run-time injection incurs a significant run-time overhead. In pre run-time injection, faults are introduced by manipulating either the source code or the binary image of the workload before it is loaded into memory. Pre run-time injection usually incurs less run-time overhead than run-time injection, but the total time for conducting a fault injection campaign is usually longer for pre run-time injection since it needs more time for preparing each fault injection experiment.

There are several fault injection tools that can emulate the effects of hardware faults by software, but they use different techniques for injecting faults and support different fault models. Most of these tools use run-time injection, since it provides better opportunities for emulating hardware faults than pre run-time injection. Software-implemented fault injection relies on the assumption that the effects of real hardware faults can be emulated either by manipulating the state of the target system via run-time injection, or by modifying the target workload through pre run-time injection.

The validity of this approach varies depending on the fault type and where the fault occurs. Consider for example emulation of a soft error, i.e., a bit-flip error induced by a strike of a high energy particle. Flipping bits in main memory or processor registers can easily be done by software. On the other hand, the effect of a bit-flip in a processor’s internal control logic can be difficult, if not impossible, to emulate accurately by software manipulations.

Emulating a permanent hardware fault requires a more elaborate set of manipulations than emulating a transient fault. For example, the emulation of a stuck-at fault in a memory word or a processor register would require a sequence of manipulations performed every time the designated word or register is read by a machine instruction. On the other hand, a transient fault requires only a single manipulation. The time overhead imposed by fault emulation thus varies for different fault types.
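The difference in overhead can be illustrated with a toy register model. The class `FaultyRegister` is a hypothetical construct for this sketch only: a transient bit-flip is applied once, whereas a stuck-at fault is re-applied on every read, and the counter `manipulations` tracks how often the emulation has to intervene.

```python
class FaultyRegister:
    """Toy register model used only to contrast the two fault types."""

    def __init__(self, value, width=8):
        self.mask = (1 << width) - 1
        self.value = value & self.mask
        self.stuck_at_one = 0    # bits forced to 1 on every read
        self.stuck_at_zero = 0   # bits forced to 0 on every read
        self.manipulations = 0   # counts emulation interventions

    def inject_transient(self, bit):
        # A transient fault needs one manipulation; a later write clears it.
        self.value ^= (1 << bit)
        self.manipulations += 1

    def inject_stuck_at(self, bit, level):
        if level:
            self.stuck_at_one |= (1 << bit)
        else:
            self.stuck_at_zero |= (1 << bit)

    def write(self, value):
        self.value = value & self.mask

    def read(self):
        # A permanent fault must be re-applied each time the value is read.
        if self.stuck_at_one or self.stuck_at_zero:
            self.manipulations += 1
        return (self.value | self.stuck_at_one) & ~self.stuck_at_zero & self.mask


transient = FaultyRegister(0b00000001)
transient.inject_transient(0)
transient.write(0b00000001)        # overwriting removes the transient error

permanent = FaultyRegister(0b00000001)
permanent.inject_stuck_at(0, level=0)
permanent.write(0b00000001)        # the stuck-at fault survives the write
reads = [permanent.read() for _ in range(3)]
```

In this model the transient fault costs a single intervention, while the stuck-at fault costs one per read for as long as the experiment runs.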

We here describe seven tools that are capable of emulating hardware faults through software. These tools represent important steps in the development of software-implemented fault injection for emulation of hardware faults. The tools are FIAT[13], FERRARI[14], FINE[15], DEFINE[16], FTAPE[17], DOCTOR[18], and Xception[19]. These tools use different approaches to emulating hardware faults and implement partly different fault models. Some of the tools also provide support for emulating software faults.

One of the first tools that used software to emulate hardware faults was FIAT, developed at Carnegie Mellon University. FIAT injected faults by corrupting either the code or the data area of a program’s memory image during run-time. Three fault types were supported: zero-a-byte, set-a-byte and two-bit compensation. The last fault type involved complementing any 2 bits in a 32-bit word. Injection of single-bit errors was not considered, because the memory of the target system was protected by parity. More advanced techniques for emulation of hardware faults were included in FERRARI, developed at the University of Texas, and in FINE, developed at the University of Illinois. Both of these tools supported emulation of transient and permanent hardware faults in systems based on SPARC processors from Sun Microsystems. FERRARI could emulate three types of faults: address line, data line, and condition code faults, while FINE emulated faults in main memory, CPU registers and the memory bus. DEFINE, which was an extension of FINE, supported fault injection in distributed systems and introduced two new fault models for intermittent faults and communication faults.
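The three FIAT fault types can be sketched as operations on a byte array that stands in for a memory image. This is not FIAT's actual code: the function names are hypothetical, set-a-byte is assumed here to mean setting all bits of the byte to one, and the 32-bit word is assumed to be stored little-endian.

```python
def zero_a_byte(image: bytearray, offset: int) -> None:
    """Clear all bits of one byte."""
    image[offset] = 0x00

def set_a_byte(image: bytearray, offset: int) -> None:
    """Set all bits of one byte (assumed meaning of set-a-byte)."""
    image[offset] = 0xFF

def two_bit_compensation(image: bytearray, word_offset: int,
                         bit_a: int, bit_b: int) -> None:
    """Complement two distinct bits of a 32-bit little-endian word."""
    if bit_a == bit_b or not (0 <= bit_a < 32 and 0 <= bit_b < 32):
        raise ValueError("need two distinct bit positions in 0..31")
    word = int.from_bytes(image[word_offset:word_offset + 4], "little")
    word ^= (1 << bit_a) | (1 << bit_b)
    image[word_offset:word_offset + 4] = word.to_bytes(4, "little")

def word_parity(image: bytearray, word_offset: int) -> int:
    """Return the parity (0 or 1) of the number of set bits in a 32-bit word."""
    word = int.from_bytes(image[word_offset:word_offset + 4], "little")
    return bin(word).count("1") % 2

memory = bytearray([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])
parity_before = word_parity(memory, 4)
zero_a_byte(memory, 0)
set_a_byte(memory, 1)
two_bit_compensation(memory, 4, 0, 31)
parity_after = word_parity(memory, 4)
```

Complementing two bits leaves the parity of the word unchanged, so under the assumption of one parity bit per word such an error would pass a parity check, whereas a single-bit error would be caught by it.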

DOCTOR is a fault injection tool developed at the University of Michigan targeting distributed real-time systems. It supports three fault types: memory faults, CPU faults and communication faults. The memory faults can affect a single-bit, two bits, one byte, and multiple bytes. The target bit(s)/byte(s) can be set, reset and toggled. The CPU faults emulate faults in processor registers, the op-code decoding unit, and the arithmetic logic unit. The communication faults can cause messages to be lost, altered, duplicated or delayed. DOCTOR can inject transient, intermittent and permanent faults, and uses run-time injection for the transient and intermittent faults. Permanent faults are emulated using pre run-time injection.

FTAPE is a fault injector aimed at benchmarking of fault-tolerant commercial systems. It was used to assess and test several prototypes of fault-tolerant computers for on-line transaction processing. FTAPE emulates the effects of hardware faults in the CPU, main memory and I/O units. The CPU faults include single and multiple bit-flips and zero/set faults in CPU registers. The memory faults include single and multiple bit and zero/set faults in main memory. The I/O faults include SCSI and disk faults. FTAPE was developed at the University of Illinois in cooperation with Tandem Computers.

Xception is a fault injection tool developed at the University of Coimbra. This tool uses the debugging and performance monitoring features available in advanced microprocessors to inject faults. Thus it injects faults in a way which is similar to test port-based fault injection. The difference is that Xception controls the setting of breakpoints and performs the fault injections via software executed on the target processor rather than sending commands to a test port.

Xception injects faults through exception handlers executing in kernel mode, which can be triggered by the following events: op-code fetch from a specified address, operand load from a specified address, operand store to a specified address, and a specified time passed since start-up. These triggers can be used to inject both permanent and transient faults. Xception can emulate hardware faults in various functional units of the processor, such as the integer unit, floating point unit and the address bus. It can also emulate memory faults, including stuck-at-zero, stuck-at-one and bit-flip faults. Xception is unique because it is the only tool mentioned in this section that has been developed into a commercial tool. The Xception tool is still sold by Critical Software, Coimbra, which released the first commercial version of the tool in 1999.

Radiation-based fault injection

Modern electronic integrated circuits and systems are sensitive to various forms of external disturbances such as electromagnetic interference and particle radiation. One way of validating a fault-tolerant system is thus to expose the system to such disturbances.

Although computer systems are often used in environments where they can be subjected to electromagnetic interference (EMI), it is not common to use such disturbances to validate fault tolerance mechanisms. The main reason for this is that EMI injections are difficult to control and repeat. EMI has been used along with three other fault injection techniques to evaluate error detection mechanisms in a computer node in a distributed real-time system.[20] A primary goal of this study was to compare the impact of pin-level fault injection, EMI, heavy-ion radiation and software-implemented fault injection. The study showed that the EMI injections tended to “favour” one particular error detection mechanism. For some of the fault injection campaigns almost all faults were detected by one specific CPU-implemented error detection mechanism, namely spurious interrupt detection. This illustrates the difficulty in using EMI as a fault injection method.

A growing reliability concern for computer systems is the increasing susceptibility of integrated circuits to soft errors, i.e., bit-flips caused when highly ionizing particles hit sensitive regions within a circuit. Soft errors have been a concern for electronics used in space applications since the 1970s. In space, soft errors are caused by cosmic rays, i.e., highly energetic heavy-ion particles. Heavy ions are not a direct threat to electronics at ground level and airplane flight altitudes, because they are absorbed when they interact with Earth’s atmosphere. However, recent circuit technology generations have become increasingly sensitive to high energy neutrons, which are generated in the upper atmosphere when cosmic rays interact with the atmospheric gases. Such high energy neutrons are a major source of soft errors in ground-based and aviation applications using modern integrated circuits. All modern microprocessors manufactured in technologies with feature sizes below 90 nm are therefore equipped with fault tolerance mechanisms to cope with soft errors. To assess the efficiency of such fault tolerance mechanisms, semiconductor manufacturers are now regularly testing their circuits by exposing them to ionizing particles. In such tests, it is common to use proton radiation produced by a particle accelerator. The IBM POWER 6 processor has been subjected to proton testing.[21]

The sensitivity of integrated circuits to heavy-ion radiation can be exploited for assessing the efficiency of fault-handling mechanisms. This fact led to several fault injection experiments conducted by exposing circuits to heavy-ion radiation from a Californium-252 source.[22][23]

Radiation-based fault injection has very low, or non-existent, repeatability. Due to low controllability, it is not possible to precisely synchronize the activity of the target system with the time and the location of an injection in radiation-based fault injection. Thus it is not possible to repeat an individual experiment. However, the ability to statistically reproduce results over many fault injection campaigns is usually high in particle radiation experiments. Both repeatability and reproducibility are low for EMI-based fault injection.

Simulation-based fault injection

Simulation-based fault injection can be performed at different levels of abstraction, such as the device level, logical level, functional block level, instruction set architecture (ISA) level, and system level. Simulation models at different abstraction layers are often combined in so-called mixed-mode simulations to overcome limitations imposed by the time overhead incurred by detailed simulations.

FOCUS [24] is an example of a simulation environment that combines device-level and gate-level simulation for fault sensitivity analysis of circuit designs with respect to soft errors. At the logic level and the functional block level, circuits are usually described in a hardware description language (HDL) such as VHDL or Verilog. Several tools have been developed that support automated fault injection experiments with HDL models, e.g., MEFISTO[25] and the tool described by Delong[26].
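The idea behind logic-level fault injection can be shown with a very small model. The sketch below is written in plain Python rather than an HDL and is unrelated to the tools named above. It evaluates a gate-level full adder, optionally forces one internal net to a fixed value (a stuck-at fault), and lists the input vectors for which the fault becomes visible at the outputs. The net names are arbitrary.

```python
from itertools import product

def full_adder(a, b, cin, stuck=None):
    """Evaluate a gate-level full adder; `stuck` is an optional (net, value) pair."""
    def net(name, value):
        # Override the computed value if this net is the fault location.
        if stuck is not None and stuck[0] == name:
            return stuck[1]
        return value

    x1 = net("x1", a ^ b)
    a1 = net("a1", a & b)
    a2 = net("a2", x1 & cin)
    s = net("sum", x1 ^ cin)
    cout = net("cout", a1 | a2)
    return s, cout

def detecting_inputs(stuck):
    """Return the input vectors whose outputs differ from the fault-free circuit."""
    return [v for v in product((0, 1), repeat=3)
            if full_adder(*v, stuck=stuck) != full_adder(*v)]

vectors = detecting_inputs(("a1", 0))   # stuck-at-0 on the (a AND b) net
```

Because the fault location and the input vector are chosen explicitly, every experiment in such a model can be repeated exactly, which reflects the high controllability and repeatability of simulation-based approaches.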

Several studies aimed at assessing the soft error vulnerability of complex high-performance processors have been conducted using simulation-based fault injection. Wang proposed a novel low-cost approach for tolerating soft errors in the execution core of a high-performance processor, which was evaluated by combining simulations in a detailed Verilog model with an ISA-level simulator [27]. This approach allowed the authors to study the impact of soft errors for seven SPEC2000 integer benchmarks through simulation.

DEPEND [28] is a tool for simulation-based fault injection at the functional level aimed at evaluating architectures of fault-tolerant computers. A simulation model in DEPEND consists of a number of interconnected modules, or components, such as CPUs, communication channels, disks, software systems, and memory. DEPEND is intended for validating system architectures in early design phases and serves as a complement to probabilistic modelling techniques such as Markov and Petri net models. DEPEND provides the user with predefined components and fault models, but also allows the user to create new components and new fault models, e.g., the user can use any probability distribution for the time to failure for a component.
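The notion of attaching an arbitrary time-to-failure distribution to each component can be sketched with a small Monte Carlo simulation. This is not DEPEND's interface: the component names, the distribution parameters and the function `simulate_first_failure` are all assumptions made for illustration.

```python
import random

def simulate_first_failure(components, rng):
    """Sample a time to failure per component; return (time, name) of the first failure."""
    samples = {name: sampler(rng) for name, sampler in components.items()}
    name = min(samples, key=samples.get)
    return samples[name], name

# Each component carries its own sampler, so any distribution can be plugged in.
components = {
    "cpu": lambda rng: rng.expovariate(1 / 10_000.0),       # mean of 10,000 hours
    "disk": lambda rng: rng.weibullvariate(5_000.0, 1.5),   # scale, shape
    "channel": lambda rng: rng.expovariate(1 / 20_000.0),   # mean of 20,000 hours
}

rng = random.Random(42)   # fixed seed so that runs can be repeated
runs = [simulate_first_failure(components, rng) for _ in range(10_000)]
mean_time_to_first_failure = sum(t for t, _ in runs) / len(runs)
```

Swapping a sampler for another distribution changes the fault model without touching the rest of the simulation, which is the kind of flexibility the paragraph above describes.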

Hardware emulation-based fault injection

The advent of large field-programmable gate array (FPGA) circuits has provided new opportunities for conducting model-based fault injection with hardware circuits. Circuits designed in a hardware description language (HDL) are usually tested and verified using software simulation. Even if a powerful computer is used in such simulations, it may take considerable time to verify and test a complex circuit adequately. To speed up the test and verification process, techniques have been developed where HDL designs are tested by hardware emulation in a large FPGA circuit. This technique also provides excellent opportunities for conducting fault injection experiments. Hardware emulation-based fault injection has all the advantages of simulation-based fault injection, such as high controllability and high repeatability, but requires less time for conducting a fault injection experiment than software simulation. The use of hardware emulation for studying the impact of faults was first proposed by Kwang-Ting [29] for fault simulation and for assessing the fault coverage of test patterns used in production testing.

Fault injection can be performed in hardware emulation models through compile-time reconfiguration and run-time reconfiguration. Here reconfiguration refers to the process of adding hardware structures to the model which are necessary to perform the experiments. In compile-time reconfiguration, these hardware structures are added by instrumentation of the HDL models. An approach for compile-time instrumentation for injection of single event upsets (soft errors) is described by Civera[30]. This work presents different instrumentation techniques that allow injection of transient faults in sequential memory elements as well as in microprocessor-based systems.

One disadvantage of compile-time reconfiguration is that the circuit must be re-synthesized for each reconfiguration, which can impose a severe overhead on the time it takes to conduct a fault injection campaign. One technique that avoids re-synthesizing the target circuit, and saves time through run-time reconfiguration, consists of directly modifying the bit-stream that is used to program the FPGA circuit[31]. By exploiting partial reconfiguration capabilities available in some FPGA circuits, this technique achieved substantial time savings compared to other emulation-based approaches to fault injection. FADES[32] is another tool for conducting hardware emulation-based fault injection. This tool uses run-time reconfiguration and can inject several different types of transient faults, including bit-flips, pulse, and delay faults, as well as faults that cause digital signals to assume voltage levels between “1” and “0”.

Hybrid approaches for injecting hardware faults

Hybrid approaches to fault injection combine several fault injection techniques to improve the accuracy and scope of the verification, or the assessment, of a target system.

In one approach for combining software-implemented emulation of hardware faults and simulation-based fault injection, the physical target is run until the program execution hits a fault injection trigger, which causes the physical system to halt. The architected state of the physical system is then transferred to the simulation model, in which a fault is injected, e.g., in the non-visible parts of the microarchitecture. The simulator is run until the effects of the fault have stabilized in the architected state of the simulated processor. This state is then transferred back to the physical system, which subsequently is restarted so that the system-level effects of the fault can be determined. The hardware fault injector can inject logic-0/logic-1 faults into the memory bus lines of a SPARC 1 based workstation. The authors used the hardware fault injector to study the sensitivity of the computer in different operational modes. The results showed that the system was more likely to crash from bus faults when the processor operated in kernel mode, compared to when it operated in user mode.

NFTAPE[33], developed at the University of Illinois, is a more recent tool that supports the use of different fault injection techniques. This tool is aimed at injecting faults in distributed systems using a technique called LightWeight Fault Injectors (LWFI). The purpose of the LWFI is to separate the implementation of the fault injector from the rest of the tool. NFTAPE provides a standardized interface for the LWFIs, which simplifies the integration and use of different types of fault injectors. NFTAPE has been used with several types of fault injectors using hardware-implemented, software-implemented, and simulation-based fault injection.

Techniques for injecting or emulating software faults

Software faults are currently the dominating source of computer system failures. Making computer systems resilient to software faults is therefore highly desirable in many application domains. Much effort has been invested by both academia and industry in the development of techniques that can tolerate and handle software faults. In this context, fault injection plays an important role in assessing the efficiency of these techniques. Hence, several attempts have been made to develop fault injection techniques that can accurately imitate the impact of real software faults.

The current state-of-the-art techniques in this area rely exclusively on software-implemented fault injection. There are two fundamental approaches to injecting software faults into a computer system: fault injection and error injection[34]. Fault injection imitates mistakes of programmers by changing the code executed by the target system, while error injection attempts to emulate the consequences of software faults by manipulating the state of the target system.

Regardless of the injection technique, the main challenge is to find fault sets or error sets that are representative of real software faults. Other important challenges include the development of methods that allow software faults to be injected without access to the source code, and techniques for reducing the time it takes to perform an injection campaign.

Emulating software faults by error injection

There are two common techniques for emulating software faults by error injection: program state manipulation and parameter corruption. Program state manipulation involves changing variables, pointers and other data stored in main memory or CPU registers. Parameter corruption corresponds to modifying parameters of functions, procedures and system calls. The latter is also known as API parameter corruption and falls under the category of robustness testing. Techniques for emulating software faults by program state manipulation are presented next.
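Before turning to program state manipulation, parameter corruption can be illustrated with a minimal sketch. The decorator `corrupt_parameter` and the target function `read_block` are hypothetical examples: the wrapper replaces one positional argument with a corrupted value and the caller observes whether the target rejects it.

```python
import functools

def corrupt_parameter(index, corruption):
    """Decorator that replaces positional argument `index` with corruption(original)."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            mutated = list(args)
            mutated[index] = corruption(mutated[index])
            return func(*mutated, **kwargs)
        return wrapper
    return decorate

def read_block(buffer, length):
    """Example target: returns the first `length` bytes, rejecting bad lengths."""
    if not isinstance(length, int) or length < 0 or length > len(buffer):
        raise ValueError("invalid length")
    return buffer[:length]

# Inject a boundary-violating value into the second parameter.
faulty_read = corrupt_parameter(1, lambda n: -n - 1)(read_block)

try:
    faulty_read(b"abcdef", 3)
    outcome = "not detected"
except ValueError:
    outcome = "detected"
```

The same wrapper can be reused with other corruption functions, for example ones that substitute extreme values or values of the wrong type.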

Many of the tools described in conjunction with emulation of hardware faults through software-implemented fault injection, e.g., FIAT[35], FERRARI[36], FTAPE [37], DOCTOR[38] and Xception[39], can potentially be used to emulate software faults since they are designed to manipulate the system state. However, none of these tools provide explicit support for defining errors that can emulate software faults and the representativeness of the injected faults may be questioned.

Christmansson[40] proposed an approach for generating representative error sets that emulate real software faults, based on a study of software faults encountered in one release of a large IBM operating system product. That study addressed four important questions related to emulation of software faults by error injection: what error model(s) should be used; where should errors be injected; when should errors be injected; and how should a representative operational profile (workload) be designed. This work shows the feasibility of generating representative error sets when data on software faults is available.

The same authors presented another study comparing fault injection with error injection[41]. Fault and error injection experiments were carried out on a safety-critical real-time control application. A total of 200 assignment, checking and interface faults were injected by mutating the source code, which was written in C. The failure symptoms produced by these faults were compared with failure symptoms produced by bit-flip errors injected in processor registers, and in the data and stack areas of the main memory. A total of 675 errors were injected. A comparison of the failure distributions was made for eight different workload activations (test cases). The study concluded that, when fault injection was used, the choice of test case caused greater variations in the distribution of the failure symptoms than the choice of fault type. On the other hand, for error injection the choice of error type caused greater variations in the failure distribution than the choice of test case. There were also significant differences between the failure distributions obtained with fault injection and with error injection. The authors claim that these differences occurred because a time-based trigger was used to control the error injections. They also claim that the fault types considered could be emulated more or less perfectly by using a breakpoint-based trigger, although no experimental evidence is presented to support this claim. This study points out that it may be difficult to find error sets that emulate software faults accurately, and that the selection of the test case (workload activation) is as important as the selection of the fault/error model for the outcome of an injection campaign.

Techniques for injection of software faults

An obvious way to inject software faults into a system is to manipulate the source code, object code or machine code of the target. Such manipulations are known as mutations. Mutations have been used extensively in the area of program testing as a method for evaluating the effectiveness of different test methods. They have also been used for the assessment of fault-handling mechanisms.
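A source-level mutation can be sketched with Python's standard `ast` module. The example is illustrative only and does not reproduce any of the tools discussed in this section: the class `RelationalMutator` and the target function `in_bounds` are hypothetical, and the mutation replaces the first `<` comparison with `<=` to imitate a boundary-check mistake.

```python
import ast

class RelationalMutator(ast.NodeTransformer):
    """Replace the first `<` comparison with `<=` (a boundary-check fault)."""

    def __init__(self):
        self.done = False

    def visit_Compare(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.ops[0], ast.Lt):
            node.ops[0] = ast.LtE()
            self.done = True
        return node

SOURCE = """
def in_bounds(index, size):
    return 0 <= index and index < size
"""

def load(tree_or_source):
    """Compile source text or an AST and return the in_bounds function."""
    namespace = {}
    exec(compile(tree_or_source, "<target>", "exec"), namespace)
    return namespace["in_bounds"]

original = load(SOURCE)
mutant_tree = ast.fix_missing_locations(RelationalMutator().visit(ast.parse(SOURCE)))
mutant = load(mutant_tree)
```

The mutant behaves like the original for most inputs and differs only at the boundary (`index == size`), which is typical of the subtle faults that fault-handling mechanisms and test suites are expected to catch.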

Ng [42] [43] injected software faults in an operating system through simple mutation of the object code. The primary goal of the fault models used in these studies was to generate a wide variety of operating system crashes, rather than achieving a high degree of representativeness with respect to real software faults.

FINE [44] and DEFINE [45] were among the first tools that supported emulation of software faults by mutations. The mutation technique used by these tools requires access to assembly language listings of the target program. FINE and DEFINE emulate four types of software faults: initialization, assignment, condition check, and function faults. These fault models were defined based on experience collected from studies of field failure data.

The Generic Software Fault Injection Technique (G-SWFIT) [46] emulates software faults by mutations at the machine-code level. This technique analyses the machine code to identify locations that correspond to high-level language constructs that often result in faults. The main advantage of G-SWFIT is that software faults can be emulated without access to the source code. The G-SWFIT fault injection process is guided by a set of operators representing realistic software faults [47]. These operators were derived from a field failure data study of more than 600 real software faults. These two works jointly represent a unique contribution, since they provide the first fault injection environment that can inject software faults which have been proven to be representative of real software faults. These works also constitute the foundation of a methodology for definition of faultloads based on software faults for dependability benchmarking [48].

Techniques for testing protocols for fault-tolerant distributed systems

Several fault injection tools and frameworks have been developed for testing of fault-handling protocols in distributed systems. The aim of this type of testing is to reveal design and implementation flaws in the tested protocol. The tests are performed by manipulating the content and/or the delivery of messages sent between nodes in the target system. This is referred to as message-based fault injection. It resembles robustness testing in the sense that the faults are injected into the inputs of the target system.

A careful definition of the failure mode assumptions is crucial in the design of distributed fault-handling protocols. The failure mode assumptions provide a model of how faults in different subsystems (computing nodes, communication interfaces, and networks) affect a distributed system. A failure mode thus describes the impact of subsystem failures in a distributed system. Commonly assumed failure modes include Byzantine failures, timing failures, omission failures, crash failures, fail-stop failures and fail-signal failures. At the system-level, these subsystem failures correspond to faults. Hence, tools for message-based fault injection intend to inject faults that correspond to different subsystem failure modes.

The experimental environment for fault tolerance algorithms (EFA)[49] is an early example of a fault injector for message-based fault injection. The EFA environment provides a fault injection language that the protocol tester uses to specify the test cases.

The tool inserts fault injectors in each node of the target system and can implement several different fault types, including message omissions, sending a message several times, generating spontaneous messages, changing the timing of messages, and corrupting the contents of messages. A similar environment is provided by the DOCTOR tool [50], which can cause messages to be lost, altered, duplicated or delayed.
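
The message fault types listed above can be illustrated with a small Python sketch. It is not the interface of EFA or DOCTOR: the class `MessageFaultInjector`, the fault type names and the in-memory list standing in for a network are all assumptions made for this example, and timing faults and spontaneous messages are left out to keep it short.

```python
import random
from typing import Callable, List

FAULT_TYPES = ("none", "omission", "duplication", "corruption")


class MessageFaultInjector:
    """Sits in front of a delivery function and applies one fault type."""

    def __init__(self, deliver: Callable[[bytes], None], fault: str, seed: int = 0) -> None:
        if fault not in FAULT_TYPES:
            raise ValueError(f"unknown fault type: {fault}")
        self.deliver = deliver
        self.fault = fault
        self.rng = random.Random(seed)  # seeded so that a test case is repeatable

    def send(self, message: bytes) -> None:
        if self.fault == "omission":
            return  # the message is silently lost
        if self.fault == "duplication":
            self.deliver(message)  # the same message is delivered twice
            self.deliver(message)
            return
        if self.fault == "corruption" and message:
            corrupted = bytearray(message)
            corrupted[self.rng.randrange(len(corrupted))] ^= 0xFF  # invert one byte
            self.deliver(bytes(corrupted))
            return
        self.deliver(message)  # "none", or an empty message that cannot be corrupted


# A plain list plays the role of the receiving node in this sketch.
received: List[bytes] = []
MessageFaultInjector(received.append, "duplication").send(b"vote:1")
print(received)
```

Seeding the random generator is a deliberate choice here: a protocol tester usually needs to replay exactly the same fault when investigating a failure.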

Specifying test cases is a key problem in testing of distributed fault-handling protocols. A technique for defining test cases from Petri-net models of protocols in the EFA environment is described by Echtle [51]. An approach for defining test cases from an execution tree description of a protocol is described by Avresky [52].

ORCHESTRA is a framework for testing distributed applications and communication protocols [53][54]. This tool inserts a probe/fault injection layer (PFI) between any two consecutive layers in a protocol stack. The PFI layer can inject deterministic and randomly generated faults in both outgoing and incoming messages. ORCHESTRA was also used in a comparative study of six commercial implementations of the TCP protocol[55].
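
The idea of placing an injection layer between two protocol layers can be sketched as follows. This is a simplified illustration and not ORCHESTRA's actual interface: the class `PfiLayer`, the filter convention (return the message, or `None` to drop it) and the helper `drop_every` are hypothetical, and Python lists stand in for the layers above and below.

```python
from typing import Callable, List, Optional

# A filter returns the (possibly modified) message, or None to drop it.
Filter = Callable[[bytes], Optional[bytes]]


class PfiLayer:
    """Hypothetical probe/fault injection layer between two protocol layers."""

    def __init__(self, send_down: Callable[[bytes], None],
                 deliver_up: Callable[[bytes], None],
                 on_send: Filter, on_receive: Filter) -> None:
        self.send_down = send_down
        self.deliver_up = deliver_up
        self.on_send = on_send
        self.on_receive = on_receive

    def send(self, message: bytes) -> None:
        """Called by the layer above for each outgoing message."""
        filtered = self.on_send(message)
        if filtered is not None:
            self.send_down(filtered)

    def receive(self, message: bytes) -> None:
        """Called by the layer below for each incoming message."""
        filtered = self.on_receive(message)
        if filtered is not None:
            self.deliver_up(filtered)


def drop_every(n: int) -> Filter:
    """Deterministic fault: drop every n-th message seen by this filter."""
    count = 0

    def apply(message: bytes) -> Optional[bytes]:
        nonlocal count
        count += 1
        return None if count % n == 0 else message

    return apply


wire: List[bytes] = []         # stands in for the lower layer
application: List[bytes] = []  # stands in for the upper layer
pfi = PfiLayer(wire.append, application.append, drop_every(2), lambda m: m)
for segment in (b"a", b"b", b"c"):
    pfi.send(segment)
pfi.receive(b"ack")
print(wire, application)
```

Because the neighbouring layers only see the `send` and `receive` calls, neither of them needs to be modified for the experiment, which is the main attraction of this kind of interposition.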

The failure of a distributed protocol often depends on the global state of the distributed system. It is therefore desirable for a human tester to control the global state of the target system. This involves controlling the states of a number of individually executing nodes, which is a challenging problem. Two tools that address this problem are CESIUM [56] and LOKI [57].

A fault injection environment for testing of Web services, called WS-FIT, is presented in [58]. This fault injector can decode SOAP messages and inject meaningful faults into them. It uses an instrumented SOAP API that includes hooks allowing manipulation of both incoming and outgoing messages. A comparison of this method and fault injection by code insertion is presented in [59].

Surveys and books on fault injection

There are two surveys of fault injection tools and techniques [Clark 95, Hsueh 97]. However, both these surveys are more than ten years old and partly outdated.

Two books that address fault injection are Software Fault Injection: Inoculating Programs Against Errors by Jeffrey Voas and Gary McGraw [60] and Fault Injection Techniques and Tools for Embedded Systems Reliability Evaluation by Benso and Prinetto (Eds.) [61].

Concluding remarks

The overview on this page covers tools and techniques for injecting three main fault types: physical hardware faults, software design and implementation faults, and faults affecting messages in distributed systems.

Many fault injection tools and techniques were proposed from the late 1980s until the first years of the new millennium. Rather few new fault injection tools and techniques have been described in the literature during the last four years. This does not mean that the interest in, or need for, fault injection has diminished. Instead, researchers and practitioners now put more focus on using fault injection to verify and assess systems and individual fault-handling mechanisms than on developing new tools and techniques.

Over the years, numerous papers have been published that describe results of such assessment and verification exercises. The goal of this overview is not to provide a complete record of all such experiments, but to raise general awareness of the concept and usefulness of fault injection.

References

  1. ^ D. Avresky, J. Arlat, J.-C. Laprie, and Y. Crouzet, "Fault Injection for Formal Testing of Fault Tolerance," IEEE Transactions on Reliability, vol. 45, no. 3, pp. 443-455, 1996.
  2. ^ A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic Concepts and Taxonomy of Dependable and Secure Computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11–33, 2004.
  3. ^ J. Arlat, M. Aguera, L. Amat, Y. Crouzet, J.-C. Fabre, J.-C. Laprie, E. Martins, and D. Powell, "Fault Injection for Dependability Validation — a Methodology and Some Applications," IEEE Transactions on Software Engineering, vol. 16, pp. 166-182, February 1990.
  4. ^ H. Madeira, M. Rela, F. Moreira, and J. G. Silva, "A General Purpose Pin-Level Fault Injector," in European Dependable Computing Conference, Berlin, Germany, 1994, pp. 199-216.
  5. ^ J. Karlsson, U. Gunneflo, P. Liden, and J. Torin, "Two Fault Injection Techniques for Test of Fault Handling Mechanisms," in Int. Test Conference, Nashville, TN, USA, 1991.
  6. ^ G. Miremadi and J. Torin, "Evaluating Processor-Behavior and Three Error-Detection Mechanisms Using Physical Fault-Injection," IEEE Transactions on Reliability, vol. 44, 1995.
  7. ^ A. Rajabzadeh, S. G. Miremadi, and M. Mohandespour, "Experimental Evaluation of Master/Checker Architecture Using Power Supply and Software-Based Fault Injection," in 10th IEEE Int. On-Line Testing Symposium, Funchal, Madeira Island, Portugal, 2004.
  8. ^ IEEE-ISTO 5001 - the Nexus 5001 Forum Standard for a Global Embedded Processor Debug Interface: IEEE-ISTO, Piscataway, NJ 08854 USA, 2003.
  9. ^ IEEE Std 1149.1-2001 - IEEE Standard Test Access Port and Boundary-Scan Architecture: IEEE, Piscataway, NJ 08854 USA, 2001.
  10. ^ J. Aidemark, J. Vinter, P. Folkesson, and J. Karlsson, "GOOFI: Generic Object-Oriented Fault Injection Tool," in IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2001), Göteborg, Sweden, July 2001, pp. 83-88.
  11. ^ P. Yuste, D. de Andres, L. Lemus, J. Serrano, and P. Gil, "Inerte: Integrated Nexus-Based Real-Time Fault Injection Tool for Embedded Systems," in Int. Conf. on Dependable Systems and Networks, San Francisco, California, 2003, p. 669.
  12. ^ M. Rebaudengo and M. Sonza Reorda, "Evaluating the Fault Tolerance Capabilities of Embedded Systems Via BDM," in 17th IEEE VLSI Test Symposium, 1999, pp. 452-457.
  13. ^ J. H. Barton, E. W. Czeck, Z. Z. Segall, and D. P. Siewiorek, "Fault Injection Experiments Using Fiat," IEEE Transactions on Computers, vol. 39, pp. 575-582, 1990.
  14. ^ G. A. Kanawati, N. A. Kanawati, and J. A. Abraham, "Ferrari: A Tool for the Validation of System Dependability Properties," in 22nd Int. Symp. on Fault-Tolerant Computing, 1992, pp. 336-344.
  15. ^ W. I. Kao, R. K. Iyer, and D. Tang, "Fine: A Fault Injection and Monitoring Environment for Tracing the Unix System Behavior under Faults," IEEE Transactions on Software Engineering, vol. 19, pp. 1105-1118, 1993.
  16. ^ W. I. Kao and R. K. Iyer, "Define: A Distributed Fault Injection and Monitoring Environment," in Fault-Tolerant Parallel and Distributed Systems, D. Pradhan and D. Avresky, Eds.: IEEE Computer Society Press, 1995, pp. 252-259.
  17. ^ T. K. Tsai, R. K. Iyer, and D. Jewitt, "An Approach Towards Benchmarking of Fault-Tolerant Commercial Systems," in 26th Int. Symp. on Fault Tolerant Computing, 1996, pp. 314-323.
  18. ^ S. Han, K. G. Shin, and H. A. Rosenberg, "Doctor: An Integrated Software Fault Injection Environment for Distributed Real-Time Systems," in Int. Computer Performance and Dependability Symposium, Erlangen, Germany, 1995, pp. 204-213.
  19. ^ J. Carreira, H. Madeira, and J. G. Silva, "Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers," IEEE Transactions on Software Engineering, vol. 24, pp. 125-136, 1998.
  20. ^ J. Arlat, Y. Crouzet, J. Karlsson, P. Folkesson, E. Fuchs, and G. H. Leber, "Comparison of Physical and Software-Implemented Fault Injection Techniques," IEEE Transactions on Computers, vol. 52, pp. 1115-1133, 2003.
  21. ^ J. W. Kellington, R. McBeth, P. Sanda, and R. N. Kalla, "IBM Power6 Processor Soft Error Tolerance Analysis Using Proton Irradiation," in 3rd IEEE Workshop on Silicon Errors in Logic - Systems Effects (SELSE-3), Austin, TX, USA, 2007.
  22. ^ U. Gunneflo, J. Karlsson, and J. Torin, "Evaluation of Error Detection Schemes Using Fault Injection by Heavy-Ion Radiation," in 19th International Symposium on Fault-Tolerant Computing (FTCS-19), Chicago, IL, USA, 1989, pp. 340-347.
  23. ^ J. Karlsson, U. Gunneflo, P. Liden, and J. Torin, "Two Fault Injection Techniques for Test of Fault Handling Mechanisms," in Int. Test Conference, Nashville, TN, USA, 1991.
  24. ^ G. S. Choi and R. K. Iyer, "Focus: An Experimental Environment for Fault Sensitivity Analysis," IEEE Transactions on Computers, vol. 41, pp. 1515-1526, 1992.
  25. ^ E. Jenn, J. Arlat, M. Rimen, J. Ohlsson, and J. Karlsson, "Fault Injection into VHDL Models: The Mefisto Tool," in 24th Int. Symposium on Fault-Tolerant Computing, Pasadena, CA, USA, 1994, pp. 66-75.
  26. ^ T. A. Delong, B. W. Johnson, and J. A. Profeta, III, "A Fault Injection Technique for VHDL Behavioral-Level Models," IEEE Design & Test of Computers, vol. 13, pp. 24-33, 1996.
  27. ^ N. J. Wang and S. J. Patel, "Restore: Symptom-Based Soft Error Detection in Microprocessors," IEEE Transactions on Dependable and Secure Computing, vol. 3, pp. 188-201, 2006.
  28. ^ K. Goswami, "Depend: A Simulation-Based Environment for System Level Dependability Analysis," IEEE Transactions on Computers, vol. 46, pp. 60-74, 1997.
  29. ^ C. Kwang-Ting, H. Shi-Yu, and D. Wei-Jin, "Fault Emulation: A New Methodology for Fault Grading," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, vol. 18, pp. 1487-1495, 1999.
  30. ^ P. Civera, L. Macchiarulo, M. Rebaudengo, M. Sonza Reorda, and M. Violante, "New Techniques for Efficiently Assessing Reliability of Socs," Microelectronics Journal, vol. 34, pp. 53-61, 2003.
  31. ^ L. Antoni, R. Leveugle, and B. Feher, "Using Run-Time Reconfiguration for Fault Injection Applications," IEEE Transactions on Instrumentation and Measurement, vol. 52, pp. 1468-1473, 2003.
  32. ^ D. de Andrés, J. C. Ruiz, D. Gil, and P. Gil, "Run-Time Reconfiguration for Emulating Transient Faults in VLSI Systems," in Int. Conf. on Dependable Systems and Networks, 2006, pp. 291-300.
  33. ^ D. T. Stott, B. Floering, D. Burke, Z. Kalbarczyk, and R. K. Iyer, "Nftape: A Framework for Assessing Dependability in Distributed Systems with Lightweight Fault Injectors," in IEEE International Computer Performance and Dependability Symposium (IPDS'00), 2000, pp. 91-100.
  34. ^ J. Durães and H. Madeira, "Emulation of Software Faults: A Field Data Study and a Practical Approach," IEEE Transactions on Software Engineering, vol. 32, no. 11, pp. 849-867, November 2006.
  35. ^ J. H. Barton, E. W. Czeck, Z. Z. Segall, and D. P. Siewiorek, "Fault Injection Experiments Using Fiat," IEEE Transactions on Computers, vol. 39, pp. 575-582, 1990.
  36. ^ G. A. Kanawati, N. A. Kanawati, and J. A. Abraham, "Ferrari: A Tool for the Validation of System Dependability Properties," in 22nd Int. Symp. on Fault-Tolerant Computing, 1992, pp. 336-344.
  37. ^ T. K. Tsai, R. K. Iyer, and D. Jewitt, "An Approach Towards Benchmarking of Fault-Tolerant Commercial Systems," in 26th Int. Symp. on Fault Tolerant Computing, 1996, pp. 314-323.
  38. ^ S. Han, K. G. Shin, and H. A. Rosenberg, "Doctor: An Integrated Software Fault Injection Environment for Distributed Real-Time Systems," in Int. Computer Performance and Dependability Symposium, Erlangen, Germany, 1995, pp. 204-213.
  39. ^ J. Carreira, H. Madeira, and J. G. Silva, "Xception: A Technique for the Experimental Evaluation of Dependability in Modern Computers," IEEE Transactions on Software Engineering, vol. 24, pp. 125-136, 1998.
  40. ^ J. Christmansson and R. Chillarege, "Generation of an Error Set That Emulates Software Faults Based on Field Data," in 26th Int. Symp. on Fault Tolerant Computing, 1996, pp. 304-313.
  41. ^ J. Christmansson, M. Hiller, and M. Rimen, "An Experimental Comparison of Fault and Error Injection," in 9th International Symposium on Software Reliability Engineering (ISSRE-9), 1998, pp. 369-378.
  42. ^ W. T. Ng, C. M. Aycock, G. Rajamani, and P. M. Chen, "Comparing Disk and Memory's Resistance to Operating System Crashes," in 7th Int. Symp. on Software Reliability Engineering, 1996, pp. 185-194.
  43. ^ W. T. Ng and P. M. Chen, "The Systematic Improvement of Fault Tolerance in the Rio File Cache," in 29th Int. Symp. on Fault-Tolerant Computing, Madison, WI, USA, 1999, pp. 76-83.
  44. ^ W. I. Kao, R. K. Iyer, and D. Tang, "Fine: A Fault Injection and Monitoring Environment for Tracing the Unix System Behavior under Faults," IEEE Transactions on Software Engineering, vol. 19, pp. 1105-1118, 1993.
  45. ^ W.-L. Kao and R. K. Iyer, "Define: A Distributed Fault Injection and Monitoring Environment," in Fault-Tolerant Parallel and Distributed Systems, D. Pradhan and D. Avresky, Eds.: IEEE Computer Society Press, 1995, pp. 252-259.
  46. ^ J. Durães and H. Madeira, "Emulation of Software Faults by Educated Mutations at Machine-Code Level," in Int. Symp. on Software Reliability Engineering, 2002, pp. 329-340.
  47. ^ J. Durães and H. Madeira, "Definition of Software Fault Emulation Operators: A Field Data Study," in Int. Conf. on Dependable Systems and Networks, 2003, pp. 105-114.
  48. ^ J. Durães, Faultloads Based on Software Faults for Dependability Benchmarking, PhD Thesis, Department of Information Engineering, University of Coimbra, 2006.
  49. ^ K. Echtle and M. Leu, "The EFA fault injector for fault-tolerant distributed system testing," in IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, Amherst, MA, USA, 1992, pp. 28-35.
  50. ^ S. Han, K. G. Shin, and H. A. Rosenberg, "Doctor: An Integrated Software Fault Injection Environment for Distributed Real-Time Systems," in Int. Computer Performance and Dependability Symposium, Erlangen, Germany, 1995, pp. 204-213.
  51. ^ K. Echtle and M. Leu, "Test of fault tolerant distributed systems by fault injection," in Fault-Tolerant Parallel and Distributed Systems, D. Pradhan and D. Avresky, Eds.: IEEE Computer Society Press, 1995, pp. 244-251.
  52. ^ D. Avresky, J. Arlat, J.-C. Laprie, and Y. Crouzet, "Fault Injection for Formal Testing of Fault Tolerance," IEEE Transactions on Reliability, vol. 45, pp. 443-455, 1996.
  53. ^ S. Dawson, F. Jahanian, and T. Mitton, "A software fault injection tool on real-time Mach," in 16th IEEE Real-Time Systems Symposium, 1995, pp. 130-140.
  54. ^ S. Dawson, F. Jahanian, and T. Mitton, "Testing of fault-tolerant and real-time distributed systems via protocol fault injection," in Int. Symp. on Fault Tolerant Computing, 1996, pp. 404-414.
  55. ^ S. Dawson, F. Jahanian, and T. Mitton, "Experiments on six commercial TCP implementations using a software fault injection tool," Software: Practice and Experience, vol. 27, pp. 1385-1410, 1997.
  56. ^ G. A. Alvarez and F. Cristian, "Centralized failure injection for distributed, fault-tolerant protocol testing," in 17th Int. Conf. on Distributed Computing Systems, 1997, pp. 78-85.
  57. ^ R. Chandra, R. M. Lefever, M. Cukier, and W. H. Sanders, "Loki: a state-driven fault injector for distributed systems," in Int. Conf. on Dependable Systems and Networks, 2000, pp. 237-242.
  58. ^ N. Looker, M. Munro, and J. Xie, "Simulating errors in web services," International Journal of Simulation: Systems, Science and Technology, vol. 5, pp. 29-37, 2005.
  59. ^ N. Looker, M. Munro, and J. Xie, "A comparison of network level fault injection with code insertion," in 29th Int. Computer Software and Applications Conference, Edinburgh, Scotland, 2005.
  60. ^ J. M. Voas and G. McGraw, Software fault injection: inoculating programs against errors. New York: Wiley, 1998.
  61. ^ A. Benso and P. Prinetto, Fault injection techniques and tools for embedded systems reliability evaluation. Boston: Kluwer Academic Publishers, 2003.
