A Survey of the EUROCONTROL Approach to Safety Risk Management

R. H. Pierce, MSc.; CSE International Ltd; Flixborough, UK

Keywords: ATM, ESARR, hazard, risk, separation (loss of)

Abstract

This paper takes a critical look at two of the current ESARRs, namely ESARR 3, which covers safety management systems, and ESARR 4, which covers risk tolerability. ESARR 4 defines a series of hazard severities related to degrees of loss of separation (including the most serious, a mid-air collision, which attracts a numeric target level of safety) but in practice these are difficult to apply since any failure can lead to an accident with some (small) probability. Various ways of solving this problem and deriving useful equipment failure rate targets are discussed in the paper, including statistical risk modelling and event tree analysis.

Introduction

EUROCONTROL is an organisation set up by agreement between European governments to further the safety of air navigation within Europe. EUROCONTROL is not synonymous with the European Union and has a much wider membership. One component of EUROCONTROL is the Safety Regulatory Commission (SRC), which has the task of defining a common regulatory standard for Air Traffic Services (ATS) providers. To this end the SRC, acting through its technical support component the Safety Regulation Unit (SRU), has established a set of EUROCONTROL Safety Regulatory Requirements (ESARRs) which all national ATS regulators are intended to apply in their own jurisdiction.

There are six ESARRs currently promulgated.

• ESARR 1: Safety Oversight in ATM
• ESARR 2: Incident Reporting and Assessment
• ESARR 3: Use of Safety Management Systems
• ESARR 4: Risk Assessment and Mitigation in ATM
• ESARR 5: Personnel Requirements
• ESARR 6: Software in ATM Systems

This paper takes a critical look at two of these ESARRs, namely ESARR 3, which covers safety management systems, and ESARR 4, which covers risk tolerability.
ESARR 4 defines a series of hazard severities related to degrees of loss of separation (including the most serious, a mid-air collision) but in practice these are difficult to apply since any failure can lead to an accident with some (small) probability. Various ways of solving this problem and deriving useful equipment failure rate targets are discussed in the paper, including statistical risk modelling and event tree analysis.

Hazards and Risks in ATM

ATM systems are probably unique among plant and transportation control systems in that the human being (the Air Traffic Control Officer or ATCO, as he or she is known in Europe) provides the control function, while the equipment provides information and communications services to support the ATCO in this task. This is in contrast to other industries, where the equipment (control and protection systems) provides the control function with the human being acting in a monitoring or supervisory role. ATCOs provide a continuous control service, again in contrast to most industries, where human intervention is the exception rather than the rule (even in a railway signalling system, where the signaller has to request routes for trains on a frequent basis, the signalling system or “interlocking” is designed to prevent unsafe actions from being carried out).

The main hazards that an ATM system can create at the interface between the air traffic control centre and the aircraft are:

• Loss of separation between aircraft,
• Loss of separation between aircraft and terrain.

There are other issues, for example failure to advise the aircrew of adverse weather (such as windshear) or providing incorrect meteorological data (altimeter pressure settings), but these may be regarded as causal factors leading to one of the top-level hazards. Loss of separation is the main concern of most ATC centres and of ESARR 4 (ref. 1).
Loss of separation can lead to catastrophic consequences, the most notorious recent case being the mid-air collision near Überlingen in southern Germany in July 2002 (ref. 2). In other cases, such as the close approach of two wide-body aircraft over Japan in 2001, a number of passenger and cabin crew injuries were caused either by wake turbulence or by violent avoiding action.

Standard separation between aircraft in Europe is 5 nm horizontally or 1000 ft vertically where a radar surveillance service is available (greater separations are applied where there is no radar cover, and a 3 nm separation is often applied on final approach). Erosion of separation is regarded as an incident reportable to the regulator. It will be clear to the reader that not all losses of separation are hazardous; there is a continuum of severity between a minor technical infringement of separation and a very close approach or collision.

Causal factors for loss of separation can be divided into ATCO error, pilot error or equipment failure. As might be expected, the contribution of human error to losses of separation is dominant. Estimates of the proportion of ATM-related incidents caused directly by human error, as opposed to those caused by equipment failure, range from 95% to 98% or even more. However, we should note that a major contributory factor to the Überlingen accident was the severely (intentionally) degraded mode in which the ATM equipment was operating on the evening of the accident, which was not properly understood by the ATCO in control.

ESARR 3

ESARR 3 (ref. 3) requires all ATS providers to implement a safety management system (SMS) and sets out requirements for the topics that the SMS must address.
The safety objective of the ATM service is stated to be “while providing an ATM service, the principal safety objective is to minimise the ATM contribution to the risk of an aircraft accident so far as reasonably practicable”.

Risk reduction so far as reasonably practicable is exactly equivalent to the principle that risks should be As Low As Reasonably Practicable (ALARP). The ALARP principle originated in the United Kingdom and is well explained in the UK Health and Safety Executive document Reducing Risks, Protecting People (ref. 4). In a nutshell, ALARP calls for risk reduction measures to be taken until any further expenditure would be out of proportion to the gain achieved. Of course, the risk must be tolerable in the first place before the ALARP principle is applied: one cannot put an intolerably unsafe system into service on the ground that it is not reasonably practicable to improve it. The degree of expenditure called for by ALARP will depend on the residual risk. If this is towards the upper end of the tolerability band then considerable expenditure may be called for; if the risk is already low then only modest improvements may be required.

In practice, the ALARP principle is usually implemented by means of a hazard-risk index or risk classification scheme, which will be familiar to many safety engineers. A typical risk classification scheme applied to ATM equipment is shown in the following matrix (application of such a scheme to human error is controversial and will not be considered further in this paper).
The hazard severity classes have the following summarised interpretations:

Severity Class   Interpretation
1                Inability to provide any form of air traffic control service
2                Ability to provide an air traffic control service is severely compromised for a significant period of time
3                Ability to provide an air traffic control service is impaired for a significant period of time
4                No immediate effect on safety but persistence may cause a loss of safety margins

Table 1 - ATM Equipment Hazard Severities

Hazard Occurrence    Severity Class
Rate (/h)            4    3    2    1
> 10^-3              C    A    A    A
10^-3 to 10^-4       D    B    A    A
10^-4 to 10^-5       D    C    B    A
10^-5 to 10^-6       D    D    C    B
10^-6 to 10^-7       D    D    D    C
< 10^-7              D    D    D    D

Table 2 - ATM Equipment Risk Classification Matrix

In this case, risk class A is intolerable, classes B and C are in the ALARP area, while class D is acceptable and there is no need for further application of the ALARP principle. Under the ALARP principle, risk class D should be the target for all systems, and a higher risk is only acceptable if class D cannot reasonably be achieved.

Any ATM hazard can lead to a loss of separation and to a finite, although generally very small, probability of a mid-air collision: an accident with multiple fatalities. It is therefore not useful to classify ATM equipment hazard severity by means of the expected harm, since in this case all hazards would be classed as Catastrophic in severity. This would lead to unreasonably stringent failure rate targets being necessary to achieve risk class D, and would not distinguish hazards that in practice are of differing severity and require more or less stringent control measures.

A more useful classification is to consider the effect of the hazard on the ability of the ATCOs to maintain a safe air traffic service (and therefore indirectly on the probability of a loss of separation and mid-air collision). This approach has been used by a number of ATM service providers in Europe for a number of years, and is expressed in Table 1.
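As an illustration of how a matrix such as Table 2 is applied, the following minimal Python sketch looks up the risk class for a given equipment hazard severity and occurrence rate. The function and variable names are hypothetical, and the treatment of a rate lying exactly on a band boundary is an assumption, since the table does not say which band such a rate belongs to.

```python
# Risk classes transcribed from Table 2. Each row pairs the exclusive lower
# bound of an occurrence-rate band (per hour) with the risk classes for
# severity classes 4, 3, 2 and 1, in that column order.
TABLE_2 = [
    (1e-3, "CAAA"),
    (1e-4, "DBAA"),
    (1e-5, "DCBA"),
    (1e-6, "DDCB"),
    (1e-7, "DDDC"),
    (0.0, "DDDD"),
]
SEVERITY_COLUMN = {4: 0, 3: 1, 2: 2, 1: 3}

# Tolerability of each risk class as described in the text.
TOLERABILITY = {"A": "intolerable", "B": "ALARP", "C": "ALARP", "D": "acceptable"}


def risk_class(severity: int, rate_per_hour: float) -> str:
    """Return the Table 2 risk class for a hazard severity and occurrence rate.

    Assumption: a rate exactly on a band boundary is placed in the less
    frequent band (for example, exactly 1e-3 falls in "10^-3 to 10^-4").
    """
    if severity not in SEVERITY_COLUMN:
        raise ValueError("severity must be 1, 2, 3 or 4")
    if rate_per_hour < 0:
        raise ValueError("rate_per_hour must be non-negative")
    column = SEVERITY_COLUMN[severity]
    for lower_bound, classes in TABLE_2:
        if rate_per_hour > lower_bound:
            return classes[column]
    return TABLE_2[-1][1][column]  # a rate of exactly zero


if __name__ == "__main__":
    for severity, rate in [(1, 5e-6), (2, 5e-6), (3, 5e-4)]:
        cls = risk_class(severity, rate)
        print(severity, rate, cls, TOLERABILITY[cls])
```

A project would typically run such a lookup for each identified hazard and then concentrate risk reduction effort on anything that falls in class A, B or C.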
The approach taken by ESARR 4, and the difficulties it presents, are discussed in the next section.

ESARR 4 Requirements for Risk Control

ESARR 4 defines five levels of hazard severity:

Severity Class   Description
1                Accidents, including mid-air collisions or controlled flight into terrain
2                Serious incidents
3                Major incidents
4                Significant incidents
5                No immediate effect on safety

Table 3 - ESARR 4 Incident and Hazard Severities

Apart from Class 1 incidents, the other classes are not defined in detail but examples are given of the effect of the hazard on operations. For example, a Class 2 incident is one which involves “large reduction in separation ... without crew or ATC fully controlling the situation or able to recover from the situation” or “one or more aircraft deviating from their intended clearance so that abrupt manoeuvre is required to avoid collision ...”. These can only be examples, because “provision of an erroneous ATC clearance such that abrupt manoeuvre is required to avoid collision ...” would be an equally valid example of a Class 2 hazard. The initiating event for the sequence that led to the Überlingen accident was in fact an incorrect flight level clearance. The ESARR 4 hazard descriptions are consistent with those in ESARR 2, which is concerned with establishing a common framework for incident data collection and reporting.

A severity 1 occurrence is by definition not a hazard but an accident. For such an incident, ESARR 4 sets out a Target Level of Safety (TLS) in terms of accident rates (mishaps involving harm to human beings) caused by ATM factors. The TLS is 1.55 × 10^-8 accidents per flight hour. This may be thought of as the upper limit of tolerability for accidents resulting from the ATM hazards discussed in the previous section. Application of the ALARP principle from ESARR 3, however, would imply that service providers should attempt to do better than the TLS.
This is in fact stated indirectly in ESARR 4 in the sentence “As a necessary complement to demonstrating that these quantitative objectives are met, additional safety management considerations shall be applied so that more safety is added to the ATM system whenever reasonable”. This appears to be a rather convoluted re-statement of the ALARP principle.

Tolerable occurrence rates of Class 2 to Class 4 hazards are not given in the present version of ESARR 4, although it is intended that this information should be included when sufficient incident data have been collected to establish the relationship between the occurrence rate of incidents of various classes and accidents. Individual ATM service providers are thus left with the problem of deciding for themselves what maximum tolerable probabilities to assign to the other hazard severities. However, the provision of an overall TLS is a valuable first step in setting quantitative safety objectives.

It should be noted in this regard that the TLS is set in terms of flight hours. Any ATM centre should re-express the TLS in units which are appropriate to it, given its traffic levels and the average time for which a flight is under the control of that centre (for an en-route centre in Europe, this is typically about 20 minutes). If such a centre handles one million flights per year, the TLS for that centre in terms of accidents per operating hour would be close to 6 × 10^-7.

As was discussed earlier, it is very difficult to use the ESARR 4 risk classification scheme directly to set equipment hazard rate targets, for the simple reason that it is not possible to examine an equipment failure (or even a human error) and state immediately that it will lead to any of the above incident severity classes. Indeed, ESARR 4 recognises this by stating that the risk classification scheme only applies to overall safety performance at the national level and is not directly applicable to the classification of individual hazards.
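The conversion of the TLS from flight hours to operating hours quoted above can be checked with a few lines of Python. This is a sketch only: the function name is hypothetical, and it assumes that the centre operates continuously (8760 hours per year), which the text does not state explicitly.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: the centre operates around the clock


def tls_per_operating_hour(
    tls_per_flight_hour: float,
    flights_per_year: float,
    mean_minutes_under_control: float,
) -> float:
    """Re-express a per-flight-hour TLS as accidents per centre operating hour."""
    # Flight hours accumulated under the centre's control in one year.
    flight_hours_per_year = flights_per_year * mean_minutes_under_control / 60.0
    # Tolerable number of accidents per year within that exposure.
    accidents_per_year = tls_per_flight_hour * flight_hours_per_year
    return accidents_per_year / HOURS_PER_YEAR


if __name__ == "__main__":
    # Figures from the text: TLS of 1.55e-8 per flight hour, one million
    # flights per year, about 20 minutes under control per flight.
    centre_tls = tls_per_operating_hour(1.55e-8, 1_000_000, 20)
    print(f"{centre_tls:.2e} accidents per operating hour")
```

With the figures given in the text the result is a little under 6 × 10^-7 per operating hour, in line with the value quoted.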
It is therefore necessary to devise a scheme for setting safety requirements which is consistent with ESARR 4 and with the ALARP principle. Furthermore, when considering ATM equipment we must remember that the TLS has to be apportioned between equipment and human contributions. Because the proportion of ATM incidents due to human error is very high, as discussed above, only about 5% of the TLS can be allocated to equipment. So the TLS for equipment hazards needs to be something like 7.75 × 10^-10 per flight hour (or 3 × 10^-8 per operating hour).

How is this problem to be solved? Various methods can be used, including consequence modelling and calibration of a suitable risk classification matrix. These are discussed in the next two sections.

A point to note here is that ESARR 4 does not allow certain systems classed as “safety nets” to be taken into account in determining quantitative safety requirements for ATM systems or equipment. A case in point is the Traffic Alert and Collision Avoidance System or TCAS. This is a system which detects the presence of another aircraft and warns the crew if it is likely to approach too closely; if both aircraft are TCAS equipped the systems negotiate a resolution advisory (RA) which the aircrew should obey. TCAS is a very effective system in that it frequently prevents infringements of separation turning into more serious incidents. However, in the Überlingen accident it was a contributory factor in the accident sequence, since the RA issued to one of the aircraft contradicted the instruction of the ATCO on duty and the pilot chose to obey the ATCO rather than the TCAS. It seems reasonable therefore not to take credit for this system.

Event Tree Modelling

One approach to the application of ESARR 4 to equipment is to carry out detailed consequence modelling for each ATM equipment failure hazard to determine the range of possible outcomes in terms of their severity class, and the relative probability of each outcome.
Using this information a safety requirement can be stated which would ensure that the tolerable hazard rates at the ESARR 4 level would be met. The event tree method is well suited for this purpose. The initiating event is the hazard (which will be some equipment failure mode such as loss of the radar display). The subsequent events in the tree represent the success or failure of the various mitigating factors (barriers to escalation, as they are sometimes known) that stand between the hazard at this level and the various outcomes, expressed as ESARR 4 hazard severities.

The following example of an event tree illustrates how the method might be used. The system being considered is an electronic flight data display for airport control towers, which gives the tower controller information such as the destination, callsign, SSR transponder code, take-off time and departure route of each aircraft. The system is intended to assist the controller by deciding the sequence of departures and the time at which each aircraft can be cleared to take off. A crucial factor in this calculation is the wake turbulence category of each aircraft. It is well known that aircraft create severe vortices behind them when taking off, and this can affect the stability of the following aircraft. A certain time therefore has to be left between successive aircraft to let the resulting turbulence subside. Aircraft are divided into four wake turbulence categories depending on their weight, and there are mandatory time separations to be applied between take-offs depending on the wake turbulence categories of the leading and following aircraft.

The hazard to be considered is “calculation of incorrect take-off times with respect to wake turbulence categories”. In this case the hazard could lead directly to an accident; for example, a light turboprop aircraft taking off a minute behind a Boeing 747 could easily lose stability and crash into the ground.
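The arithmetic of an event tree of this kind can be sketched in Python as follows. This is an illustration only: the barrier names, the success probabilities, the initiating hazard rate and the mapping of each branch to an ESARR 4 severity class are all hypothetical placeholders chosen for the example, not values taken from any analysis.

```python
def event_tree_outcomes(hazard_rate, barriers, final_severity):
    """Propagate an initiating hazard rate through a chain of barriers.

    barriers is a list of (name, p_success, severity_if_success) tuples.
    If a barrier succeeds, the sequence ends with that severity class; if it
    fails, the sequence passes to the next barrier. Failure of every barrier
    ends in final_severity. Returns {severity_class: rate_per_hour}.
    """
    outcomes = {}
    rate_reaching_barrier = hazard_rate
    for _name, p_success, severity in barriers:
        stopped = rate_reaching_barrier * p_success
        outcomes[severity] = outcomes.get(severity, 0.0) + stopped
        rate_reaching_barrier *= 1.0 - p_success
    outcomes[final_severity] = outcomes.get(final_severity, 0.0) + rate_reaching_barrier
    return outcomes


# Hypothetical figures for the wake turbulence hazard (illustration only).
HAZARD_RATE_PER_HOUR = 1e-4
BARRIERS = [
    ("controller notices the incorrect take-off time", 0.9, 5),
    ("aircrew query the clearance", 0.5, 4),
    ("following aircraft recovers from the wake encounter", 0.99, 2),
]

if __name__ == "__main__":
    result = event_tree_outcomes(HAZARD_RATE_PER_HOUR, BARRIERS, final_severity=1)
    for severity in sorted(result):
        print(f"ESARR 4 class {severity}: {result[severity]:.2e} per hour")
```

The rates of all outcomes sum to the initiating hazard rate, so the rate of the Class 1 outcome can be compared directly with the equipment share of the TLS, and the tolerable hazard rate adjusted until that comparison is satisfied.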
The event tree for this scenario is given in Figure 1 below. This is not intended to be a complete analysis of the situation but is reasonably representative of the method; however, the actual success probabilities are purely conjectural.

Figure 1 - Event Tree for Accident Scenario

There are some problems with using an event tree in this way. The first is that assigning probabilities to the intermediate events is often difficult. For example, it may be hard to assign a credible success probability to an event involving human behaviour such as the detection of an anomalous situation, or the application of a corrective action. Unless good human error statistics have been kept, which is not often the case, it may be necessary to rely on expert judgement, which can be contentious especially if the situation being analysed is only infrequently encountered. In other cases it may be necessary to carry out detailed separation infringement or collision probability modelling to determine the relative likelihood of the ESARR 4 severity classes. An example of collision probability modelling using simulation and expert judgement is given in an airspace risk assessment conducted by Airservices Australia (ref. 5), although this was not concerned with assigning safety requirements to ATM equipment.

The other problem is simply that developing an event tree or other consequence model for every equipment failure mode can be very time consuming.

Calibrated Risk Classification Matrix

In this approach, a conventional risk classification matrix as shown in Table 2 is used, but the hazard rate figures are adjusted to be consistent with the TLS for the ATS unit in question (converted to units of accidents per operating hour). Once this calibration of the risk classification matrix has been achieved, individual projects can use it to derive safety requirements in terms of hazard occurrence rates without performing detailed modelling.
A simple method of calibration is to consider a Class 1 hazard at the equipment level (inability to provide any form of ATC service, see Table 1) and decide the relative likelihood of an ESARR 4 severity class 1 incident, in other words an aircraft accident. This will then set the maximum tolerable occurrence rate for an equipment Class 1 hazard (the maximum occurrence rate which will still achieve a Class B risk). Hazards of lesser severity are then assigned successively less stringent targets, usually relaxed by an order of magnitude for each severity class, as shown in Table 2.

Even if an equipment Class 1 hazard occurs (which is normally a complete loss of communications between ATC and aircraft), and the aircraft are left to proceed without instructions from ATC, there is still only a very small chance that a mid-air collision will occur. Informally, this is mainly because the sky is a big place and aircraft are relatively small, and because the aircraft will generally be properly separated before the system failure occurs. Aircrew can also mitigate the collision risk by other means, such as contacting other ATM service providers (such as major airports or adjacent centres) for emergency traffic information, and increased vigilance. They can in practice also rely on TCAS but, as noted above, this cannot be taken into account. Modelling work and incident (AIRPROX) statistics indicate that there are at least two orders of magnitude of mitigation between the Class 1 failure and a mid-air collision, and possibly as much as three orders of magnitude.

As with all risk classification and safety requirements derivation schemes, it is inappropriate to consider the risk from each hazard individually, because the contribution to the overall achieved incident and accident rate is the sum of the incident rates from each hazard (assuming that all the hazards can occur independently). Treating hazards individually in this way is sometimes referred to as “salami slicing risk”.
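The calibration step described above can be sketched numerically as follows. This is a simplified illustration under stated assumptions: the function name and parameters are hypothetical, the equipment TLS of 3 × 10^-8 per operating hour is the figure quoted earlier, and the `budget_share` parameter is one simple way (assumed here, not prescribed by ESARR 4) of allowing for the fact that several hazards draw on the same risk budget.

```python
def calibrate_targets(
    equipment_tls_per_hour: float,
    p_accident_given_class1: float,
    budget_share: float = 1.0,
    severity_classes: int = 4,
    step: float = 10.0,
) -> dict:
    """Derive maximum tolerable hazard rates per equipment severity class.

    The Class 1 target is the share of the equipment TLS allotted to the
    hazard, divided by the probability that a Class 1 hazard ends in an
    accident. Each less severe class is relaxed by a factor of `step`.
    """
    if not 0.0 < p_accident_given_class1 <= 1.0:
        raise ValueError("p_accident_given_class1 must be in (0, 1]")
    class1_target = equipment_tls_per_hour * budget_share / p_accident_given_class1
    return {
        severity: class1_target * step ** (severity - 1)
        for severity in range(1, severity_classes + 1)
    }


if __name__ == "__main__":
    # Two and three orders of magnitude of mitigation, as suggested in the text.
    for p_accident in (1e-2, 1e-3):
        targets = calibrate_targets(3e-8, p_accident)
        print(p_accident, {s: f"{r:.0e}" for s, r in targets.items()})
```

With two orders of magnitude of mitigation, the sketch gives a Class 1 target of about 3 × 10^-6 per hour; with three orders, about 3 × 10^-5 per hour. Setting `budget_share` below 1.0 tightens every target in proportion, which is the point of considering the hazards collectively rather than one at a time.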
Generally, some assumption is made about the total number of such hazards that could occur, and the failure rate bands in the risk classification matrix are set accordingly. For example, it may be assumed that there is only one system which could cause a Class 1 failure, and a further 10 systems each with 5 Class 2 and 5 Class 3 hazards. However, this kind of assumption needs to be revisited regularly to cater for changes to the equipment in use, as discussed in the next section.

Altering the Balance of Risk between Equipment and People

As ATM equipment becomes more advanced and offers ATC staff more tools to manage traffic, an increasing number of hazards can be created by equipment failures, simply because of the number of new functions that are provided and could then fail. This could result in an increased contribution to the risk budget from equipment failures, and in the end may require a re-calibration of the risk classification matrix. However, the new features and functions may well improve human error rates, and therefore reduce the overall contribution from human error. Since human error contributes something like 95% of the overall risk, any improvements in this area should result in an overall risk reduction even if the equipment contribution rises somewhat.

Conclusions

The EUROCONTROL requirements in ESARR 3 and ESARR 4 set out the overall risk tolerability framework for ATM in Europe. Although ESARR 4 is very valuable in providing a numeric target for the ATM contribution to aircraft accidents, it does not provide detailed requirements for the control of less severe hazards, and individual ATS units must apply it with care and thought to their own situations. Moreover, ESARR 3 and ESARR 4 also call for risk to be reduced to a level that is ALARP. Proper application of ESARRs 3 and 4 should therefore result in a European ATM system which is very safe, and equally safe regardless of which country’s airspace is being traversed.

References

1. EUROCONTROL SRC, ESARR 4, Risk Assessment and Mitigation in ATM, Edition 1.0, 2001.
2. Bundesstelle für Flugunfalluntersuchung, Investigation Report AX001-1-2/03. Braunschweig, 2004.
3. EUROCONTROL SRC, ESARR 3, Use of Safety Management Systems by ATM Service Providers, Edition 1.0, 2000.
4. Health and Safety Executive, Reducing Risks, Protecting People: HSE’s Decision Making Process. London: 2001.
5. Airservices Australia, Airspace Risk Assessment, Class E over Class D Towers, Version 1.0, 2004, http://www.airservicesaustralia.com/pilotcentre/nas/.

Biography

R. H. Pierce, MSc., Consulting Engineer, CSE International Ltd, Glanford House, Flixborough, Scunthorpe DN15 8SN, UK. Telephone: +44 1724 862169, facsimile: +44 1724 856256, e-mail: ron.pierce@cse-international.com.

Mr. Pierce has extensive experience in software engineering topics (compilers, program analysis tools and software engineering methods). He has over 12 years’ experience in software and system safety assessment for industry domains including air traffic management and railway control and signalling systems.