A Survey of the EUROCONTROL Approach to Safety Risk Management
R. H. Pierce, MSc.; CSE International Ltd; Flixborough, UK
Keywords: ATM, ESARR, hazard, risk, separation (loss of)
Abstract
This paper takes a critical look at two of the current ESARRs, namely ESARR 3, which covers safety
management systems, and ESARR 4, which covers risk tolerability. ESARR 4 defines a series of hazard
severities related to degrees of loss of separation (including the most serious, a mid-air collision, which
attracts a numeric target level of safety) but in practice these are difficult to apply since any failure can
lead to an accident with some (small) probability. Various ways of solving this problem and deriving
useful equipment failure rate targets are discussed in the paper including statistical risk modelling and
event tree analysis.
Introduction
EUROCONTROL is an organisation set up by agreement between European governments to further the
safety of air navigation within Europe. EUROCONTROL is not synonymous with the European Union
and has a much wider membership. One component of EUROCONTROL is the Safety Regulatory
Commission (SRC), which has the task of defining a common regulatory standard for Air Traffic Services
(ATS) providers.
To this end the SRC, acting through its technical support component the Safety Regulation Unit (SRU),
has established a set of EUROCONTROL Safety Regulatory Requirements (ESARRs) which all national
ATS regulators are intended to apply in their own jurisdictions. Six ESARRs are currently
promulgated:
• ESARR 1: Safety Oversight in ATM
• ESARR 2: Incident Reporting and Assessment
• ESARR 3: Use of Safety Management Systems
• ESARR 4: Risk Assessment and Mitigation in ATM
• ESARR 5: Personnel Requirements
• ESARR 6: Software in ATM Systems.
This paper takes a critical look at two of these ESARRs, namely ESARR 3, which covers safety
management systems, and ESARR 4, which covers risk tolerability. ESARR 4 defines a series of hazard
severities related to degrees of loss of separation (including the most serious, a mid-air collision) but in
practice these are difficult to apply since any failure can lead to an accident with some (small) probability.
Various ways of solving this problem and deriving useful equipment failure rate targets are discussed in
the paper including statistical risk modelling and event tree analysis.
Hazards and Risks in ATM
ATM systems are probably unique among plant and transportation control systems in that the human
being (the Air Traffic Control Officer or ATCO as he or she is known in Europe) provides the control
function, while the equipment provides information and communications services to support the ATCO in
this task. This is in contrast to other industries where the equipment (control and protection systems)
provides the control function with the human being acting in a monitoring or supervisory role. ATCOs
provide a continuous control service, again in contrast to the case in most industries where human
intervention is the exception rather than the rule (even in a railway signalling system where the signaller
has to request routes for trains on a frequent basis, the signalling system or “interlocking” is designed to
prevent unsafe actions from being carried out).
The main hazards that an ATM system can create at the interface between the air traffic control centre and
the aircraft are:
• Loss of separation between aircraft,
• Loss of separation between aircraft and terrain.
There are other issues, for example failure to advise the aircrew of adverse weather (such as windshear) or
providing incorrect meteorological data (altimeter pressure settings), but these may be regarded as causal
factors leading to one of the top level hazards. Loss of separation is the main concern of most ATC
centres and of ESARR 4 (ref. 1). Loss of separation can lead to catastrophic consequences, the most
notorious recent case being the mid-air collision near Überlingen in southern Germany in July 2002
(ref. 2). In other cases, such as the close approach of two wide-body aircraft over Japan in 2001, a
number of passenger and cabin crew injuries were caused either by wake turbulence or violent avoiding
action.
Standard separation between aircraft in Europe is 5 nm horizontally or 1000 ft vertically where a
radar surveillance service is available (greater separations are applied where there is no radar cover, and a
3 nm separation is often applied on final approach). Erosion of separation is regarded as an incident
reportable to the regulator. It will be clear to the reader that not all losses of separation are
hazardous; in fact there is a continuum of severity between a minor technical infringement of
separation and a very close approach or collision.
Causal factors for loss of separation can be divided into ATCO error, pilot error or equipment failure. As
might be expected, the contribution of human error to losses of separation is dominant. Estimates of the
proportion of ATM-related incidents caused directly by human error as opposed to those caused by
equipment failure range from 95% to 98% or even more. However, we should note that a major
contributory factor to the Überlingen accident was the severely (intentionally) degraded mode in which
the ATM equipment was operating on the evening of the accident, which was not properly understood by
the ATCO in control.
ESARR 3
ESARR 3 (ref. 3) requires all ATS providers to implement a safety management system (SMS) and sets
out requirements for the topics that the SMS must address. The safety objective of the ATM service is
stated to be “while providing an ATM service, the principal safety objective is to minimise the ATM
contribution to the risk of an aircraft accident so far as reasonably practicable”.
Risk reduction so far as reasonably practicable is exactly equivalent to the principle that risks should be
As Low As Reasonably Practicable (ALARP). The ALARP principle originated in the United Kingdom
and is well explained in the UK Health and Safety Executive document Reducing Risks, Protecting
People (ref. 4). In a nutshell, ALARP calls for risk reduction measures to be taken until any further
expenditure would be out of proportion to the gain achieved. Of course, the risk must be tolerable in the
first place before the ALARP principle is applied: one cannot put an intolerably unsafe system into
service on the ground that it is not reasonably practicable to improve it. The degree of expenditure called for
by ALARP will depend on the residual risk. If this is towards the upper end of the tolerability band then
considerable expenditure may be called for; if the risk is already low then only modest improvements may
be required.
In practice, the ALARP principle is usually implemented by means of a hazard-risk index or risk
classification scheme, which will be familiar to many safety engineers. A typical risk classification
scheme applied to ATM equipment is shown in the following matrix (application of such a scheme to
human error is controversial and will not be considered further in this paper).
The hazard severity classes have the following summarised interpretations:
Severity Class   Description
1                Inability to provide any form of air traffic control service
2                Ability to provide an air traffic control service is severely compromised for a significant period of time
3                Ability to provide an air traffic control service is impaired for a significant period of time
4                No immediate effect on safety but persistence may cause a loss of safety margins
Table 1 - ATM Equipment Hazard Severities
Hazard Occurrence     Severity Class
Rate (/h)             4    3    2    1
> 10^-3               C    A    A    A
10^-3 to 10^-4        D    B    A    A
10^-4 to 10^-5        D    C    B    A
10^-5 to 10^-6        D    D    C    B
10^-6 to 10^-7        D    D    D    C
< 10^-7               D    D    D    D
Table 2 - ATM Equipment Risk Classification Matrix
In this case, a risk class A is intolerable, class B and C are in the ALARP area, while class D is acceptable
and there is no need for further application of the ALARP principle. Under the ALARP principle a risk
class D should be the target for all systems and a higher risk is only acceptable if class D cannot
reasonably be achieved.
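The lookup described by Table 2 can be sketched in a few lines of Python. This is an illustration only: the function name, the data layout and the treatment of rates that fall exactly on a band boundary (assigned here to the higher, more pessimistic band) are assumptions, not part of the scheme as published.

```python
import math

# Rows of Table 2, keyed by the upper bound of each occurrence-rate band (per hour).
# Each string gives the risk class for equipment severity classes 4, 3, 2, 1.
BANDS = [
    (1e-7, "DDDD"),      # < 10^-7
    (1e-6, "DDDC"),      # 10^-6 to 10^-7
    (1e-5, "DDCB"),      # 10^-5 to 10^-6
    (1e-4, "DCBA"),      # 10^-4 to 10^-5
    (1e-3, "DBAA"),      # 10^-3 to 10^-4
    (math.inf, "CAAA"),  # > 10^-3
]
SEVERITY_COLUMN = {4: 0, 3: 1, 2: 2, 1: 3}


def risk_class(rate_per_hour, severity):
    """Return the Table 2 risk class (A to D) for a hazard rate and severity class."""
    if rate_per_hour <= 0 or math.isinf(rate_per_hour):
        raise ValueError("rate must be a positive, finite number")
    column = SEVERITY_COLUMN[severity]
    for upper_bound, row in BANDS:
        # Assumption: a rate exactly on a boundary falls into the higher band.
        if rate_per_hour < upper_bound:
            return row[column]


print(risk_class(5e-6, 1))  # severity 1 hazard at 5 x 10^-6 per hour
print(risk_class(5e-6, 2))  # same rate, severity 2
```

A severity 1 hazard would need an occurrence rate below 10^-7 per hour to reach class D under this matrix, while a severity 4 hazard reaches class D at any rate below 10^-3 per hour.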
Any ATM hazard can lead to a loss of separation and to a finite, although generally very small,
probability of a mid-air collision - an accident with multiple fatalities. It is therefore not useful to classify
ATM equipment hazard severity by means of the expected harm, since in this case all hazards would be
classed as Catastrophic in severity. This would lead to unreasonably stringent failure rate targets being
necessary to achieve a risk class D, and would not distinguish hazards that in practice are of differing
severity and require more or less stringent control measures. A more useful classification is to consider
the effect of the hazard on the ability of the ATCOs to maintain a safe air traffic service (and therefore
indirectly on the probability of a loss of separation and mid-air collision). This approach has been used by
a number of ATM service providers in Europe for a number of years, and is expressed in Table 1.
The approach taken by ESARR 4, and the difficulties it presents, is discussed in the next section.
ESARR 4 Requirements for Risk Control
ESARR 4 defines five levels of hazard severity:
Severity Class   Description
1                Accidents, including mid-air collisions or controlled flight into terrain
2                Serious incidents
3                Major incidents
4                Significant incidents
5                No immediate effect on safety
Table 3 - ESARR 4 Incident and Hazard Severities
Apart from Class 1 incidents, the other classes are not defined in detail but examples are given of the
effect of the hazard on operations. For example, a Class 2 incident is one which involves “large reduction
in separation ... without crew or ATC fully controlling the situation or able to recover from the situation”
or “one or more aircraft deviating from their intended clearance so that abrupt manoeuvre is required to
avoid collision ...”. These can only be examples, because “provision of an erroneous ATC clearance
such that abrupt manoeuvre is required to avoid collision ...” would be an equally valid example of a
Class 2 hazard. The initiating event for the sequence that led to the Überlingen accident was in fact an
incorrect flight level clearance.
The ESARR 4 hazard descriptions are consistent with those in ESARR 2, which is concerned with
establishing a common framework for incident data collection and reporting.
A severity 1 occurrence is by definition not a hazard but an accident. For such an incident, ESARR 4 sets
out a Target Level of Safety (TLS) in terms of accident rates (mishaps involving harm to human beings)
caused by ATM factors. The TLS is 1.55 × 10^-8 accidents per flight hour. This may be thought of as the
upper limit of tolerability for accidents resulting from the ATM hazards discussed in the previous section.
Application of the ALARP principle from ESARR 3 however would imply that service providers should
attempt to do better than the TLS. This is in fact stated indirectly in ESARR 4 in the sentence “As a
necessary complement to demonstrating that these quantitative objectives are met, additional safety
management considerations shall be applied so that more safety is added to the ATM system whenever
reasonable”. This appears to be a rather convoluted re-statement of the ALARP principle.
Tolerable occurrence rates of Class 2 to Class 4 hazards are not given in the present version of ESARR 4,
although it is intended that this information should be included when sufficient incident data has been
collected to establish the relationship between the occurrence rate of incidents of various classes and
accidents. Individual ATM service providers are thus left with the problem of deciding for themselves
what maximum tolerable probabilities to assign to the other hazard severities. However, the provision of
an overall TLS is a valuable first step in setting quantitative safety objectives.
It should be noted in this regard that the TLS is set in terms of flight hours. Any ATM centre should
re-express the TLS in terms of units which are appropriate to it, given its traffic levels and the average time
for which a flight is under the control of that centre (for an en-route centre in Europe, this is typically
about 20 minutes). If such a centre handles one million flights per year, the TLS for that centre in terms of
accidents per operating hour would be close to 6 × 10^-7.
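The conversion from flight hours to operating hours can be checked with a short calculation. The sketch below uses the figures quoted in the text and assumes that the centre operates continuously (8760 hours per year); the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: the centre operates around the clock


def tls_per_operating_hour(tls_per_flight_hour, flights_per_year, minutes_under_control):
    """Re-express a per-flight-hour TLS as accidents per centre operating hour."""
    flight_hours_per_year = flights_per_year * minutes_under_control / 60.0
    # Average number of flight hours handled during one hour of operation.
    flight_hours_per_operating_hour = flight_hours_per_year / HOURS_PER_YEAR
    return tls_per_flight_hour * flight_hours_per_operating_hour


centre_tls = tls_per_operating_hour(1.55e-8, 1_000_000, 20)
print(f"{centre_tls:.2e} accidents per operating hour")
```

With one million flights per year and 20 minutes per flight, the centre handles about 38 flight hours in each operating hour, which gives a figure just under 6 × 10^-7.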
As was discussed earlier, it is very difficult to use the ESARR 4 risk classification scheme directly to set
equipment hazard rate targets, for the simple reason that it is not possible to examine an equipment failure
(or even a human error) and state immediately that it will lead to any of the above incident severity
classes. Indeed, ESARR 4 recognises this by stating that the risk classification scheme only applies to an
overall safety performance at the national level and is not directly applicable to the classification of
individual hazards. It is therefore necessary to devise a scheme for setting safety requirements, which is
consistent with ESARR 4, and with the ALARP principle. Furthermore, when considering ATM
equipment we must remember that the TLS has to be apportioned between equipment and human
contributions. Because the proportion of ATM incidents due to human error is very high, as discussed
above, only about 5% of the TLS can be allocated to equipment. So the TLS for equipment hazards needs
to be something like 7.75 × 10^-10 per flight hour (or 3 × 10^-8 per operating hour).
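The arithmetic behind these two figures can be written out as follows, taking the 5% equipment share and the example centre (one million flights per year, 20 minutes per flight) from the text and assuming continuous operation of 8760 hours per year. The symbol N for flight hours per operating hour is introduced here for illustration.

```latex
\begin{align*}
\mathrm{TLS}_{\text{equip}} &= 0.05 \times 1.55 \times 10^{-8}
   = 7.75 \times 10^{-10} \text{ per flight hour} \\
N &= \frac{10^{6} \times (20/60)}{8760} \approx 38 \text{ flight hours per operating hour} \\
\mathrm{TLS}_{\text{equip, centre}} &= N \times \mathrm{TLS}_{\text{equip}}
   \approx 38 \times 7.75 \times 10^{-10} \approx 3 \times 10^{-8} \text{ per operating hour}
\end{align*}
```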
How is this problem to be solved? Various methods can be used, including consequence modelling and
calibration of a suitable risk classification matrix. These are discussed in the next two sections.
A point to note here is that ESARR 4 does not allow certain systems classed as “safety nets” to be taken
into account in determining quantitative safety requirements for ATM systems or equipment. A case in
point is the Traffic Alert and Collision Avoidance System or TCAS. This is a system which detects the
presence of another aircraft and warns the crew if it is likely to approach too closely; if both aircraft are
TCAS equipped the systems negotiate a resolution advisory (RA) which the aircrew should obey. TCAS
is a very effective system in that it frequently prevents infringements of separation turning into more
serious incidents. However, in the Überlingen accident it was a contributory factor in the accident
sequence since the RA issued to one of the aircraft contradicted the instruction of the ATCO on duty and
the pilot chose to obey the ATCO rather than the TCAS. It seems reasonable therefore not to take credit
for this system.
Event Tree Modelling
One approach to the application of ESARR 4 to equipment is to carry out detailed consequence modelling
for each ATM equipment failure hazard to determine the range of possible outcomes in terms of their
severity class, and the relative probability of each outcome. Using this information a safety requirement
can be stated which would ensure that the tolerable hazard rates at the ESARR 4 level would be met.
The event tree method is well suited for this purpose. The initiating event is the hazard (which will be
some equipment failure mode such as loss of the radar display). The subsequent events in the tree
represent the success or failure of the various mitigating factors (barriers to escalation as they are
sometimes known) that stand between the hazard at this level and the various outcomes, expressed as
ESARR 4 hazard severities.
The following example of an event tree illustrates how the method might be used. The system being
considered is an electronic flight data display for airport control towers, which gives the tower controller
information such as the destination, callsign, SSR transponder code, take-off time and departure route of
each aircraft. The system is intended to assist the controller by deciding the sequence of departures and
the time at which each aircraft can be cleared to take off. A crucial factor in this calculation is the
wake turbulence category of each aircraft. It is well known that aircraft create severe vortices behind
them when taking off, and this can affect the stability of the following aircraft. A certain time has
therefore to be left between each aircraft to let the resulting turbulence subside.
Aircraft are divided into four wake turbulence categories depending on their weight, and there are
mandatory time separations to be applied between take-offs depending on the wake turbulence categories
of the leading and following aircraft. The hazard to be considered is “calculation of incorrect take-off
times with respect to wake turbulence categories”. In this case the hazard could lead directly to an accident;
for example, a light turboprop aircraft taking off a minute behind a Boeing 747 could easily lose stability
and crash into the ground.
The event tree for this scenario is given in Figure 1 below. This is not intended to be a complete analysis
of the situation but is reasonably representative of the method; however, the actual success probabilities
are purely conjectural.
Figure 1 - Event Tree for Accident Scenario
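The way an event tree turns barrier success probabilities into outcome probabilities can be sketched as follows. The barriers, their probabilities and their mapping to ESARR 4 severity classes below are hypothetical values chosen for illustration; they are not the values from Figure 1. The accident budget of 3 × 10^-8 per operating hour is the equipment share derived earlier, and giving all of it to a single hazard is a simplification.

```python
from collections import defaultdict

# Hypothetical barriers, in the order in which they act after the hazard occurs:
# (description, probability that the barrier succeeds, severity class if it does).
BARRIERS = [
    ("controller notices the incorrect take-off time", 0.90, 5),
    ("pilot queries the clearance", 0.50, 4),
    ("following aircraft does not meet the wake", 0.90, 3),
    ("crew recovers from the wake encounter", 0.90, 2),
]
WORST_OUTCOME = 1  # all barriers fail: accident


def outcome_probabilities(barriers, worst_outcome):
    """Probability of each severity class, given that the hazard has occurred."""
    probabilities = defaultdict(float)
    p_reached = 1.0  # probability that every earlier barrier has failed
    for _description, p_success, severity in barriers:
        probabilities[severity] += p_reached * p_success
        p_reached *= 1.0 - p_success
    probabilities[worst_outcome] += p_reached
    return dict(probabilities)


def max_tolerable_hazard_rate(accident_budget, p_accident_given_hazard):
    """Hazard rate at which the resulting accident rate just meets the budget."""
    return accident_budget / p_accident_given_hazard


outcomes = outcome_probabilities(BARRIERS, WORST_OUTCOME)
hazard_rate = max_tolerable_hazard_rate(3e-8, outcomes[WORST_OUTCOME])
for severity in sorted(outcomes):
    print(f"severity {severity}: {outcomes[severity]:.4f}")
print(f"maximum tolerable hazard rate: {hazard_rate:.1e} per operating hour")
```

With these invented numbers the hazard leads to an accident about once in 2000 occurrences, so the hazard itself could be tolerated at roughly 6 × 10^-5 per operating hour if it were the only contributor.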
There are some problems with using an event tree in this way. The first is that assigning probabilities to
the intermediate events is often difficult. For example, it may be hard to assign a credible success
probability to an event involving human behaviour such as the detection of an anomalous situation, or the
application of a corrective action. Unless good human error statistics have been kept, which is not often
the case, it may be necessary to rely on expert judgement, which can be contentious especially if the
situation being analysed is only infrequently encountered. In other cases it may be necessary to carry out
detailed separation infringement or collision probability modelling to determine the relative likelihood of
the ESARR 4 severity classes. An example of collision probability modelling using simulation and expert
judgement is given in an airspace risk assessment conducted by Airservices Australia (ref. 5), although
this was not concerned with assigning safety requirements to ATM equipment.
The other problem is simply that developing an event tree or other consequence model for every
equipment failure mode can be very time consuming.
Calibrated Risk Classification Matrix
In this approach, a conventional risk classification matrix as shown in Table 2 is used, but the hazard rate
figures are adjusted to be consistent with the TLS for the ATS unit in question (converted to units of
accidents per operating hour). Once this calibration of the risk classification matrix has been achieved,
individual projects can use it to derive safety requirements in terms of hazard occurrence rates without
performing detailed modelling.
A simple method of calibration is to consider a Class 1 hazard at the equipment level (inability to provide
any form of ATC service, see Table 1) and decide the relative likelihood of an ESARR 4 severity class 1
incident, in other words an aircraft accident. This will then set the maximum tolerable occurrence rate for
an equipment Class 1 hazard (the maximum occurrence rate which will still achieve a Class B risk).
Hazards of lesser severity are then assigned successively less stringent targets (higher tolerable rates), usually with an order of
magnitude step for each severity class, as shown in Table 2.
Even if an equipment class 1 hazard occurs (which is normally a complete loss of communications
between ATC and aircraft), and the aircraft are left to proceed without instructions from ATC, there is
still only a very small chance that a mid-air collision will occur. Informally, this is mainly because the sky is a
big place and aircraft are relatively small, and because the aircraft will generally be properly separated
before the system failure occurs. Aircrew can also mitigate the collision risk by other means such as
contacting other ATM service providers (such as major airports or adjacent centres) for emergency traffic
information, and increased vigilance. They can in practice also rely on TCAS but as noted above this
cannot be taken into account. Modelling work and incident (AIRPROX) statistics indicate that there are
at least two orders of magnitude of mitigation between the Class 1 failure and a mid-air collision, and
possibly as much as three orders of magnitude.
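A minimal sketch of this calibration, assuming the equipment accident budget of 3 × 10^-8 per operating hour derived earlier and one order of magnitude between adjacent severity classes, is given below. The function name is illustrative, and the sketch treats each class as if it had a single hazard.

```python
def calibrate_targets(accident_budget_per_hour, mitigation_factor, n_classes=4):
    """Maximum tolerable hazard rate per equipment severity class (1 = most severe).

    mitigation_factor is the assumed ratio between the rate of Class 1 equipment
    hazards and the rate of resulting accidents (100 to 1000 in the text).
    Each less severe class is allowed a rate ten times higher than the one before.
    """
    class1_rate = accident_budget_per_hour * mitigation_factor
    return {severity: class1_rate * 10 ** (severity - 1)
            for severity in range(1, n_classes + 1)}


for factor in (100, 1000):
    targets = calibrate_targets(3e-8, factor)
    print(factor, {severity: f"{rate:.0e}" for severity, rate in targets.items()})
```

With two orders of magnitude of mitigation the Class 1 limit is about 3 × 10^-6 per operating hour, which falls in the 10^-5 to 10^-6 band that Table 2 shows as class B for severity 1.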
As with all risk classification and safety requirements derivation schemes, it is inappropriate to consider
the risk from each hazard individually, because the contribution to the overall achieved incident and
accident rate is the sum of the incident rates from each hazard (assuming that all the hazards can occur
independently). Treating hazards individually in this way is sometimes referred to as “salami slicing risk”.
Generally, some assumption is made about the total number of such hazards that could occur, and the
failure rate bands in the risk classification matrix are set accordingly. For example, it may be assumed
that there is only one system which could cause a Class 1 failure, and a further 10 systems each with 5
Class 2 and 5 Class 3 hazards. However, this kind of assumption needs to be re-visited regularly to cater
for changes to the equipment in use, as discussed in the next section.
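The effect of summing over hazards can be illustrated with the example counts above (one Class 1 hazard, and 10 systems each with 5 Class 2 and 5 Class 3 hazards). The sketch below makes two further assumptions that are not in the text: the accident budget is shared equally between the severity classes, and the probability of an accident given a hazard falls by a factor of ten per class, starting from 10^-2 for Class 1.

```python
def per_hazard_targets(accident_budget, hazard_counts, p_accident_given_hazard):
    """Tolerable rate for each individual hazard so that the summed accident
    rate over all hazards stays within the budget."""
    # Assumption: each severity class receives an equal share of the budget.
    share_per_class = accident_budget / len(hazard_counts)
    return {
        severity: share_per_class / (count * p_accident_given_hazard[severity])
        for severity, count in hazard_counts.items()
    }


hazard_counts = {1: 1, 2: 10 * 5, 3: 10 * 5}   # counts from the example in the text
p_accident = {1: 1e-2, 2: 1e-3, 3: 1e-4}       # hypothetical conditional probabilities
targets = per_hazard_targets(3e-8, hazard_counts, p_accident)

# Check: the accident rate summed over every hazard equals the budget.
total = sum(hazard_counts[s] * targets[s] * p_accident[s] for s in hazard_counts)
print(targets, total)
```

Because there are 50 hazards in Class 2, each one receives a tighter target (2 × 10^-7 per hour) than the single Class 1 hazard (10^-6 per hour) under these assumptions, which shows why the count of hazards has to be revisited when equipment changes.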
Altering the balance of risk between equipment and people
As ATM equipment becomes more advanced and offers ATC staff more tools to manage traffic, an
increasing number of hazards can be created by equipment failures simply due to the number of new
functions that are provided and could then fail. This could result in an increased contribution to the risk
budget from equipment failures, and in the end may require a re-calibration of the risk classification
matrix. However, the new features and functions may well improve human error rates, and therefore
reduce the overall contribution from human error. Since human error contributes something like 95% of
the overall risk, any improvements in this area should result in an overall risk reduction even if the
equipment contribution rises somewhat.
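A simple calculation shows how this trade-off works. The 95% and 5% shares come from the text; the assumed changes (equipment risk doubling, human error risk falling by 10%) are purely illustrative.

```python
def relative_total_risk(human_share, equipment_share, human_factor, equipment_factor):
    """Total risk relative to today's (1.0) after scaling each contribution."""
    return human_share * human_factor + equipment_share * equipment_factor


# Equipment contribution doubles while human error risk falls by 10 percent.
after = relative_total_risk(0.95, 0.05, human_factor=0.90, equipment_factor=2.0)

# Reduction in human error risk needed just to offset a doubling of equipment risk.
break_even_reduction = 0.05 * (2.0 - 1.0) / 0.95

print(f"relative total risk: {after:.3f}")
print(f"break-even human error reduction: {break_even_reduction:.1%}")
```

Under these assumptions the total risk falls to about 95.5% of its present value, and a reduction in human error of only a little over 5% is enough to offset a doubling of the equipment contribution.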
Conclusions
The EUROCONTROL requirements in ESARR 3 and ESARR 4 set out the overall risk
tolerability framework for ATM in Europe. Although ESARR 4 is very valuable in providing a numeric
target for the ATM contribution to aircraft accidents, it does not provide detailed requirements for the
control of less severe hazards, and individual ATS units must apply it with care and thought to their own
situations. Moreover, ESARR 3 and 4 also call for risk to be reduced so far as is reasonably practicable (ALARP). Proper application of
ESARRs 3 and 4 should therefore result in a European ATM system which is very safe, and equally safe
regardless of which country’s airspace is being traversed.
References
1. EUROCONTROL SRC, ESARR 4, Risk Assessment and Mitigation in ATM, Edition 1.0, 2001.
2. Bundesstelle für Flugunfalluntersuchung, Investigation Report AX001-1-2/03. Braunschweig,
2004.
3. EUROCONTROL SRC, ESARR 3, Use of Safety Management Systems by ATM Service Providers,
Edition 1.0, 2000.
4. Health and Safety Executive, Reducing Risks, Protecting People: HSE’s Decision Making Process.
London: 2001.
5. Airservices Australia, Airspace Risk Assessment, Class E over Class D Towers, Version 1.0, 2004,
http://www.airservicesaustralia.com/pilotcentre/nas/.
Biography
R. H. Pierce, MSc., Consulting Engineer, CSE International Ltd, Glanford House, Flixborough,
Scunthorpe DN15 8SN, UK. Telephone - +44 1724 862169, facsimile - +44 1724 856256, e-mail -
ron.pierce@cse-international.com.
Mr. Pierce has extensive experience in software engineering topics (compilers, program analysis tools and
software engineering methods). He has over 12 years' experience in software and system safety
assessment for industry domains including air traffic management and railway control and signalling
systems.