12 Reliability, Availability and Serviceability (RAS)

Note: In this chapter, the term RAS fault is used to distinguish a fault in the sense defined by the RAS architecture from an SMMU translation fault.

The SMMU might support optional RAS features as part of the overall Arm RAS System Architecture, which is described in [11].

The Arm RAS System Architecture:
• Describes the following concepts:
  – Faults, errors, and failures.
  – Error correction, error propagation, poisoning, and error containment.
• Defines the following classifications for errors:
  – Corrected error (CE).
  – Deferred error (DE).
  – Uncorrected error (UE), which is further classified as Uncontainable (UC), Unrecoverable (UEU), Recoverable (UER), or Restartable (UEO).
• Defines standards for:
  – Error recovery and fault handling interrupts.
  – A standard error record, with a memory-mapped interface for a group of error records for one or more components (nodes).

Arm recommends that SMMUv3 implementations that include RAS features implement the RAS System Architecture, as specified in [11]. When RAS is supported, RAS fault handling registers that conform to the Arm RAS System Architecture are present, and errors are recorded into them.

ARM IHI 0070 H.a Copyright © 2016-2026 Arm Limited or its affiliates. All rights reserved. Non-confidential

If the SMMU does not support RAS but the system does, it is possible for the SMMU to consume an external error which is presented to the SMMU driver software as an external abort.

The SMMU might, but is not required to, record errors that it does not itself consume or detect, such as those reported directly from the system to a client device, for example, a device reading a memory location that translates without issue, but ultimately causes the device to consume corrupt data.

12.1 Error propagation, consumption and containment in the SMMU

This guidance section relates SMMU-specific concepts to the Arm RAS System Architecture.

Generally, SMMU activity can be considered to be driven on demand from incoming transactions from client devices. Consequently, an external transaction might cause the SMMU to consume errors from:
1. Internal state errors.
2. External state errors, for example reading a translation table entry, configuration structure or command with an associated error.

If the SMMU consumes an error while performing translation for an external transaction, containment of the error can be achieved by deferring the error to the source of the transaction. This might involve terminating the transaction with an abort, or otherwise returning error information to the client.

A detected error that is silently propagated might lead to silent data corruption (SDC). The termination of a transaction that could be affected by an SMMU error ensures isolation of the error and represents a non-silent error propagation from the SMMU to the client device.

When the SMMU consumes an error from either internal or external state, Arm expects the implementation to report the error and enter error recovery in addition to containment by deferring the error.

Note: An example of an error propagation on write is a write of queue records affected by the error. The error might be propagated with affected data. An implementation might non-silently propagate the error so that the system can poison the data in memory.

An error can be consumed by the SMMU in a way that can affect external or internal state. This includes:
1. Reading a register containing an error in order to calculate an address for access to a queue entry.
2. Reading a register containing an error in order to write the data of a written queue entry.
3. Consuming an error from the system when reading from the Command queue.
4. Reading a cache entry containing an error in response to an incoming transaction.
5. Reading a register containing an error in response to an incoming transaction.
6. Consuming an error from the system when fetching data into a cache in response to an incoming transaction.

Containment of these errors avoids silently affecting external or internal state. For the corresponding scenarios in the previous list, the effect of the error is:
1. Internal consistency is lost.
2. The error might be transient (for example, it is overwritten when the queue entry is written), in which case internal consistency is not lost. The error is deferred into the memory system (error on write) and an implementation might record the error. If the error is non-transient, internal consistency might be lost.
3. An implementation might have unsuccessfully tried to correct the error, or might accept the error. However, internal consistency is not lost.
4. The transaction (and therefore other agents in the system) would be affected, and isolation broken, unless the transaction is terminated. The error can therefore be deferred to the client. SMMU internal consistency might not be lost.
5. An error in a register might not be correctable. The error can be deferred to the client and recorded. Internal consistency might have been lost if the register could affect future transactions.
6. The transaction would be affected by the error, unless the transaction is terminated (deferring the error to the client). SMMU internal consistency might not be lost.

If internal consistency is lost, invoking a Service Failure Mode (SFM) can isolate further errors from the system. This might reduce the severity of the error by ceasing subsequent duplicate error reports caused by the same failure, and by reducing the chance that the loss of internal consistency can silently propagate an error.
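The six consumption scenarios and their stated effects can be tabulated in code, which may help when reasoning about which cases are candidates for Service Failure Mode. The following is a minimal sketch, not part of the architecture: the names `Consistency`, `Scenario`, `SCENARIOS` and `sfm_candidates` are hypothetical, and the policy of treating "lost" and "might be lost" as SFM candidates is an assumption made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Consistency(Enum):
    """Effect on SMMU internal consistency, using the wording of the list above."""
    LOST = "lost"
    MIGHT_BE_LOST = "might be lost"
    MIGHT_NOT_BE_LOST = "might not be lost"
    NOT_LOST = "not lost"


@dataclass(frozen=True)
class Scenario:
    description: str
    consistency: Consistency
    deferred_to: Optional[str]  # where the text says the error can be deferred, if stated


# Hypothetical table keyed by the scenario numbers used in the text.
SCENARIOS = {
    1: Scenario("register error while calculating a queue entry address",
                Consistency.LOST, None),
    # Scenario 2 is "not lost" if the error is transient; the non-transient case is modeled here.
    2: Scenario("register error while writing queue entry data",
                Consistency.MIGHT_BE_LOST, "memory system"),
    3: Scenario("system error while reading the Command queue",
                Consistency.NOT_LOST, None),
    4: Scenario("cache entry error on an incoming transaction",
                Consistency.MIGHT_NOT_BE_LOST, "client"),
    5: Scenario("register error on an incoming transaction",
                Consistency.MIGHT_BE_LOST, "client"),
    6: Scenario("system error while filling a cache for an incoming transaction",
                Consistency.MIGHT_NOT_BE_LOST, "client"),
}


def sfm_candidates():
    """Return scenario numbers where consistency is lost or might be lost.

    This selection rule is an illustrative assumption, not an architectural requirement.
    """
    risky = {Consistency.LOST, Consistency.MIGHT_BE_LOST}
    return [number for number, s in SCENARIOS.items() if s.consistency in risky]
```

Under these assumptions, `sfm_candidates()` selects scenarios 1, 2 and 5, while scenarios 4, 5 and 6 are the ones the text describes as deferrable to the client.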

12.2 Error consumption visible through the SMMU programming interface

On consumption of an external error, an SMMU supporting RAS is expected to report a RAS event through the RAS register interface. Alternatively, consumption of an external error by an SMMU that does not support RAS presents it with an external abort. In both cases, the SMMU records the event through the Event queue (pertaining to the Security state of the transaction that caused the error to be consumed), SMMU_(*_)GERROR or SMMU_ROOT_GPT_CFG_FAR as appropriate:

• Events:
  – F_WALK_EABT when a translation table walk consumes an error.
  – F_STE_FETCH when an STE fetch consumes an error.
  – F_CD_FETCH when a CD fetch consumes an error.
  – F_VMS_FETCH when a VMS fetch consumes an error.
• SMMU_(*_)GERROR errors:
  – SMMU_(*_)GERROR.CMDQ_ERR triggers and CERROR_ABT is reported in SMMU_(*_)CMDQ_CONS.ERR when a command fetch consumes an error.
  – SMMU_(*_)GERROR.PRIQ_ABT_ERR triggers when a PRI queue access is aborted because of an external error.
  – SMMU_(*_)GERROR.EVENTQ_ABT_ERR triggers when an Event queue access is aborted because of an external error.
  – SMMU_(*_)GERROR.DPT_ERR triggers and DPT_EABT is reported in SMMU_(R_)DPT_CFG_FAR when a DPT lookup consumes an error.
  – SMMU_(*_)GERROR.CMDQP_ERR triggers and a CERROR_ABT is reported in SMMU_(*_)ECMDQ_CONSn.ERR when a command fetch from an ECMDQ consumes an error.
  – SMMU_(*_)GERROR.DCMDQP_ERR triggers when any of the following apply:
    * A DCMDQ fetch consumes an error.
    * The translation required by a DCMDQ transaction consumes an error.
      Note: This is also recorded as an Event through the Event queue.
    * The SID translation required by a DCMDQ transaction consumes an error.
  – SMMU_(*_)GERROR.HDBSS_ERR triggers when SMMU_(*_)HDBSS_PRODn.ERR == 0b01.
  – SMMU_(*_)GERROR.HACDBS_ERR triggers when SMMU_(*_)HACDBS_CONS.ERR == 0b01.
• SMMU_ROOT_GPT_CFG_FAR errors:
  – SMMU_ROOT_GPT_CFG_FAR.CFG_ERR == 0x2, SMMU_ROOT_GPT_CFG_FAR.REASON = 0b10 and SMMU_ROOT_GPT_CFG_FAR.FAULTCODE = OTHER_GPF when a GPT entry fetch consumes an error.
    SMMU_ROOT_GPT_CFG_FAR.FAULTCODE = DCMDQ_GPF when a DCMDQ fetch experiences a GPF.
    SMMU_ROOT_GPT_CFG_FAR.FAULTCODE = HDBSS_GPF when an access to an HDBSS experiences a GPF.
    SMMU_ROOT_GPT_CFG_FAR.FAULTCODE = HACDBS_GPF when an access to the HACDBS experiences a GPF.
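Diagnostic software might find it convenient to hold the list above as a lookup from the access that consumed the error to the mechanism that records it. The following is a minimal sketch of such a table. The dictionary keys (for example `"ste_fetch"`), the `Report` type and the `where_reported` helper are hypothetical names chosen for illustration; only the event names, global error names and fault codes come from the list above, and the entries for HDBSS and HACDBS are omitted for brevity.

```python
from typing import NamedTuple, Optional


class Report(NamedTuple):
    mechanism: str          # "Event queue", "GERROR" or "SMMU_ROOT_GPT_CFG_FAR"
    name: str               # event, global error or fault code name
    detail: Optional[str]   # additional syndrome location, if the text names one


# Hypothetical keys; values follow the list in this section.
REPORTING = {
    "translation_table_walk": Report("Event queue", "F_WALK_EABT", None),
    "ste_fetch": Report("Event queue", "F_STE_FETCH", None),
    "cd_fetch": Report("Event queue", "F_CD_FETCH", None),
    "vms_fetch": Report("Event queue", "F_VMS_FETCH", None),
    "cmdq_fetch": Report("GERROR", "CMDQ_ERR", "CERROR_ABT in CMDQ_CONS.ERR"),
    "priq_access": Report("GERROR", "PRIQ_ABT_ERR", None),
    "eventq_access": Report("GERROR", "EVENTQ_ABT_ERR", None),
    "dpt_lookup": Report("GERROR", "DPT_ERR", "DPT_EABT in DPT_CFG_FAR"),
    "ecmdq_fetch": Report("GERROR", "CMDQP_ERR", "CERROR_ABT in ECMDQ_CONSn.ERR"),
    "dcmdq_fetch": Report("GERROR", "DCMDQP_ERR", None),
    "gpt_entry_fetch": Report("SMMU_ROOT_GPT_CFG_FAR", "OTHER_GPF", "FAULTCODE field"),
}


def where_reported(access: str) -> Report:
    """Look up how an external error consumed by `access` is made visible.

    Raises KeyError for access types this sketch does not cover.
    """
    return REPORTING[access]
```

For example, `where_reported("cmdq_fetch")` yields the `CMDQ_ERR` global error together with the `CERROR_ABT` code location, mirroring the first global error bullet.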

12.3 Service Failure Mode (SFM)

If internal consistency is known to have been lost (for example, detection of a UE in internal register state), or there is a likelihood that it has been lost, and the functionality of the SMMU will be impaired or would risk silent data corruption or silent propagation of errors, Service Failure Mode must be entered. The SMMU must terminate client transactions after this mode is entered, and stop accessing its queues. If an SMMU is in Service Failure Mode (SFM) it responds to PCIe requests as CA wherever possible.

Arm recommends that the SMMU registers are still readable in this mode to aid diagnosis.

The mechanism to exit or recover from this mode is IMPLEMENTATION DEFINED, but must include system reset.

Note: An implementation might have specific isolation features or safety guarantees. For example, a partitioning system in which some client devices are guaranteed to be unaffected by a loss of consistency in a different portion of the SMMU. When a Detected Uncorrected Error occurs in an isolated manner like this, the SMMU does not enter the general Service Failure Mode, and does not raise the SMMU_(S_)GERROR.SFM_ERR error.

Entry to Service Failure Mode is signaled by all of the following means:
• The Global Errors SMMU_GERROR.SFM_ERR and SMMU_S_GERROR.SFM_ERR are triggered for both Non-secure and Secure programming interfaces (if implemented).
• An IMPLEMENTATION DEFINED notification such as recording syndrome into RAS registers and asserting an Error Recovery Interrupt or system-wide error interrupt.

Diagnosis of the reason for entering SFM is made through IMPLEMENTATION DEFINED means.
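The externally visible behavior of Service Failure Mode can be summarized with a small behavioral model. The following is a sketch under stated assumptions, not a description of any implementation: the class `SmmuModel` and its method names are hypothetical, the SFM_ERR global errors are modeled as simple booleans rather than through the real global error register mechanism, and system reset is the only exit path modeled because it is the only one the text guarantees.

```python
class SmmuModel:
    """Toy model of the SFM behavior described in this section (hypothetical names)."""

    def __init__(self, secure_implemented: bool = True):
        self.secure_implemented = secure_implemented
        self.in_sfm = False
        self.gerror_sfm_err = False      # models SMMU_GERROR.SFM_ERR being triggered
        self.s_gerror_sfm_err = False    # models SMMU_S_GERROR.SFM_ERR being triggered

    def enter_sfm(self) -> None:
        """Enter SFM and signal it on every implemented programming interface."""
        self.in_sfm = True
        self.gerror_sfm_err = True
        if self.secure_implemented:
            self.s_gerror_sfm_err = True

    def handle_client_transaction(self, is_pcie: bool) -> str:
        """Client transactions are terminated in SFM; PCIe requests get CA where possible."""
        if self.in_sfm:
            return "CA" if is_pcie else "abort"
        return "translate"

    def may_access_queues(self) -> bool:
        """The SMMU stops accessing its queues once SFM is entered."""
        return not self.in_sfm

    def registers_readable(self) -> bool:
        """Recommended, not required: registers stay readable to aid diagnosis."""
        return True

    def system_reset(self) -> None:
        """System reset must be one way to leave SFM; other exits are IMPLEMENTATION DEFINED."""
        self.in_sfm = False
        self.gerror_sfm_err = False
        self.s_gerror_sfm_err = False
```

The model deliberately omits the isolated-partition case from the Note, in which the general Service Failure Mode is not entered and SFM_ERR is not raised.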

12.4 RAS fault handling/reporting

When RAS facilities are implemented, an implementation must provide at least one group of memory-mapped error recording registers in accordance with the RAS error record format defined in the RAS System Architecture [11].

The exact layout and operation of the RAS registers is IMPLEMENTATION DEFINED, including:
• Discovery and identification registers.
• The number of RAS error records and association to nodes.
• Whether Corrected error counters are implemented.

Error Recovery Interrupts and Fault Handling Interrupts must be provided.

Note: Interrupts in SMMUv3 are required to be edge-triggered or MSIs. However, interrupts for SMMUv3 RAS features comply with [11] which states it is IMPLEMENTATION DEFINED whether interrupt requests are edge-triggered or level-sensitive.

The mechanism for determining whether RAS facilities are implemented, base addresses for RAS registers and the extent of RAS register frames is IMPLEMENTATION DEFINED.

Note: For example, IMPLEMENTATION DEFINED identification registers or firmware descriptions.

One RAS register interface might be provided for each supported Security state, or a subset of the Security states, subject to the constraints described in the RAS System Architecture [11].

12.5 Confidential information in RAS Error Records

The RAS System Architecture [11] provides implementation requirements for RAS in system components using the Arm RAS architecture. It also introduces the concept of Confidential Data for platforms with FEAT_RME, and requirements on the content that can be recorded in RAS error record registers.

In an SMMU with RME, the same requirements apply to RAS Error Record registers in that SMMU.

12.6 Recommendations for reporting of SMMU events in RAS registers

This section is only a recommendation and is not normative. The RAS System Architecture [2] is the authoritative source.

12.6.1 SMMU architectural events

In all of the below, the following approach is used:
• Reporting of the MV field is not captured here.
• Setting the OF field should be performed according to the rules in the RAS System Architecture [2].
• All the error record values in this chapter are captured as though there were no prior errors reported in the error record register. Irrelevant fields are described as “N/A” for “not applicable”.
• Implementations must follow the rules in the RAS System Architecture [2] around reporting multiple errors in the same record register.
• None of the errors listed here should be critical, so CI == 0 or unchanged in all the below.

12.6.1.1 Deferred error on structure fetch

RAS event: SMMU receives a Deferred error (for example, a poisoned read response) on configuration structure or translation table fetch, leading to F_WALK_EABT, F_STE_FETCH, F_CD_FETCH, F_VMS_FETCH.

Signaled to requester as: CA for PCIe requests. Equivalent “Abort” on other protocols.

Note: The RAS architecture permits a mechanism where the value of ERRCTRL.{UE,RUE,WUE} might suppress signaling of External Aborts. Suppressing these External Aborts is not permitted for the SMMU.

Implementation notes: Some SMMUs or memory systems may not support Deferred errors (poison) and therefore cannot detect and report this error.

Reported in RAS error record as:

| ERRSTATUS field | Notes |
| --- | --- |
| AV == 1 | If the physical address details for the error are reported in ERRADDR. |
| AV == 0 | If ERRADDR is RES0 or not updated. |
| V == 1 | The record is valid. |
| UE == 1 | There was an Uncorrected error. |
| ER == 1 | The SMMU is required to signal the error to the requester as an abort, or CA for PCIe. |
| PN == 1 | The SMMU observed the Deferred error. |
| UET == 0b11 | The SMMU Signaled the error to the client and it is Recoverable. |
| SERR == 21 | The SMMU cannot further defer the error. |
| DE, CE, OF, CI | N/A |

12.6.1.2 Uncorrectable error on structure fetch

RAS event: SMMU detects Uncorrectable (i.e. not Deferred) error on configuration structure or translation table fetch, leading to F_WALK_EABT, F_STE_FETCH, F_CD_FETCH, F_VMS_FETCH.

Implementation notes: Some SMMUs may not be able to distinguish a regular External Abort from a RAS error and therefore cannot detect and report this error.

Signaled to requester as: CA for PCIe requests. Equivalent “Abort” on other protocols.

Note: The RAS architecture permits a mechanism where the value of ERRCTRL.{UE,RUE,WUE} might suppress signaling of External Aborts. Suppressing these External Aborts is not permitted for the SMMU.

Reported in RAS error record as:

| ERRSTATUS field | Notes |
| --- | --- |
| AV == 1 | If the physical address details for the error are reported in ERRADDR. |
| AV == 0 | If ERRADDR is RES0 or not updated. |
| V == 1 | The record is valid. |
| UE == 1 | There was an Uncorrected error. |
| ER == 1 | The SMMU is required to signal the error to the requester as an abort, or CA for PCIe. |
| PN == 0 | The SMMU did not observe any poison. |
| UET == 0b11 | The SMMU Signaled the error to the client and it is Recoverable. |
| SERR == 12 | The error was on data from memory. |
| DE, CE, OF, CI | N/A |

12.6.1.3 Error on Command queue fetch

RAS event: SMMU experiences a RAS error on a CMDQ fetch, leading to SMMU_(*_)GERROR.CMDQ_ERR with CERROR_ABT reported in SMMU_(*_)CMDQ_CONS.ERR.

Implementation notes: Some SMMUs may not be able to distinguish a regular External Abort from a RAS error and therefore cannot detect and report this error.

Reported in RAS error record as:

| ERRSTATUS field | Notes |
| --- | --- |
| AV == 1 | If the physical address details for the error are reported in ERRADDR. |
| AV == 0 | If ERRADDR is RES0 or not updated. |
| V == 1 | The record is valid. |
| UE == 1 | There was an Uncorrected error. |
| ER == 0 | The error was not signaled as an External Abort by the SMMU. |
| PN == 0 or 1 | Depending on whether the SMMU saw corrupt data vs poisoned response. |
| UET == 0b11 | The SMMU Signaled the error to the client and it is Recoverable. |
| SERR IN {12, 21} | 12 corrupted data, 21 for poisoned data. |
| DE, CE, OF, CI | N/A |

12.6.2 Common SMMU microarchitectural events

12.6.2.1 ECC or EDC error on TLB or configuration cache

RAS event: SMMU needs to use a TLB or Configuration Cache entry but detects that it has been corrupted. This is a Latent error as it does not need to be consumed. In the case of an ECC error, the SMMU corrects the entry. In the case of an EDC error, the SMMU invalidates the entry and performs a fresh configuration fetch or translation table walk.

Implementation notes: Some SMMUs may not have ECC or EDC on TLBs or configuration caches and therefore could never detect this error.

Reported in RAS error record as:

| ERRSTATUS field | Notes |
| --- | --- |
| AV == 0 | The entry is not associated with an address any more. |
| V == 1 | The record is valid. |
| ER == 0 | No error to signal. |
| CE != 0b00 | The error was corrected. |
| SERR IN {1, 6, 7, 8, 9} | As appropriate. |
| UE, DE, OF, UET, CI, PN | N/A |

12.6.2.2 Error on data payload of a client transaction

RAS event: SMMU observes corruption on the data payload of a transaction.

Implementation notes: Some SMMUs may not have visibility of the data payload at all and therefore cannot detect this error. Even if an SMMU can detect this error, it is IMPLEMENTATION DEFINED whether it reports it. Some SMMUs or memory systems may not support poison. An example range of implementation styles is listed in the second table below.

Reported in RAS error record as:

| ERRSTATUS field | Notes |
| --- | --- |
| AV == 1 | If the physical address details for the error are reported in ERRADDR. |
| AV == 0 | If ERRADDR is RES0 or not updated. |
| V == 1 | The record is valid. |
| CI == 0 | The error is localized and SMMU can continue operation. |
| CE, OF, CI | N/A |

| Scenario | SMMU action | ERRSTATUS values | SERR |
| --- | --- | --- | --- |
| SMMU does not observe data path | None | Not reported in SMMU RAS registers. | N/A |
| SMMU ignores poison on client transactions | None | Not reported in SMMU RAS registers. | N/A |
| Data is poisoned before it arrives at SMMU | Upgrade to Abort | UE == 1, ER == 1, DE N/A, PN == 1, UET == 0b11 | 10 |
| Data is poisoned before it arrives at SMMU | Propagate poison | UE N/A, ER == 0, DE == 1, PN == 1 | 10, 23, 24 |
| Data corrupted in SMMU data buffer | Upgrade to Abort | UE == 1, ER == 1, DE N/A, PN == 0, UET == 0b11 | 2 |
| Data corrupted in SMMU data buffer | Propagate poison | UE N/A, ER == 0, DE == 1, PN == 0 | 2 |
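The four reporting rows of the scenario table for data payload errors follow a regular pattern: the SMMU action selects the UE, ER, DE and UET values, and the origin of the corruption selects PN and the SERR candidates. The following is a minimal sketch of that pattern. The function `payload_error_record` and its parameter names are hypothetical, `None` is used to stand for "N/A", and the sketch encodes only the example values recommended in this section rather than any normative rule.

```python
def payload_error_record(poisoned_on_arrival: bool, action: str) -> dict:
    """Return recommended ERRSTATUS field values for a data payload error.

    poisoned_on_arrival: True if the data was poisoned before reaching the SMMU,
                         False if it was corrupted in an SMMU data buffer.
    action: "abort" (upgrade to Abort) or "propagate" (propagate poison).
    """
    pn = 1 if poisoned_on_arrival else 0

    if action == "abort":
        # Upgrade to Abort: Uncorrected error, signaled to the requester.
        fields = {"UE": 1, "ER": 1, "DE": None, "PN": pn, "UET": 0b11}
    elif action == "propagate":
        # Propagate poison: error is deferred, no abort is signaled.
        # UET does not appear in the propagate rows, so it is left out here.
        fields = {"UE": None, "ER": 0, "DE": 1, "PN": pn}
    else:
        raise ValueError(f"unknown action: {action!r}")

    if poisoned_on_arrival:
        serr_candidates = [10] if action == "abort" else [10, 23, 24]
    else:
        serr_candidates = [2]

    return {"fields": fields, "serr_candidates": serr_candidates}
```

The two rows in which the SMMU does not observe the data path, or ignores poison, are not modeled because nothing is reported in the SMMU RAS registers in those cases.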