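Every machine depends on memory to run. Normally there is no reason to suspect the memory hardware, but in the RAS (Reliability, Availability, Serviceability) field we cannot take it for granted. Memory fails in several ways, from large-scale damage to a single-bit flip. This article does not cover large-scale damage, since that is an unrecoverable defect. It covers one case: the hardware and software measures on aarch64 chips when a single-bit flip in memory corrupts data.
Parity checking is the oldest of these techniques. It shows up on simple serial links used by early microcontrollers (UART frames, for example). Most readers will have met it before, so this is only a quick recap.
Parity adds one check bit to a group of data bits. The check bit reflects the number of 1s in the data: with the convention used here (even parity), it is 1 when the data holds an odd number of 1s and 0 when it holds an even number.
If a single bit flips while the data is being transferred, the recomputed parity no longer matches the check bit, so the error is detected. An even number of flips cancels out and goes unnoticed.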
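The parity rule above can be sketched in a few lines of C. This is only an illustration for a single byte; the function names `parity_bit` and `parity_ok` are made up for this example and do not come from any driver or library.

```c
#include <stdbool.h>
#include <stdint.h>

/* Even-parity check bit for one byte: 1 if the byte holds an odd
 * number of 1s, 0 if it holds an even number. */
uint8_t parity_bit(uint8_t data)
{
    uint8_t ones = 0;

    for (int i = 0; i < 8; i++)
        ones += (data >> i) & 1u;
    return ones & 1u;
}

/* Receiver side: recompute the parity and compare it with the bit
 * that travelled alongside the data. */
bool parity_ok(uint8_t data, uint8_t stored_parity)
{
    return parity_bit(data) == stored_parity;
}
```

Flipping one bit of the data makes `parity_ok` return false, while flipping two bits leaves it true. That blind spot is one reason memory systems move on to ECC.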
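ECC stands for Error-Correcting Code (as in ECC memory). Parity can flag some errors on a simple data link, but it cannot repair them. ECC is a hardware technique that can also recover from a bit flip, and most current memory devices provide at least basic ECC support. ECC can be built on several different error-correcting algorithms.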
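To make "recover from a bit flip" concrete, here is a minimal sketch of one classic error-correcting code, Hamming(7,4), which protects 4 data bits with 3 parity bits and corrects any single flipped bit. It was chosen for this example only because it is small. The source does not say which code a given memory part uses, and the function names are hypothetical.

```c
#include <stdint.h>

/* Encode 4 data bits into a 7-bit codeword. Bit i of the result is
 * codeword position i + 1; positions 1, 2, 4 hold parity and
 * positions 3, 5, 6, 7 hold the data bits d0..d3. */
uint8_t hamming74_encode(uint8_t nibble)
{
    uint8_t d0 = nibble & 1u, d1 = (nibble >> 1) & 1u;
    uint8_t d2 = (nibble >> 2) & 1u, d3 = (nibble >> 3) & 1u;
    uint8_t p1 = d0 ^ d1 ^ d3;  /* covers positions 3, 5, 7 */
    uint8_t p2 = d0 ^ d2 ^ d3;  /* covers positions 3, 6, 7 */
    uint8_t p4 = d1 ^ d2 ^ d3;  /* covers positions 5, 6, 7 */

    return (uint8_t)(p1 | (p2 << 1) | (d0 << 2) | (p4 << 3) |
                     (d1 << 4) | (d2 << 5) | (d3 << 6));
}

/* Decode a codeword, correcting at most one flipped bit.
 * *syndrome_out (if not NULL) receives 0 when no error was seen,
 * otherwise the 1-based position of the bit that was corrected. */
uint8_t hamming74_decode(uint8_t cw, int *syndrome_out)
{
    int syndrome = 0;

    /* XOR of the positions of all set bits is 0 for a valid codeword
     * and equals the flipped position after a single-bit error. */
    for (int pos = 1; pos <= 7; pos++)
        if ((cw >> (pos - 1)) & 1u)
            syndrome ^= pos;

    if (syndrome != 0)
        cw ^= (uint8_t)(1u << (syndrome - 1));
    if (syndrome_out)
        *syndrome_out = syndrome;

    return (uint8_t)(((cw >> 2) & 1u) | (((cw >> 4) & 1u) << 1) |
                     (((cw >> 5) & 1u) << 2) | (((cw >> 6) & 1u) << 3));
}
```

Encoding a nibble, flipping any one of the seven bits and decoding returns the original nibble, with the syndrome naming the bit that was repaired. Two flipped bits are miscorrected by this small code, which is why practical schemes usually add an extra parity bit so that double errors are at least detected.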
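With the common hardware correction schemes for memory covered, the next question is how software handles an ECC error in a standardized way.
On the Arm architecture the software side comes down to two main mechanisms, which the rest of this article walks through.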
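For this class of error Arm has a dedicated concept, ESB (Error Synchronization Barrier), an instruction that synchronizes pending error reports so that they can be recorded.
The Arm specification describes ESB along these lines.
In short, ESB is the architecture's error synchronization barrier. An error that it defers is recorded in the special register DISR_EL1 (Deferred Interrupt Status Register), which can only be read from EL1 or a higher exception level.
ESB only has this effect when the RAS extension is implemented. Otherwise it executes as a NOP.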
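As an illustration of what software finds in DISR_EL1 after an ESB, the sketch below decodes a raw register value. It is user-space C that takes the value as a parameter, because reading the register itself needs EL1 privilege. The bit positions follow my reading of the Armv8 RAS register description (A in bit 31, IDS in bit 24, and AET, EA and DFSC in the low bits when IDS is 0); treat them as assumptions to check against the Arm ARM. The struct and function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define DISR_A    (UINT64_C(1) << 31)   /* an SError was deferred by ESB */
#define DISR_IDS  (UINT64_C(1) << 24)   /* syndrome is implementation defined */

struct deferred_serror {
    bool pending;    /* DISR_EL1.A */
    bool impdef;     /* DISR_EL1.IDS */
    unsigned aet;    /* asynchronous error type, bits [12:10] */
    unsigned ea;     /* external abort type, bit [9] */
    unsigned dfsc;   /* fault status code, bits [5:0] */
};

/* Split a raw DISR_EL1 value into its fields. The architected fields
 * are only filled in when the syndrome is not implementation defined. */
struct deferred_serror decode_disr(uint64_t disr)
{
    struct deferred_serror d = { false, false, 0, 0, 0 };

    d.pending = (disr & DISR_A) != 0;
    d.impdef = (disr & DISR_IDS) != 0;
    if (d.pending && !d.impdef) {
        d.aet = (unsigned)((disr >> 10) & 0x7);
        d.ea = (unsigned)((disr >> 9) & 0x1);
        d.dfsc = (unsigned)(disr & 0x3f);
    }
    return d;
}
```

A value of 0 decodes as "nothing deferred", which is also what software sees on a core without the RAS extension, where ESB does nothing.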
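On arm64 Linux, an ECC/Parity error is by default received through the fault handling in the mm code. The flow is as follows.
Start with the exception vector table: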
    SYM_CODE_START(vectors)
        kernel_ventry   1, sync_invalid             // Synchronous EL1t
        kernel_ventry   1, irq_invalid              // IRQ EL1t
        kernel_ventry   1, fiq_invalid              // FIQ EL1t
        kernel_ventry   1, error_invalid            // Error EL1t

        kernel_ventry   1, sync                     // Synchronous EL1h
        kernel_ventry   1, irq                      // IRQ EL1h
        kernel_ventry   1, fiq_invalid              // FIQ EL1h
        kernel_ventry   1, error                    // Error EL1h

        kernel_ventry   0, sync                     // Synchronous 64-bit EL0
        kernel_ventry   0, irq                      // IRQ 64-bit EL0
        kernel_ventry   0, fiq_invalid              // FIQ 64-bit EL0
        kernel_ventry   0, error                    // Error 64-bit EL0

    #ifdef CONFIG_COMPAT
        kernel_ventry   0, sync_compat, 32          // Synchronous 32-bit EL0
        kernel_ventry   0, irq_compat, 32           // IRQ 32-bit EL0
        kernel_ventry   0, fiq_invalid_compat, 32   // FIQ 32-bit EL0
        kernel_ventry   0, error_compat, 32         // Error 32-bit EL0
    #else
        kernel_ventry   0, sync_invalid, 32         // Synchronous 32-bit EL0
        kernel_ventry   0, irq_invalid, 32          // IRQ 32-bit EL0
        kernel_ventry   0, fiq_invalid, 32          // FIQ 32-bit EL0
        kernel_ventry   0, error_invalid, 32        // Error 32-bit EL0
    #endif
    SYM_CODE_END(vectors)
We take the EL0 sync exception as the example, because a synchronous memory error is raised through the sync vector:
kernel_ventry 0, sync // Synchronous 64-bit EL0
The corresponding entry code is:
    SYM_CODE_START_LOCAL_NOALIGN(el0_sync)
        kernel_entry 0
        mov     x0, sp
        bl      el0_sync_handler
        b       ret_to_user
    SYM_CODE_END(el0_sync)
It branches to el0_sync_handler, which is implemented as follows:
    asmlinkage void noinstr el0_sync_handler(struct pt_regs *regs)
    {
        unsigned long esr = read_sysreg(esr_el1);

        switch (ESR_ELx_EC(esr)) {
        case ESR_ELx_EC_SVC64:
            el0_svc(regs);
            break;
        case ESR_ELx_EC_DABT_LOW:
            el0_da(regs, esr);
            break;
        case ESR_ELx_EC_IABT_LOW:
            el0_ia(regs, esr);
            break;
        case ESR_ELx_EC_FP_ASIMD:
            el0_fpsimd_acc(regs, esr);
            break;
        case ESR_ELx_EC_SVE:
            el0_sve_acc(regs, esr);
            break;
        case ESR_ELx_EC_FP_EXC64:
            el0_fpsimd_exc(regs, esr);
            break;
        case ESR_ELx_EC_SYS64:
        case ESR_ELx_EC_WFx:
            el0_sys(regs, esr);
            break;
        case ESR_ELx_EC_SP_ALIGN:
            el0_sp(regs, esr);
            break;
        case ESR_ELx_EC_PC_ALIGN:
            el0_pc(regs, esr);
            break;
        case ESR_ELx_EC_UNKNOWN:
            el0_undef(regs);
            break;
        case ESR_ELx_EC_BTI:
            el0_bti(regs);
            break;
        case ESR_ELx_EC_BREAKPT_LOW:
        case ESR_ELx_EC_SOFTSTP_LOW:
        case ESR_ELx_EC_WATCHPT_LOW:
        case ESR_ELx_EC_BRK64:
            el0_dbg(regs, esr);
            break;
        case ESR_ELx_EC_FPAC:
            el0_fpac(regs, esr);
            break;
        default:
            el0_inv(regs, esr);
        }
    }
We are interested in the data abort, so the relevant case is:
    case ESR_ELx_EC_DABT_LOW:
        el0_da(regs, esr);
        break;
The function it calls is:
    static void noinstr el0_da(struct pt_regs *regs, unsigned long esr)
    {
        unsigned long far = read_sysreg(far_el1);

        enter_from_user_mode();
        local_daif_restore(DAIF_PROCCTX);
        do_mem_abort(far, esr, regs);
    }
Next, the implementation of do_mem_abort, which it calls:
    void do_mem_abort(unsigned long far, unsigned int esr, struct pt_regs *regs)
    {
        const struct fault_info *inf = esr_to_fault_info(esr);
        unsigned long addr = untagged_addr(far);

        if (!inf->fn(far, esr, regs))
            return;

        if (!user_mode(regs)) {
            pr_alert("Unhandled fault at 0x%016lx\n", addr);
            trace_android_rvh_do_mem_abort(regs, esr, addr, inf->name);
            mem_abort_decode(esr);
            show_pte(addr);
        }

        /*
         * At this point we have an unrecognized fault type whose tag bits may
         * have been defined as UNKNOWN. Therefore we only expose the untagged
         * address to the signal handler.
         */
        arm64_notify_die(inf->name, regs, inf->sig, inf->code, addr, esr);
    }
Note the function esr_to_fault_info here:
    static inline const struct fault_info *esr_to_fault_info(unsigned int esr)
    {
        return fault_info + (esr & ESR_ELx_FSC);
    }
So the table that matters is the fault_info array, indexed by the fault status code:
    static const struct fault_info fault_info[] = {
        { do_bad,               SIGKILL, SI_KERNEL,    "ttbr address size fault" },
        { do_bad,               SIGKILL, SI_KERNEL,    "level 1 address size fault" },
        { do_bad,               SIGKILL, SI_KERNEL,    "level 2 address size fault" },
        { do_bad,               SIGKILL, SI_KERNEL,    "level 3 address size fault" },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,  "level 0 translation fault" },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,  "level 1 translation fault" },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,  "level 2 translation fault" },
        { do_translation_fault, SIGSEGV, SEGV_MAPERR,  "level 3 translation fault" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 8" },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,  "level 1 access flag fault" },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,  "level 2 access flag fault" },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,  "level 3 access flag fault" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 12" },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,  "level 1 permission fault" },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,  "level 2 permission fault" },
        { do_page_fault,        SIGSEGV, SEGV_ACCERR,  "level 3 permission fault" },
        { do_sea,               SIGBUS,  BUS_OBJERR,   "synchronous external abort" },
        { do_tag_check_fault,   SIGSEGV, SEGV_MTESERR, "synchronous tag check fault" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 18" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 19" },
        { do_sea,               SIGKILL, SI_KERNEL,    "level 0 (translation table walk)" },
        { do_sea,               SIGKILL, SI_KERNEL,    "level 1 (translation table walk)" },
        { do_sea,               SIGKILL, SI_KERNEL,    "level 2 (translation table walk)" },
        { do_sea,               SIGKILL, SI_KERNEL,    "level 3 (translation table walk)" },
        { do_sea,               SIGBUS,  BUS_OBJERR,   "synchronous parity or ECC error" },    // Reserved when RAS is implemented
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 25" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 26" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 27" },
        { do_sea,               SIGKILL, SI_KERNEL,    "level 0 synchronous parity error (translation table walk)" },  // Reserved when RAS is implemented
        { do_sea,               SIGKILL, SI_KERNEL,    "level 1 synchronous parity error (translation table walk)" },  // Reserved when RAS is implemented
        { do_sea,               SIGKILL, SI_KERNEL,    "level 2 synchronous parity error (translation table walk)" },  // Reserved when RAS is implemented
        { do_sea,               SIGKILL, SI_KERNEL,    "level 3 synchronous parity error (translation table walk)" },  // Reserved when RAS is implemented
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 32" },
        { do_alignment_fault,   SIGBUS,  BUS_ADRALN,   "alignment fault" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 34" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 35" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 36" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 37" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 38" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 39" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 40" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 41" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 42" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 43" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 44" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 45" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 46" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 47" },
        { do_bad,               SIGKILL, SI_KERNEL,    "TLB conflict abort" },
        { do_bad,               SIGKILL, SI_KERNEL,    "Unsupported atomic hardware update fault" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 50" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 51" },
        { do_bad,               SIGKILL, SI_KERNEL,    "implementation fault (lockdown abort)" },
        { do_bad,               SIGBUS,  BUS_OBJERR,   "implementation fault (unsupported exclusive)" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 54" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 55" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 56" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 57" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 58" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 59" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 60" },
        { do_bad,               SIGKILL, SI_KERNEL,    "section domain fault" },
        { do_bad,               SIGKILL, SI_KERNEL,    "page domain fault" },
        { do_bad,               SIGKILL, SI_KERNEL,    "unknown 63" },
    };
The entry for ECC and Parity errors is the following one (index 24 in the table):
{ do_sea, SIGBUS, BUS_OBJERR, "synchronous parity or ECC error" }, // Reserved when RAS is implemented
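The lookup from a syndrome value to this table entry can be reproduced outside the kernel. The sketch below is user-space C under stated assumptions: the macro names are local stand-ins for the kernel's `ESR_ELx_*` definitions, with the values 26 for the EC shift, 0x24 for a data abort from a lower EL and 0x3f for the FSC mask. The constant 0x18 is simply index 24 of the table above.

```c
#include <stdbool.h>
#include <stdint.h>

#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      0x3fu
#define ESR_EC_DABT_LOW  0x24u  /* data abort taken from a lower EL */
#define ESR_FSC_MASK     0x3fu
#define FSC_SYNC_ECC     0x18u  /* "synchronous parity or ECC error" */

/* Exception class: selects the case in el0_sync_handler. */
unsigned esr_ec(uint32_t esr)
{
    return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

/* Fault status code: the index esr_to_fault_info adds to fault_info. */
unsigned esr_fsc(uint32_t esr)
{
    return esr & ESR_FSC_MASK;
}

/* True when an EL0 data abort would land on the ECC/parity entry. */
bool is_el0_sync_ecc_abort(uint32_t esr)
{
    return esr_ec(esr) == ESR_EC_DABT_LOW && esr_fsc(esr) == FSC_SYNC_ECC;
}
```

For example, the value 0x92000018 has EC 0x24 and FSC 0x18, so it selects `el0_da` in the switch and then the `do_sea` entry in the table.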
At this point we know that a typical ECC/Parity error ends up in do_sea. What matters next is how software behaves once it has received the error, so look at arm64_notify_die:
    void arm64_notify_die(const char *str, struct pt_regs *regs,
                          int signo, int sicode, unsigned long far,
                          int err)
    {
        if (user_mode(regs)) {
            WARN_ON(regs != current_pt_regs());
            current->thread.fault_address = 0;
            current->thread.fault_code = err;

            arm64_force_sig_fault(signo, sicode, far, str);
        } else {
            die(str, regs, err);
        }
    }
The function distinguishes between user space and kernel space.
The user-space branch calls arm64_force_sig_fault, which for this fault type delivers SIGBUS:
    void arm64_force_sig_fault(int signo, int code, unsigned long far,
                               const char *str)
    {
        arm64_show_signal(signo, str);
        if (signo == SIGKILL)
            force_sig(SIGKILL);
        else
            force_sig_fault(signo, code, (void __user *)far);
    }
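force_sig_fault is already part of the core signal implementation, so it is not analyzed here.
The kernel-space branch calls die instead. The kernel oopses right there, and it panics if panic_on_oops is enabled or the fault was taken in interrupt context.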
    void die(const char *str, struct pt_regs *regs, int err)
    {
        oops_exit();

        if (in_interrupt())
            panic("%s: Fatal exception in interrupt", str);
        if (panic_on_oops)
            panic("%s: Fatal exception", str);
    }
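To sum up: when an ECC error occurs, it is reported to the aarch64 core as a synchronous exception. Taking EL0 as the example, the error travels through the exception vector table to do_sea. The outcome then depends on where the faulting access happened: in user space, the program is terminated with a bus error (SIGBUS); in kernel space, the kernel oopses.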
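From the application's point of view, the end of this path is a SIGBUS whose `si_code` is `BUS_OBJERR`. A minimal sketch of how a process might observe that is shown below. It uses the standard POSIX `sigaction` interface; the helper names (`bus_code_name`, `install_sigbus_handler`) and the two global variables are invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Written by the handler, read by normal code afterwards. */
volatile sig_atomic_t bus_seen = 0;
volatile sig_atomic_t last_bus_code = 0;

/* Human-readable name for the si_code values that SIGBUS can carry. */
const char *bus_code_name(int si_code)
{
    switch (si_code) {
    case BUS_ADRALN: return "BUS_ADRALN";
    case BUS_ADRERR: return "BUS_ADRERR";
    case BUS_OBJERR: return "BUS_OBJERR";  /* the code used by the do_sea entry */
    default:         return "other";
    }
}

/* Handler: only records what arrived, which is async-signal-safe. */
static void on_sigbus(int signo, siginfo_t *info, void *ctx)
{
    (void)signo;
    (void)ctx;
    last_bus_code = info->si_code;
    bus_seen = 1;
}

/* Install the handler with SA_SIGINFO so that si_code is available. */
int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

Note that simply returning from the handler after a real hardware error re-executes the faulting access, so a real program would log the event and exit, or jump away with `siglongjmp`. The sketch only records the code so that the classification can be inspected.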
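SDEI is a software handling interface defined by Arm. The full name says what it does: Software Delegated Exception Interface. The idea is that the OS registers callbacks for events from the non-secure world, and the firmware calls them when the event occurs.
According to the spec, the SDEI dispatcher itself is implemented in the secure world.
SDEI defines a set of interactions between the two sides.
These include how an SDEI handler is entered and how it completes.
On the kernel side, start with the entry trampoline: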
    SYM_CODE_START(__sdei_asm_entry_trampoline)
        mrs     x4, ttbr1_el1
        tbz     x4, #USER_ASID_BIT, 1f

        tramp_map_kernel tmp=x4
        isb
        mov     x4, xzr

        /*
         * Use reg->interrupted_regs.addr_limit to remember whether to unmap
         * the kernel on exit.
         */
    1:  str     x4, [x1, #(SDEI_EVENT_INTREGS + S_ORIG_ADDR_LIMIT)]

        tramp_data_read_var x4, __sdei_asm_handler
        br      x4
    SYM_CODE_END(__sdei_asm_entry_trampoline)
The trampoline branches to __sdei_asm_handler, which is implemented as follows:
    /*
     * Software Delegated Exception entry point.
     *
     * x0: Event number
     * x1: struct sdei_registered_event argument from registration time.
     * x2: interrupted PC
     * x3: interrupted PSTATE
     * x4: maybe clobbered by the trampoline
     *
     * Firmware has preserved x0->x17 for us, we must save/restore the rest to
     * follow SMC-CC. We save (or retrieve) all the registers as the handler may
     * want them.
     */
    SYM_CODE_START(__sdei_asm_handler)
        stp     x2, x3, [x1, #SDEI_EVENT_INTREGS + S_PC]
        stp     x4, x5, [x1, #SDEI_EVENT_INTREGS + 16 * 2]
        stp     x6, x7, [x1, #SDEI_EVENT_INTREGS + 16 * 3]
        stp     x8, x9, [x1, #SDEI_EVENT_INTREGS + 16 * 4]
        stp     x10, x11, [x1, #SDEI_EVENT_INTREGS + 16 * 5]
        stp     x12, x13, [x1, #SDEI_EVENT_INTREGS + 16 * 6]
        stp     x14, x15, [x1, #SDEI_EVENT_INTREGS + 16 * 7]
        stp     x16, x17, [x1, #SDEI_EVENT_INTREGS + 16 * 8]
        stp     x18, x19, [x1, #SDEI_EVENT_INTREGS + 16 * 9]
        stp     x20, x21, [x1, #SDEI_EVENT_INTREGS + 16 * 10]
        stp     x22, x23, [x1, #SDEI_EVENT_INTREGS + 16 * 11]
        stp     x24, x25, [x1, #SDEI_EVENT_INTREGS + 16 * 12]
        stp     x26, x27, [x1, #SDEI_EVENT_INTREGS + 16 * 13]
        stp     x28, x29, [x1, #SDEI_EVENT_INTREGS + 16 * 14]
        mov     x4, sp
        stp     lr, x4, [x1, #SDEI_EVENT_INTREGS + S_LR]

        mov     x19, x1

        /* Store the registered-event for crash_smp_send_stop() */
        ldrb    w4, [x19, #SDEI_EVENT_PRIORITY]
        cbnz    w4, 1f
        adr_this_cpu dst=x5, sym=sdei_active_normal_event, tmp=x6
        b       2f
    1:  adr_this_cpu dst=x5, sym=sdei_active_critical_event, tmp=x6
    2:  str     x19, [x5]

    #ifdef CONFIG_VMAP_STACK
        /*
         * entry.S may have been using sp as a scratch register, find whether
         * this is a normal or critical event and switch to the appropriate
         * stack for this CPU.
         */
        cbnz    w4, 1f
        ldr_this_cpu dst=x5, sym=sdei_stack_normal_ptr, tmp=x6
        b       2f
    1:  ldr_this_cpu dst=x5, sym=sdei_stack_critical_ptr, tmp=x6
    2:  mov     x6, #SDEI_STACK_SIZE
        add     x5, x5, x6
        mov     sp, x5
    #endif

    #ifdef CONFIG_SHADOW_CALL_STACK
        /* Use a separate shadow call stack for normal and critical events */
        cbnz    w4, 3f
        ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_normal_ptr, tmp=x6
        b       4f
    3:  ldr_this_cpu dst=scs_sp, sym=sdei_shadow_call_stack_critical_ptr, tmp=x6
    4:
    #endif

        /*
         * We may have interrupted userspace, or a guest, or exit-from or
         * return-to either of these. We can't trust sp_el0, restore it.
         */
        mrs     x28, sp_el0
        ldr_this_cpu dst=x0, sym=__entry_task, tmp=x1
        msr     sp_el0, x0

        /* If we interrupted the kernel point to the previous stack/frame. */
        and     x0, x3, #0xc
        mrs     x1, CurrentEL
        cmp     x0, x1
        csel    x29, x29, xzr, eq       // fp, or zero
        csel    x4, x2, xzr, eq         // elr, or zero

        stp     x29, x4, [sp, #-16]!
        mov     x29, sp

        add     x0, x19, #SDEI_EVENT_INTREGS
        mov     x1, x19
        bl      __sdei_handler

        msr     sp_el0, x28
        /* restore regs >x17 that we clobbered */
        mov     x4, x19                 // keep x4 for __sdei_asm_exit_trampoline
        ldp     x28, x29, [x4, #SDEI_EVENT_INTREGS + 16 * 14]
        ldp     x18, x19, [x4, #SDEI_EVENT_INTREGS + 16 * 9]
        ldp     lr, x1, [x4, #SDEI_EVENT_INTREGS + S_LR]
        mov     sp, x1

        mov     x1, x0                  // address to complete_and_resume
        /* x0 = (x0 <= 1) ? EVENT_COMPLETE:EVENT_COMPLETE_AND_RESUME */
        cmp     x0, #1
        mov_q   x2, SDEI_1_0_FN_SDEI_EVENT_COMPLETE
        mov_q   x3, SDEI_1_0_FN_SDEI_EVENT_COMPLETE_AND_RESUME
        csel    x0, x2, x3, ls

        ldr_l   x2, sdei_exit_mode

        /* Clear the registered-event seen by crash_smp_send_stop() */
        ldrb    w3, [x4, #SDEI_EVENT_PRIORITY]
        cbnz    w3, 1f
        adr_this_cpu dst=x5, sym=sdei_active_normal_event, tmp=x6
        b       2f
    1:  adr_this_cpu dst=x5, sym=sdei_active_critical_event, tmp=x6
    2:  str     xzr, [x5]

    alternative_if_not ARM64_UNMAP_KERNEL_AT_EL0
        sdei_handler_exit exit_mode=x2
    alternative_else_nop_endif

    #ifdef CONFIG_UNMAP_KERNEL_AT_EL0
        tramp_alias dst=x5, sym=__sdei_asm_exit_trampoline, tmp=x3
        br      x5
    #endif
    SYM_CODE_END(__sdei_asm_handler)
    NOKPROBE(__sdei_asm_handler)
The call to follow in this code is:
bl __sdei_handler
__sdei_handler is implemented as follows:
    asmlinkage noinstr unsigned long
    __sdei_handler(struct pt_regs *regs, struct sdei_registered_event *arg)
    {
        unsigned long ret;

        arm64_enter_nmi(regs);

        ret = _sdei_handler(regs, arg);

        arm64_exit_nmi(regs);

        return ret;
    }
_sdei_handler then processes the event the way the SDEI protocol expects of an event handler:
    static __kprobes unsigned long _sdei_handler(struct pt_regs *regs,
                                                 struct sdei_registered_event *arg)
    {
        u32 mode;
        int i, err = 0;
        int clobbered_registers = 4;
        u64 elr = read_sysreg(elr_el1);
        u32 kernel_mode = read_sysreg(CurrentEL) | 1;   /* +SPSel */
        unsigned long vbar = read_sysreg(vbar_el1);

        if (arm64_kernel_unmapped_at_el0())
            clobbered_registers++;

        /* Retrieve the missing registers values */
        for (i = 0; i < clobbered_registers; i++) {
            /* from within the handler, this call always succeeds */
            sdei_api_event_context(i, &regs->regs[i]);
        }

        /*
         * We didn't take an exception to get here, set PAN. UAO will be cleared
         * by sdei_event_handler()s force_uaccess_begin() call.
         */
        __uaccess_enable_hw_pan();

        err = sdei_event_handler(regs, arg);
        if (err)
            return SDEI_EV_FAILED;

        if (elr != read_sysreg(elr_el1)) {
            /*
             * We took a synchronous exception from the SDEI handler.
             * This could deadlock, and if you interrupt KVM it will
             * hyp-panic instead.
             */
            pr_warn("unsafe: exception during handler\n");
        }

        mode = regs->pstate & (PSR_MODE32_BIT | PSR_MODE_MASK);

        /*
         * If we interrupted the kernel with interrupts masked, we always go
         * back to wherever we came from.
         */
        if (mode == kernel_mode && !interrupts_enabled(regs))
            return SDEI_EV_HANDLED;

        /*
         * Otherwise, we pretend this was an IRQ. This lets user space tasks
         * receive signals before we return to them, and KVM to invoke it's
         * world switch to do the same.
         *
         * See DDI0487B.a Table D1-7 'Vector offsets from vector table base
         * address'.
         */
        if (mode == kernel_mode)
            return vbar + 0x280;
        else if (mode & PSR_MODE32_BIT)
            return vbar + 0x680;

        return vbar + 0x480;
    }
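The three magic offsets at the end of this function, and the way the assembly caller interprets the return value, can be captured in a small user-space sketch. The enum and function names below are invented for illustration. The layout rule (four groups of 0x200 bytes, four entries of 0x80 bytes each) is the standard AArch64 vector table layout, and the `<= 1` test mirrors the comment in `__sdei_asm_handler`, where the values 0 and 1 correspond to `SDEI_EV_HANDLED` and `SDEI_EV_FAILED`.

```c
#include <stdint.h>

/* Which group of the vector table: where the exception came from. */
enum vec_source { CUR_EL_SP0 = 0, CUR_EL_SPX = 1, LOWER_EL_A64 = 2, LOWER_EL_A32 = 3 };
/* Which entry inside the group. */
enum vec_kind { VEC_SYNC = 0, VEC_IRQ = 1, VEC_FIQ = 2, VEC_SERROR = 3 };

/* Offset of a vector from VBAR_ELx: 0x200 per group, 0x80 per entry. */
uint64_t vector_offset(enum vec_source src, enum vec_kind kind)
{
    return (uint64_t)src * 0x200 + (uint64_t)kind * 0x80;
}

/* Which SDEI call the exit path makes for a given handler result. */
enum sdei_exit_call { CALL_EVENT_COMPLETE, CALL_EVENT_COMPLETE_AND_RESUME };

/* 0 (handled) and 1 (failed) complete the event; any larger value is
 * treated as the address to resume at, here an IRQ vector. */
enum sdei_exit_call pick_exit_call(uint64_t handler_ret)
{
    return handler_ret <= 1 ? CALL_EVENT_COMPLETE
                            : CALL_EVENT_COMPLETE_AND_RESUME;
}
```

With this rule, 0x280 is the IRQ vector for the current EL using SP_ELx (an interrupted kernel), 0x480 is the IRQ vector for 64-bit EL0, and 0x680 is the IRQ vector for 32-bit EL0, which matches the three return statements above.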
The function to look at here is sdei_event_handler. It belongs to the SDEI firmware driver, which is set up through ACPI or FDT:
    int sdei_event_handler(struct pt_regs *regs,
                           struct sdei_registered_event *arg)
    {
        int err;
        mm_segment_t orig_addr_limit;
        u32 event_num = arg->event_num;

        /*
         * Save restore 'fs'.
         * The architecture's entry code save/restores 'fs' when taking an
         * exception from the kernel. This ensures addr_limit isn't inherited
         * if you interrupted something that allowed the uaccess routines to
         * access kernel memory.
         * Do the same here because this doesn't come via the same entry code.
         */
        orig_addr_limit = force_uaccess_begin();

        err = arg->callback(event_num, regs, arg->callback_arg);
        if (err)
            pr_err_ratelimited("event %u on CPU %u failed with error: %d\n",
                               event_num, smp_processor_id(), err);

        force_uaccess_end(orig_addr_limit);

        return err;
    }
    NOKPROBE_SYMBOL(sdei_event_handler);
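From here on, the flow follows the interactions that SDEI defines.
While going through this code and learning about SDEI, I came across a repository called rasdaemon. It is a user-space program that captures common RAS errors, which includes the single-bit memory flips discussed here.
rasdaemon can count memory errors and present the totals to the user.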
Unfortunately, I did not find this implemented for aarch64 in the tool, but the AMD code shows what such an implementation looks like:
parse_amd_smca_event--->decode_smca_error
Picking one of the description tables in smca_mce_descs as an example:
    static const char * const smca_smu2_mce_desc[] = {
        "High SRAM ECC or parity error",
        "Low SRAM ECC or parity error",
        "Data Cache Bank A ECC or parity error",
        "Data Cache Bank B ECC or parity error",
        "Data Tag Cache Bank A ECC or parity error",
        "Data Tag Cache Bank B ECC or parity error",
        "Instruction Cache Bank A ECC or parity error",
        "Instruction Cache Bank B ECC or parity error",
        "Instruction Tag Cache Bank A ECC or parity error",
        "Instruction Tag Cache Bank B ECC or parity error",
        "System Hub Read Buffer ECC or parity error",
        "PHY RAS ECC Error",
        [12 ... 57] = "Reserved",
        "A correctable error from a GFX Sub-IP",
        "A fatal error from a GFX Sub-IP",
        "Reserved",
        "Reserved",
        "A poison error from a GFX Sub-IP",
        "Reserved",
    };
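As the table shows, these descriptions cover both ECC and Parity errors.
However, since I do not have a suitable machine to try it on, I have not verified rasdaemon in practice.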