Machine-check exception

From ArchWiki

This article aims to help users implement services to actively monitor, log, and report hardware errors. A machine check exception (MCE) is an error generated by the CPU when it detects that a hardware error or failure has occurred.

MCEs can occur for a variety of reasons, including out-of-spec voltages from the power supply, cosmic radiation flipping bits in memory DIMMs or the CPU, and other miscellaneous faults such as faulty software triggering hardware errors.

Installation

Install the rasdaemon package from the AUR. Written by Mauro Carvalho Chehab, rasdaemon is one of the tools available for gathering MCE information.

Previously, this task was performed by the mcelog package. However, mcelog has been deprecated, and Arch kernels are no longer compiled with the configuration option it requires, CONFIG_X86_MCELOG_LEGACY (FS#55657).

Configuration

Two systemd services need to be started and enabled. ras-mc-ctl.service registers DIMM labels (from /etc/ras/dimm_labels.d/) with the EDAC drivers; on consumer-grade motherboards it usually logs a "No dimm labels for <motherboard model>" error and does nothing else. rasdaemon.service runs as a daemon and logs RAS events to the systemd journal.
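As a rough illustration of the setup step, the sketch below builds the systemctl command that enables and starts both units, and reports what `systemctl is-active` prints for each. The helper names `enable_command` and `unit_states` are hypothetical and are not part of rasdaemon; only the unit names come from this article. Note that a unit which has already done its work may report a state other than "active", so treat the output as a hint rather than a verdict.

```python
import subprocess

# Unit names as given in this article.
UNITS = ("ras-mc-ctl.service", "rasdaemon.service")


def enable_command(units=UNITS):
    """Return the systemctl invocation that enables and starts the units."""
    return ["systemctl", "enable", "--now", *units]


def unit_states(units=UNITS, run=subprocess.run):
    """Map each unit to the state word printed by `systemctl is-active`."""
    states = {}
    for unit in units:
        # is-active exits non-zero unless the unit is active, so the
        # return code is deliberately not checked here.
        result = run(
            ["systemctl", "is-active", unit],
            capture_output=True,
            text=True,
        )
        states[unit] = result.stdout.strip()
    return states
```

On a systemd machine, `" ".join(enable_command())` gives the command line to run as root, and `unit_states()` returns a dictionary such as one mapping each unit name to the word systemctl printed for it.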

See ras-mc-ctl(8) and rasdaemon(1) for more information.

Usage

Use ras-mc-ctl --error-count and ras-mc-ctl --summary for a quick overview of the recorded errors. Errors are logged both to the journal and to the SQLite database at /var/lib/rasdaemon/ras-mc_event.db.
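The database can also be inspected directly. The following is a minimal sketch that makes no assumption about the schema: it opens the file read-only, discovers the tables from sqlite_master, and counts the rows in each. The function name `count_rows_per_table` is hypothetical; only the database path comes from this article. Depending on the file's permissions, the script may need to run as root.

```python
import sqlite3

# Database path as given in this article.
DB_PATH = "/var/lib/rasdaemon/ras-mc_event.db"


def count_rows_per_table(db_path=DB_PATH):
    """Return {table_name: row_count} for every user table in the database."""
    # Open read-only so the script can never modify the recorded events.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        counts = {}
        for table in tables:
            # Names come from sqlite_master; quote them as identifiers.
            quoted = '"' + table.replace('"', '""') + '"'
            counts[table] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        conn.close()
```

Calling `count_rows_per_table()` returns one entry per table, so a non-zero count points to the table worth examining further with a regular SQLite client.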

See also

Hardware documentation