Hello,
I have identified a major deficiency in the x86 kernel that requires minor overhaul of
multiple low-level components, and stems from poorly understood implementation details of
the x86 architecture, which is NMI support.
NMIs are Non Maskable Interrupts, similar to SMIs (generated for SMM) mode but typically
used for critical hardware errors (somewhat a precursor to MCEs).
Most modern operating systems support NMI as a debugging tool: when a system is deadlocked
due to interrupt issues or perhaps in a state with interrupts disabled, it is often hard
to "break-in" the system to analyze the issue. They are also used as a last
resort to terminate the system in case of major hardware error (such as power issues or
parity errors).
NMIs can also be generated by external hardware:
This simple circuit generates a tri-state one-cycle SERR# pulse on the PCI bus, which
causes an NMI. It can be used as an emergency dump switch, when other methods fail.
Â
The following issues exist in ReactOS that hinder NMI support:
Â
- The I/O Privilege Map (IOPM) configuration is done dangerously and incorrectly. A number
of misunderstood hardcoded values are used throughout the code, assumptions are made on
the number of IOPMs, and IOPM switching is done very poorly during BIOS Calls.
Â
- BIOS Calls are currently executing on the TSS context of the current state. This works
fine with the normal KGDT_TSS that NTOS executes on, but causes dangerous errors on
scenarios such as Double Faults (KGDT_DF_TSS) and NMIs (KGDT_NMI_TSS). These TSS segments
do not have an IOPM allocated, which causes memory corruption when the BIOS Call code
attempts to save and restore the IOPM by assuming it's there. It also causes BIOS code
to fail during execution; after tracing with an IDP we discovered that BIOS I/O Port
accesses were generating exceptions, which turned out to be due to the fact the BIOS was
reading the bogus, non-existing IOPM and thus failing to validate I/O Port access. This is
currently a problem in ReactOS as a double-fault trap will trigger massive corruption, as
the panic code will attempt to draw the "Blue screen of death", requiring a
Video ROM Interrupt 10h through a BIOS Call, which will fail as explained. In an NMI case,
the same scenario would also happen.
Â
- The NMI trap code is not yet implemented.
Â
- The KeRegisterNmiCallback and KeDeregisterNmiCallback routines are not yet implemented.
Â
- KPRCB Context-switching is not yet implemented, along with related routines. Only the
high-level routines used during debug traps are implemented, but not the support required
for resuming after an NMI.
Â
- HalHandleNMI is subject to recursive NMI scenarios.
Â
- BIOS Calls do excessive TLB flushing.
Â
- The logic for IDT write-protection during BIOS Calls is overcomplicated. The IDT should
always be made read-only and restore to its previous state.
Â
- TLB flushing in the HAL appears to be broken when global pages are used. Additionally,
the same problem exists in NTOS -- there is no support for TLB flushing when Global Pages
are used, even though Global Page support is enabled in the MMU and the bit is used on
kernel PTEs. This leads to either over-flushing global pages during context switching,
which shouldn't be done, or non-flushing of global pages, when they should be
flushed.
Â
- Tangential issue: I have written a new UNIMPLEMENTED_PATH macro that now describes the
exact path that was touched. Previously only the PC was given, which makes it nearly
impossible to connect to the line of source causing the issue, especially for non
developers. This new macro outputs a string reason. Additionally, an
UNIMPLEMENTED_V86_PATH is used for scenarios where the path is only expected in VDM/V8086
scenarios, to differentiate from unlikely paths in normal execution flows.
Â
These issues have all been fixed in my toilet. All this work stemmed from doing some
testing of the new ARM3 section code written recently (never debug other people's
code!), which led to significant debugging pains without NMI support. It has nothing to do
with the ARM port but since I've written it, I might as well pass it on instead of
keeping it locally for eternity.
Â
Thoughts/comments?
Â
-r
Show replies by date
Hi,
This is not related to NMI, just to remind what is already reported in
bugzilla. The idle thread stack can grow till it damages kernel data.
Neither there is an unmapped page at the end, nor read-only segments
are read-only.