The first 32-bit CPU I ever programmed (well, I had been programming 36-bit ones earlier) had a user-level instruction, SOLO, that disabled all interrupts for up to 255 clock cycles, or until a TUTTI instruction was executed. If no TUTTI was executed, a non-maskable exception was raised in the running process on the 256th clock cycle. Each of these instructions took one clock cycle to execute.
You can do quite a lot in 255 clock cycles (most non-FP instructions were single-cycle on this CPU). If the entire update couldn't be completed in that time, you could certainly test and set a semaphore (even maintain a queue for it). With no privilege change and no MMS-related changes involved, just two single-cycle instructions, this made for a very low-cost (in terms of CPU load) synchronization mechanism. It is a pity that instructions like SOLO / TUTTI are not available on every machine architecture today.
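To make the idea concrete, here is a toy model of it in Python. This is only a sketch under my own assumptions, not ND-500 code: `ToyCpu`, `Semaphore`, `acquire` and `release` are hypothetical names, every modelled step is assumed to cost one cycle, and SOLO and TUTTI themselves are not counted against the 255-cycle budget (the description above does not settle that detail).

```python
from collections import deque

SOLO_LIMIT = 255  # cycles allowed with interrupts disabled


class SoloOverrun(Exception):
    """Stands in for the non-maskable exception on the 256th cycle."""


class ToyCpu:
    """Hypothetical cycle-counting model, not the real instruction set."""

    def __init__(self):
        self.in_solo = False
        self.solo_cycles = 0   # cycles elapsed since SOLO
        self.pending = []      # interrupts held back during the window
        self.delivered = []    # interrupts actually taken

    def _end_window(self):
        # Re-enable interrupts and deliver whatever was held back.
        self.in_solo = False
        self.delivered.extend(self.pending)
        self.pending.clear()

    def solo(self):
        self.in_solo = True
        self.solo_cycles = 0

    def tutti(self):
        self._end_window()

    def tick(self, cycles=1):
        # Account for executed instructions, assumed single-cycle each.
        for _ in range(cycles):
            if self.in_solo:
                self.solo_cycles += 1
                if self.solo_cycles > SOLO_LIMIT:
                    self._end_window()
                    raise SoloOverrun("no TUTTI within 255 cycles")

    def interrupt(self, name):
        # An interrupt is deferred while the window is open.
        (self.pending if self.in_solo else self.delivered).append(name)


class Semaphore:
    def __init__(self):
        self.taken = False
        self.waiters = deque()  # queue of waiting process ids


def acquire(cpu, sem, pid):
    """Test and set sem inside a SOLO/TUTTI window; queue pid if taken."""
    cpu.solo()
    try:
        cpu.tick(2)              # test + set (assumed cost)
        if not sem.taken:
            sem.taken = True
            return True
        cpu.tick()               # enqueue (assumed cost)
        sem.waiters.append(pid)
        return False
    finally:
        cpu.tutti()


def release(cpu, sem):
    """Pass the semaphore to the oldest waiter, or mark it free."""
    cpu.solo()
    try:
        cpu.tick(2)              # test + update (assumed cost)
        if sem.waiters:
            return sem.waiters.popleft()  # ownership passes on directly
        sem.taken = False
        return None
    finally:
        cpu.tutti()
```

In this model `acquire` and `release` each use only a handful of cycles of the window, which is the point: the window only has to cover the semaphore bookkeeping, and the long update itself then runs under the semaphore with interrupts enabled.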
(For the curious ones: I am talking about a machine called the ND-500, introduced to the market in 1980. Obviously, I/O was handled by DMA and interrupt handlers which could buffer input for 256 cycles or more; low-latency interrupts were provided through programmable controllers, so no system component was dependent on the CPU giving it attention in less than 256 clock cycles.)