Virtual memory-mapped communication (VMMC) was developed out of the need for a basic multicomputer communication mechanism with extremely low latency and high bandwidth. This is achieved by allowing applications to transfer data directly between two virtual memory address spaces over the network. The basic mechanism is designed to efficiently support applications and common communication models such as message passing, shared memory, RPC, and client-server. The VMMC mechanism consists of several calls to support user-level buffer management, various data transfer strategies, and transfer of control.
In the VMMC model, an import-export mapping must be established before communication begins. A receiving process can export a region of its address space as a receive buffer together with a set of permissions to define access rights for the buffer. In order to send data to an exported receive buffer, a user process must import the buffer with the right permissions. After successful imports, a sender can transfer data from its virtual memory into imported receive buffers at user-level without further protection checking or protection domain crossings. Communication under this import-export mapping mechanism is protected in two ways. First, a trusted third party such as the operating system kernel or a trusted process implements import and export operations. Second, the hardware MMU on an importing node makes sure that transferred data cannot overwrite memory outside a receive buffer.
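The export and import steps can be illustrated with a small in-process sketch. It is not the VMMC API: the names Kernel, export_buffer, import_buffer, ImportedBuffer and send are hypothetical, permissions are reduced to a set of importer names, and a software bounds check stands in for the protection that the real system gets from the MMU.

```python
# Toy, in-process model of VMMC import-export mappings (not the real VMMC API).
# All class and method names here are hypothetical and chosen for illustration.

class Kernel:
    """Stands in for the trusted third party that implements export and import."""

    def __init__(self):
        # buffer name -> (receiver memory, offset, length, allowed importers)
        self.exports = {}

    def export_buffer(self, name, memory, offset, length, allowed):
        self.exports[name] = (memory, offset, length, set(allowed))

    def import_buffer(self, name, importer):
        memory, offset, length, allowed = self.exports[name]
        if importer not in allowed:
            raise PermissionError(f"{importer} may not import {name}")
        return ImportedBuffer(memory, offset, length)


class ImportedBuffer:
    """Handle held by a sender after a successful import."""

    def __init__(self, memory, offset, length):
        self._memory, self._offset, self._length = memory, offset, length

    def send(self, data, dest_offset=0):
        # The bounds check models the guarantee that transferred data
        # never lands outside the exported receive buffer.
        if dest_offset < 0 or dest_offset + len(data) > self._length:
            raise ValueError("transfer would fall outside the receive buffer")
        start = self._offset + dest_offset
        self._memory[start:start + len(data)] = data


kernel = Kernel()
receiver_memory = bytearray(64)
kernel.export_buffer("inbox", receiver_memory, offset=16, length=8, allowed={"alice"})

inbox = kernel.import_buffer("inbox", importer="alice")  # checked once, at import time
inbox.send(b"hello")                                     # no kernel involvement here
```

In this sketch the permission check happens only in import_buffer; later send calls touch the receiver's memory directly, which mirrors the idea that transfers after a successful import need no further protection domain crossings.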
The VMMC model defines two user-level transfer strategies: deliberate update and automatic update. Deliberate update is an explicit transfer of data from a sender's memory to a receiver's memory.
In order to use automatic update, a sender binds a portion of its address space to an imported receive buffer, creating an automatic update binding between the local and remote memory. All writes performed to the local memory are automatically performed to the remote memory as well, eliminating the need for an explicit send operation.
An important distinction between these two transfer strategies is that under automatic update, local memory is "bound" to a receive buffer at the time a mapping is created, while under deliberate update the binding does not occur until an explicit send command is issued.
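The difference in binding time can be made concrete with a toy model. Everything in it is an assumption made for illustration (the Sender and ReceiveBuffer classes, and the bind, write and send methods); it only mimics the semantics described above and says nothing about how the hardware does it.

```python
# Toy model of the two VMMC transfer strategies; every name here is hypothetical.

class ReceiveBuffer:
    def __init__(self, size):
        self.data = bytearray(size)


class Sender:
    def __init__(self, size):
        self.memory = bytearray(size)
        self.bindings = []  # (local_start, length, remote buffer, remote_start)

    def bind(self, local_start, length, remote, remote_start=0):
        """Automatic update: local memory is bound when the mapping is created."""
        self.bindings.append((local_start, length, remote, remote_start))

    def write(self, addr, value):
        """An ordinary local store; writes to bound addresses are propagated."""
        self.memory[addr] = value
        for local_start, length, remote, remote_start in self.bindings:
            if local_start <= addr < local_start + length:
                remote.data[remote_start + addr - local_start] = value

    def send(self, local_start, length, remote, remote_start=0):
        """Deliberate update: source and destination are chosen at send time."""
        chunk = self.memory[local_start:local_start + length]
        remote.data[remote_start:remote_start + length] = chunk


remote = ReceiveBuffer(8)
sender = Sender(16)

sender.bind(local_start=0, length=4, remote=remote)  # automatic update region
sender.write(1, 0xAA)   # propagated as a side effect, no send call
sender.write(8, 0xBB)   # unbound address: stays local for now
after_writes = bytes(remote.data)

sender.send(local_start=8, length=1, remote=remote, remote_start=4)  # explicit transfer
after_send = bytes(remote.data)
```

Note that the same local byte at address 8 could be sent to any imported buffer and offset by a later send call, whereas the bound region always maps to the destination fixed by bind.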
Automatic update is optimized for low latency, and deliberate update is designed for flexible import-export mappings and for reducing network traffic.
Automatic update is implemented by having the SHRIMP-II network interface hardware snoop all writes on the memory bus. If the write is to an address that has an automatic update binding, the hardware builds a packet containing the destination address and the written value, and sends it to the destination node. The hardware can combine writes to consecutive locations into a single packet.
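A software sketch of the snoop-and-combine behavior follows. It is only an approximation of the idea: the function name combine_writes, the packet layout (destination node, destination address, list of values) and the max_payload limit are assumptions, not details of the SHRIMP-II hardware.

```python
# Sketch of combining snooped writes into packets.
# The function name, packet layout, and max_payload limit are hypothetical.

def combine_writes(writes, bindings, max_payload=4):
    """writes: iterable of (local_addr, value) in program order.
    bindings: list of (local_start, length, dest_node, dest_start).
    Returns a list of packets (dest_node, dest_addr, [values])."""
    packets = []
    current = None  # packet being built: [dest_node, dest_addr, values]
    for addr, value in writes:
        target = None
        for local_start, length, node, dest_start in bindings:
            if local_start <= addr < local_start + length:
                target = (node, dest_start + addr - local_start)
                break
        if target is None:
            continue  # no automatic update binding: nothing is sent
        node, dest_addr = target
        extends = (current is not None
                   and current[0] == node
                   and current[1] + len(current[2]) == dest_addr
                   and len(current[2]) < max_payload)
        if extends:
            current[2].append(value)      # consecutive location: same packet
        else:
            if current is not None:
                packets.append(tuple(current))
            current = [node, dest_addr, [value]]
    if current is not None:
        packets.append(tuple(current))
    return packets


bindings = [(100, 8, "node1", 0)]
writes = [(100, 1), (101, 2), (102, 3), (50, 9), (106, 4)]
packets = combine_writes(writes, bindings)
```

Here the three writes to addresses 100 through 102 travel in one packet, the write to the unbound address 50 generates no traffic, and the write to 106 starts a new packet because it is not adjacent to the previous one.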
Deliberate update is implemented by having a user-level program execute a sequence of two accesses to addresses which are decoded by the SHRIMP-II network interface board on the node's expansion bus (the EISA bus). These accesses specify the source address, destination address, and size of a transfer. The ordinary virtual memory protection mechanisms (MMU and page tables) are used to maintain protection.
VMMC guarantees the in-order, reliable delivery of all data transfers, provided that the ordinary, blocking version of the deliberate-update send operation is used. The ordering guarantees are a bit more complicated when the non-blocking deliberate-update send operation is used, but we omit a detailed discussion of this point because none of the programs we will describe use this non-blocking operation.
The VMMC model does not include any buffer management since data is transferred directly between user-level address spaces. This gives applications the freedom to utilize as little buffering and copying as needed. The model directly supports zero-copy protocols when both the send and receive buffers are known at the time of a transfer initiation.
The VMMC model assumes that receive buffer addresses are specified by the sender, and received data is transferred directly to memory. Hence, there is no explicit receive operation. CPU involvement in receiving data can be as little as checking a flag, although a hardware notification mechanism is also supported.
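One way a receiver might check a flag is sketched below. The buffer layout (a one-byte length flag followed by the payload) and the function names sender_transfer and poll are assumptions made for this example; real programs would choose their own conventions.

```python
# Toy illustration of receiving without a receive call.
# The buffer layout and function names are hypothetical.

FLAG_OFFSET = 0   # first byte of the exported buffer serves as a "message ready" flag
DATA_OFFSET = 1   # payload starts right after the flag

def sender_transfer(receive_buffer, payload):
    """Models a sender depositing data directly into the receiver's memory.
    The payload is written before the flag, relying on in-order delivery."""
    receive_buffer[DATA_OFFSET:DATA_OFFSET + len(payload)] = payload
    receive_buffer[FLAG_OFFSET] = len(payload)

def poll(receive_buffer):
    """Receiver side: check the flag and return the message if one has arrived."""
    length = receive_buffer[FLAG_OFFSET]
    if length == 0:
        return None
    message = bytes(receive_buffer[DATA_OFFSET:DATA_OFFSET + length])
    receive_buffer[FLAG_OFFSET] = 0   # mark the buffer as free again
    return message

buffer = bytearray(16)
before = poll(buffer)              # nothing has arrived yet
sender_transfer(buffer, b"ping")
after = poll(buffer)               # the receiver's only work is this flag check
```

Writing the flag last matters in this sketch: the receiver may treat the payload as complete once it sees a nonzero flag only because transfers are assumed to arrive in order.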
The notification mechanism is used to transfer control to a receiving process, or to notify the receiving process about external events. It consists of a message transfer followed by an invocation of a user-specified, user-level handler function. The receiving process can associate a separate handler function with each exported buffer, and notifications only take effect when a handler has been specified.
Notifications are similar to UNIX signals in that they can be blocked and unblocked, they can be accepted or discarded (on a per-buffer basis), and a process can be suspended until a particular notification arrives. Unlike signals, however, notifications are queued when blocked. Our current implementation of notifications uses signals, but we expect to reimplement notifications in a way similar to active messages, with performance much better than signals in the common case.
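The queue-when-blocked behavior can be sketched as follows. This is a toy model only: the Notifier class and its methods are hypothetical, blocking is modeled globally rather than per buffer, and nothing here reflects the signal-based implementation mentioned above.

```python
# Toy model of per-buffer notification handlers that queue while blocked.
# The Notifier class and its method names are hypothetical.
from collections import deque

class Notifier:
    def __init__(self):
        self.handlers = {}      # buffer name -> user-level handler function
        self.blocked = False
        self.pending = deque()  # notifications that arrived while blocked

    def set_handler(self, buffer_name, handler):
        self.handlers[buffer_name] = handler

    def block(self):
        self.blocked = True

    def unblock(self):
        self.blocked = False
        # Deliver queued notifications in arrival order.
        while self.pending and not self.blocked:
            self._dispatch(*self.pending.popleft())

    def arrive(self, buffer_name, message):
        """Called when a message carrying a notification lands in a buffer."""
        if buffer_name not in self.handlers:
            return  # no handler specified: the notification has no effect
        if self.blocked:
            self.pending.append((buffer_name, message))
        else:
            self._dispatch(buffer_name, message)

    def _dispatch(self, buffer_name, message):
        self.handlers[buffer_name](message)


log = []
n = Notifier()
n.set_handler("inbox", log.append)
n.arrive("other", "ignored")    # this buffer has no handler
n.block()
n.arrive("inbox", "first")
n.arrive("inbox", "second")
queued_while_blocked = list(log)  # nothing delivered yet
n.unblock()                       # both notifications are delivered, in order
```

Unlike a blocked UNIX signal, where repeated arrivals of the same signal are typically merged into one pending instance, both notifications survive here and reach the handler once unblocked.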
VMMC on Myrinet
VMMC has been designed and implemented for the SHRIMP multicomputer, where it delivers user-to-user latency and bandwidth close to the limits imposed by the underlying hardware.
The implementation of VMMC on a Myrinet network of PCI-based PCs (uniprocessors and SMPs) aims to determine whether the benefits of VMMC can be realized on the new hardware, and to investigate network interface design tradeoffs by comparing SHRIMP and Myrinet and their respective VMMC implementations.
Our Myrinet implementation of VMMC achieves about 10 microseconds one-way latency and provides more than 100 MBytes/sec user-to-user bandwidth. Compared to SHRIMP, the Myrinet implementation of VMMC incurs relatively higher overhead and demands more network interface resources (LANai processor, on-board SRAM) but requires less operating system support.