Next: Completing memory instructions in Up: Processor Memory Unit Previous: Address generation

Issuing instructions to the memory hierarchy

Source files: src/Processor/memunit.cc, src/Processor/memprocess.cc, src/MemSys/cpu.c

Header files: incl/Processor/memory.h, incl/Processor/hash.h, incl/Processor/memprocess.h, incl/MemSys/cpu.h

Every cycle, the simulator calls the IssueMem function. In the case of RC, this function first checks whether any outstanding memory fences (MEMBAR instructions) can be broken down; this occurs when every instruction in the class of operations that the fence has been waiting upon has completed. If the processor supports a consistency implementation with speculative load execution (chosen with "-K"), all completed speculative loads beyond the voided fence that are no longer blocked by consistency or disambiguation constraints are allowed to leave the memory unit through the PerformMemOp function.

The IssueMem function then attempts to issue actual loads and stores to the memory system. If the system implements SC or PC, the IssueMems function is called. With RC, IssueStores is called first, followed by IssueLoads. We issue instructions in this order with RC not to favor stores, but rather to favor older instructions (as discussed in Section 10, no store can be marked ready to issue until it is one of the oldest instructions in the active list and all previous instructions have completed).

The functions IssueStores and IssueLoads, or IssueMems for SC and PC systems, scan the appropriate part of the memory unit for instructions that can be issued this cycle. At a bare minimum, the instruction must have passed through address generation and there must be a cache port available for the instruction. The following description focuses on the additional requirements for issuing each type of instruction under each memory consistency model. Steps 1a-1e below refer to the various types of instructions that may be considered available for issue. Step 2 is required for each instruction that actually issues. Step 3 is used only with consistency implementations that include hardware-controlled non-binding prefetching from the instruction window.

Step 1a: Stores in sequential consistency or processor consistency

If the instruction under consideration is a store in SC or PC, it must be the oldest instruction in the memory unit and must have been marked ready in the graduate stage (as described in Section 10.7) before it can issue to the cache. If the processor supports hardware prefetching from the instruction window, then the system can mark a store for a possible hardware prefetch even if it is not ready to issue as a demand access to the caches.

Step 1b: Stores in release consistency

Stores in RC issue after being marked ready, provided there are no current ordering constraints imposed by memory fences. If any such constraints are present and the system has hardware prefetching, the system can mark the store for a possible hardware prefetch. A store can be removed from the memory unit as soon as it issues to the cache, rather than waiting for its completion in the memory hierarchy (as in sequential consistency and processor consistency). When a store is issued to the caches, the processor's StoresToMem field is incremented. However, as we do not currently simulate data in the caches, stores remain in what we call a virtual store buffer. The virtual store buffer is part of the StoreQueue data structure, and its size is given by the processor's StoresToMem field. These elements are not counted in the memory unit size, but may be used for obtaining values for later loads.

Step 1c: Loads in sequential consistency

A load instruction in sequential consistency can only issue non-speculatively if it is at the head of the memory unit. If hardware prefetching is enabled, later loads can be marked for possible prefetching. If speculative load execution is present, later loads can be issued to the caches. Before issuing such a load, however, the memory unit is checked for any previous stores with an overlapping address. If a store exactly matches the address needed by the load, the load value can be forwarded directly from the store. However, if a store address only partially overlaps with the load address, the load will be stalled in order to guarantee that it reads a correct value when it issues to the caches.

Step 1d: Loads in processor consistency

Loads issue in PC under circumstances similar to those of SC. However, a load can issue non-speculatively whenever it is preceded only by store operations. A load that is preceded by store operations must check previous stores for possible forwarding or stalling before it is allowed to issue.

Step 1e: Loads in release consistency

In RC, loads can issue non-speculatively whenever they are not prevented by previous memory barriers. As in SC and PC, a load that is preceded by store operations must check previous stores for possible forwards or stalls. However, in RC, such checks must also take place against the virtual store buffer. As the virtual store buffer is primarily a simulator abstraction, forwards from this buffer are used only to learn the final value of the load; the load itself must issue to the cache as before. However, loads must currently stall in cases of partial overlaps with instructions in the virtual store buffer. This constraint is not expected to hinder performance in applications where most data is either reused (thus keeping data in cache and giving partial overlaps short latencies) or where most pointers are strongly typed (making partial overlaps unlikely). However, if applications do not meet these constraints, it may be more desirable to simulate the actual data in the caches. As in the other models, loads hindered by memory consistency model constraints can be marked for prefetching or speculatively issued if the consistency implementation supports such accesses. Although speculative loads and prefetching are allowed around ordinary MEMBAR instructions, such optimizations are not allowed in the case of fences with the MemIssue field set.

Step 2: Issuing an instruction to the memory hierarchy

For both stores and loads, the IssueOp function actually initiates an access. First, the memprogress field is set to -1 to indicate that this instance is being issued (in the case of a forward, the memprogress field would already have been set to a negative value). This function then consumes a cache port for the access (cache ports are represented as functional units of type uMEM). The memory_rep function is then called. This function arranges for the cache port to be freed again in the next cycle if the access is not going to be sent to the cache (i.e. if the access is private or if the processor has issued a MEMSYS_OFF directive). Otherwise, the cache is responsible for freeing the cache port explicitly.

Next, the memory_latency function is called. This function starts by calling GetMap, which checks either the processor PageTable or the shared-memory SharedPageTable to determine whether this access causes a segmentation fault (alignment errors would have already been detected by GetAddr). If the access has a segmentation fault or bus error, its cache port is freed and the access is considered completed, as the access will not be sent to the cache.

If the access does not have any of the previous exceptions, it will now be issued. PREFETCH instructions are considered complete and removed from the memory unit as soon as they are issued. If the access is an ordinary load or store and is not simulated (i.e. either a private access or the processor has turned MEMSYS_OFF), it is set to complete in a single cycle. If the access is simulated, it is sent to the memory hierarchy by calling StartUpMemRef.

StartUpMemRef and the other functions in src/Processor/memprocess.cc are responsible for interfacing between the processor memory unit and the memory hierarchy itself. StartUpMemRef translates the format specified in the instance data structure to a format understood by the cache and memory simulator. This function then calls the function addrinsert to begin the simulation of an access.

addrinsert starts by initializing a memory system request data structure for this memory access. (This data structure type is described in Section 12.2.) Next, the request is inserted into its cache port. If this request fills up the cache ports, then the L1Q_FULL field is set to inform the processor not to issue further requests (this is later cleared by the cache when it processes a request from its ports). After this point, the memory system simulator is responsible for processing this access.

Step 3: Issuing any possible prefetches

After the functions that issue instructions have completed, the memory unit checks to see if any of the possible hardware prefetch opportunities marked in this cycle can be utilized. If there are cache ports available, prefetches are issued for those instructions using IssuePrefetch. These prefetches are sent to the appropriate level of the cache hierarchy, according to the command-line option used.



Vijay Sadananda Pai
Thu Aug 7 14:18:56 CDT 1997