
Instruction fetch and decode

Source files: src/Processor/pipestages.cc, src/Processor/tagcvt.cc, src/Processor/active.cc, src/Processor/stallq.cc

Headers: incl/Processor/state.h, incl/Processor/instance.h, incl/Processor/instruction.h, incl/Processor/mainsim.h, incl/Processor/decode.h, incl/Processor/tagcvt.h, incl/Processor/active.h, incl/Processor/stallq.h

Since RSIM currently does not model an instruction cache, the instruction fetch and decode pipeline stages are merged. This stage starts with the function decode_cycle, called from maindecode.

The function decode_cycle starts out by looking in the processor stall queue, which consists of instructions that were decoded in a previous cycle but could not be added to the processor active list, either because of insufficient renaming registers or insufficient active list size. The processor will stop decoding new instructions by setting the processor field stall_the_rest after the first stall of this sort, so the stall queue should have at most one element. If there is an instruction in the stall queue, check_dependencies is called for it (described below). If this function succeeds, the instruction is removed from the processor stall queue. Otherwise, the processor continues to stall instruction decoding.
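The stall-queue handling described above can be sketched as follows. This is only an illustration: SketchInstance, SketchProc, and drain_stall_queue are hypothetical names, not RSIM's types or functions, and the check callback merely stands in for check_dependencies (returning true on success).

```cpp
#include <cassert>
#include <deque>
#include <functional>

// Hypothetical stand-ins; RSIM's real types are declared in state.h and instance.h.
struct SketchInstance { int tag; };

struct SketchProc {
    std::deque<SketchInstance*> stallq;  // expected to hold at most one entry
};

// Retries the stalled instruction, if any. Returns true if the processor
// may go on to decode new instructions in this cycle.
inline bool drain_stall_queue(SketchProc& proc,
                              const std::function<bool(SketchInstance*)>& check) {
    if (proc.stallq.empty()) return true;   // nothing was stalled
    assert(proc.stallq.size() == 1);        // decoding stops after the first stall
    if (check(proc.stallq.front())) {
        proc.stallq.pop_front();            // resources obtained: leave the stall queue
        return true;
    }
    return false;                           // still stalled: decode nothing new
}
```

The sketch leaves stall_the_rest out entirely, since the text assigns the setting of that field to check_dependencies itself.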

After processing the stall queue, the processor will decode the instructions for the current cycle. If the program counter is valid for the application instruction region, the processor will read the instruction at that program counter, and convert the static instr data structure to a dynamic instance data structure through the function decode_instruction. The instance is the fundamental dynamic form of the instruction that is passed among the various functions in RSIM. If the program counter is not valid for the application, the processor checks to see if the processor is in privileged mode. If so, and if the program counter points to a valid instruction in the trap-table, the processor reads an instruction from the trap-table instead. If the processor is not in privileged mode, or the PC is not valid in the trap-table, the processor generates a single invalid instruction that will cause an illegal PC exception. Such a PC can arise through either an illegal branch or jump, or through speculation (in which case the invalid instruction will be flushed before it causes a trap).
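The three-way choice of where the next instruction comes from can be summarized in a small sketch. The names FetchSource, SketchRegion, and select_fetch_source are hypothetical, and the sketch assumes byte-addressed PCs with 4-byte aligned instructions; RSIM's own representation of the PC and of the instruction regions may differ.

```cpp
#include <cassert>
#include <cstdint>

enum class FetchSource { Application, TrapTable, InvalidInstruction };

// Hypothetical description of an instruction region (assumption, not RSIM's layout).
struct SketchRegion {
    std::uint32_t base;        // first valid byte address
    std::uint32_t size_bytes;  // length of the region in bytes
};

inline bool pc_valid_in(const SketchRegion& r, std::uint32_t pc) {
    // Assumes fixed-width, 4-byte aligned instructions.
    return pc % 4 == 0 && pc >= r.base && pc - r.base < r.size_bytes;
}

inline FetchSource select_fetch_source(std::uint32_t pc, bool privileged,
                                       const SketchRegion& app,
                                       const SketchRegion& trap_table) {
    if (pc_valid_in(app, pc)) return FetchSource::Application;
    if (privileged && pc_valid_in(trap_table, pc)) return FetchSource::TrapTable;
    // Either not privileged, or the PC is outside the trap-table as well:
    // a single invalid instruction is generated, leading to an illegal PC exception.
    return FetchSource::InvalidInstruction;
}
```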

The decode_instruction function sets a variety of fields in the instance data structure. First, the various fields associated with the memory unit are cleared, and some fields associated with instruction registers and results are cleared. The relevant statistics fields are also initialized.

Then, the tag field of the instance is set to hold the value of the processor instruction counter. The tag field is the unique instruction id of the instance; currently, this field is set to be unique for each processor throughout the course of a simulation. Then, the win_num field of the instance is set. This represents the processor's register window pointer (cwp or current window pointer) at the time of decoding this instruction.

decode_instruction then sets the functional unit type and initializes dependence fields for this instance. Additionally, the stall_the_rest field of the processor is cleared; since a new instruction is being decoded, it is now up to the progress of this instruction to determine whether or not the processor will stall.

At this point, the instance must determine its logical source registers and the physical registers to which they are mapped. In the case of integer registers (which may be windowed), the function convert_to_logical is called to convert from a window number and architectural register number to an integer register identifier that identifies the logical register number used to index into the register map table (which does not account for register windows). If an invalid source register number is specified, the instruction will be marked with an illegal instruction trap.
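To make the window-to-logical conversion concrete, here is one possible flattening of windowed integer registers. It is a sketch under stated assumptions, not the body of convert_to_logical: it assumes 8 global registers, 16 registers unique to each window, and an overlap in which the "in" registers of window w are the "out" registers of window w+1. The direction of the overlap and the actual numbering used by RSIM are not specified in this section; sketch_convert_to_logical and the constants are hypothetical names.

```cpp
#include <cassert>

constexpr int kGlobals = 8;           // r0..r7 are shared by all windows
constexpr int kUniquePerWindow = 16;  // 8 outs + 8 locals owned by each window
constexpr int kInvalidReg = -1;       // caller would mark an illegal instruction trap

// Maps (window number, architectural register 0..31) to a flat logical
// register number that no longer depends on the window pointer.
inline int sketch_convert_to_logical(int win, int arch_reg, int num_windows) {
    assert(num_windows > 0 && win >= 0 && win < num_windows);
    if (arch_reg < 0 || arch_reg > 31) return kInvalidReg;
    if (arch_reg < kGlobals) return arch_reg;  // globals are not windowed
    const int windowed_total = kUniquePerWindow * num_windows;
    // Outs (r8..r15) and locals (r16..r23) fall in this window's own block;
    // ins (r24..r31) spill into the next block, which wraps around at the end.
    const int offset = (win * kUniquePerWindow + (arch_reg - kGlobals)) % windowed_total;
    return kGlobals + offset;
}
```

With this layout, sketch_convert_to_logical(0, 24, 8) and sketch_convert_to_logical(1, 8, 8) name the same logical register, which is how a value passed in an "out" register becomes visible as an "in" register of the adjacent window.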

Next, the instance must handle the case where it is an instruction that will change the processor's register window pointer (such as SAVE or RESTORE). The processor provides two fields (CANSAVE and CANRESTORE) that identify the number of windowing operations that can be allowed to proceed [23]. If the processor cannot handle the current windowing operation, this instance must be marked with a register window trap, which will later be processed by the appropriate trap handler. Otherwise, the instance will change its win_num to reflect the new register window number.
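The windowing check can be sketched in two steps: a decode-time test that only computes the new win_num for the instance, and a later commit that updates the processor. The sketch follows the SPARC V9 convention (SAVE increments the window pointer and consumes one CANSAVE; RESTORE does the reverse); whether RSIM numbers its windows in the same direction is an assumption here, and all type and function names are hypothetical.

```cpp
#include <cassert>

enum class WinOp { Save, Restore };
enum class WinResult { Ok, WindowTrap };

// Hypothetical snapshot of the processor's window state.
struct SketchWindowState {
    int cwp;          // current window pointer
    int cansave;      // SAVEs that may proceed without a trap
    int canrestore;   // RESTOREs that may proceed without a trap
    int num_windows;
};

// Decode time: decide whether the operation may proceed and, if so, compute
// the window number the instance should carry. The processor is not modified.
inline WinResult decode_window_op(const SketchWindowState& s, WinOp op, int& win_num) {
    assert(s.num_windows > 0);
    if (op == WinOp::Save) {
        if (s.cansave == 0) return WinResult::WindowTrap;
        win_num = (s.cwp + 1) % s.num_windows;
    } else {
        if (s.canrestore == 0) return WinResult::WindowTrap;
        win_num = (s.cwp + s.num_windows - 1) % s.num_windows;
    }
    return WinResult::Ok;
}

// Later, once renaming has succeeded: apply the change to the processor.
inline void commit_window_op(SketchWindowState& s, WinOp op, int win_num) {
    s.cwp = win_num;
    if (op == WinOp::Save) { --s.cansave; ++s.canrestore; }
    else                   { ++s.cansave; --s.canrestore; }
}
```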

In a release consistent system, the processor will now detect MEMBAR operations and note the imposed ordering constraints. These constraints will be used by the memory unit.

The instance will now determine its logical destination register numbers, which will later be used in the renaming stage. If the previous instruction was a delayed branch, it would have set the processor's copymappernext field (as described below). If the copymappernext field is set, then this instruction is the delay slot of the previous delayed branch and must try to allocate a shadow mapper. The branchdep field of the instance is set to indicate this.

Now the processor PC and NPC are stored with each created instance. We store program counters with each instruction not to imitate the actual behavior of a system, but rather as a simulator abstraction. If the instance is a branch instruction, the function decode_branch_instruction is called to predict or set the new program counter values; otherwise, the PC is updated to the NPC, and the NPC is incremented. decode_branch_instruction may also set the branchdep field of the instance (for predicted branches that may annul the delay slot), the copymappernext field of the processor (for predicted, delayed branches), or the unpredbranch field of the processor (for unpredicted branches).
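The PC/NPC bookkeeping is easiest to see in a small sketch. It assumes byte addresses with a 4-byte instruction step (RSIM may count in instruction indices instead) and uses the standard SPARC rule for a taken delayed branch, in which the PC takes the old NPC and the NPC takes the branch target. SketchPCs and both functions are hypothetical names; the prediction logic of decode_branch_instruction is not modeled.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kInstrStep = 4;  // assumption: 4-byte instructions

struct SketchPCs {
    std::uint32_t pc;   // instruction being decoded now
    std::uint32_t npc;  // instruction to be decoded next
};

// Non-branch instruction: PC takes the NPC, and the NPC is incremented.
inline void advance_non_branch(SketchPCs& p) {
    p.pc = p.npc;
    p.npc += kInstrStep;
}

// Taken delayed branch: the delay slot (old NPC) is decoded next,
// followed by the branch target.
inline void advance_taken_delayed_branch(SketchPCs& p, std::uint32_t target) {
    p.pc = p.npc;
    p.npc = target;
}
```

Starting from pc = 100 and npc = 104, a non-branch instruction followed by a taken delayed branch to 200 yields the decode order 100, 104 (the branch), 108 (the delay slot), 200.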

If the instance is predicted as a taken branch, then the processor will temporarily set the stall_the_rest field to prevent any further instructions from being decoded this cycle, as we currently assume that the processor cannot decode instructions from different regions of the address space in the same cycle.

After this point, control returns to decode_cycle. This function now adds the decoded instruction to the tag converter, a structure used to convert from the tag of an instance to a pointer to the instance data structure. This structure is used internally for communication among the modules of the simulator.
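Conceptually, the tag converter is a lookup table from tags to instance pointers. The following sketch uses a hash map for brevity; the structure actually implemented in tagcvt.cc may be organized quite differently, and SketchInst and SketchTagConverter are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

struct SketchInst { std::int64_t tag; };  // stand-in for the instance structure

class SketchTagConverter {
public:
    // Registers a newly decoded instance under its unique tag.
    void add(SketchInst* inst) {
        assert(inst != nullptr);
        table_[inst->tag] = inst;
    }
    // Returns the instance for a tag, or nullptr if the tag is unknown
    // (for example, because the instance has already been removed).
    SketchInst* lookup(std::int64_t tag) const {
        auto it = table_.find(tag);
        return it == table_.end() ? nullptr : it->second;
    }
    // Forgets a tag once its instance is no longer in flight.
    void remove(std::int64_t tag) { table_.erase(tag); }

private:
    std::unordered_map<std::int64_t, SketchInst*> table_;
};
```

Passing tags rather than raw pointers between modules lets a consumer detect that an instance has since been flushed, because the lookup then fails instead of dereferencing stale memory.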

Now the check_dependencies function is called for the dynamic instruction. If RSIM was invoked with the ``-q'' option and there are too many unissued instructions to allow this one into the issue window, this function will stall further decoding and return. If RSIM was invoked with the ``-X'' option for static scheduling and even one prior instruction is still waiting to issue (to the ALU, FPU, or address generation unit), further decoding is stopped and this function returns. Otherwise, this function will attempt to provide renaming registers for each of the destination registers of this instruction, stalling if there are none available. As each register is remapped in this fashion, the old mapping is added to the active list (so that the appropriate register will be freed when this instruction graduates), again stalling if the active list has filled up. It is only after this point that a windowing instruction actually changes the register window pointer of the processor, updating the CANSAVE and CANRESTORE fields appropriately. Note that single-precision floating-point registers (referred to as REG_FPHALF) are mapped and renamed according to double-precision boundaries to account for the register-pairing present in the SPARC architecture [23]. As a result, single-precision floating-point codes are likely to experience significantly poorer performance than double-precision codes, actually experiencing the negative effects of anti-dependences and output-dependences which are otherwise resolved by register renaming.
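The effect of renaming single-precision registers on double-precision boundaries can be illustrated with a short sketch. It assumes the usual SPARC pairing in which single-precision registers f(2k) and f(2k+1) together form one double-precision register; the function names are hypothetical and do not appear in RSIM.

```cpp
#include <cassert>

// Index of the double-precision pair that contains a single-precision
// register (0..31). Renaming is assumed to operate on this index.
inline int sketch_fp_pair_index(int fp_single_reg) {
    assert(fp_single_reg >= 0 && fp_single_reg < 32);
    return fp_single_reg / 2;
}

// Two single-precision registers that share a pair also share a renaming
// unit, so a write to one must be ordered against uses of the other.
inline bool sketch_share_rename_unit(int a, int b) {
    return sketch_fp_pair_index(a) == sketch_fp_pair_index(b);
}
```

Under this assumption, a write to f3 cannot simply be given a fresh register independent of f2, which is why anti-dependences and output-dependences reappear for single-precision code.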

If a resource was not available at any point above, check_dependencies will set stall_the_rest and return an error code, allowing the instance to be added to the stall queue. Although the simulator assumes that there are enough renaming registers for the specified active-list size by default, check_dependencies also includes code to stall if the instruction could not obtain its desired renaming registers.
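A minimal sketch of destination renaming with the two stall conditions follows. It is not the code of check_dependencies: the types, the result codes, and rename_dest are hypothetical, and for simplicity the sketch tests both resources before changing any state, whereas the text describes the active-list check as happening while registers are being remapped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One active-list record: enough to free the old register at graduation.
struct ActiveEntry { long long tag; int logical; int old_physical; };

struct SketchRenamer {
    std::vector<int> map_table;            // logical -> physical register
    std::vector<int> free_list;            // unused physical registers
    std::vector<ActiveEntry> active_list;  // in-flight remappings, oldest first
    std::size_t max_active;                // active list capacity
};

enum class RenameResult { Ok, NoFreeRegister, ActiveListFull };

// Tries to give `logical` a fresh physical register on behalf of `tag`.
// On failure nothing is modified, so the caller can stall and retry later.
inline RenameResult rename_dest(SketchRenamer& r, long long tag, int logical) {
    assert(logical >= 0 && static_cast<std::size_t>(logical) < r.map_table.size());
    if (r.free_list.empty()) return RenameResult::NoFreeRegister;
    if (r.active_list.size() >= r.max_active) return RenameResult::ActiveListFull;
    const int fresh = r.free_list.back();
    r.free_list.pop_back();
    r.active_list.push_back({tag, logical, r.map_table[logical]});  // remember old mapping
    r.map_table[logical] = fresh;
    return RenameResult::Ok;
}
```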

After the instance has received its renaming registers and active list space, check_dependencies continues with further processing. If the instruction requires a shadow mapper (has branchdep set to 2, as described above), the processor tries to allocate a shadow mapper by calling AddBranchQ. If a shadow mapper is available, the branchdep field is cleared. Otherwise, the stall_the_rest field of the processor is set and the instance is added to the queue of instructions waiting for shadow mappers. If the processor had its unpredbranch field set, the stall_the_rest field is set, either at the branch itself (on an annulling branch), or at the delay slot (for a non-annulling delayed branch).

The instance now checks for outstanding register dependences. The instance checks the busy bit of each source register (for single-precision floating-point operations, this includes the destination register as well). For each busy bit that is set, the instruction is put on a distributed stall queue for the appropriate register. If any busy bit is set, the truedep field is set to 1. If the busy bits of rs2 or rscc are set, the addrdep field is set to 1 (this field is used to allow memory operations to generate their addresses while the source registers for their value might still be outstanding).
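The busy-bit test can be sketched as below. The field names truedep and addrdep and the sources rs2 and rscc come from the text; the rs1 source, the use of -1 for an unused source, and the SketchSources, SketchDeps, and check_busy names are assumptions made for illustration. The per-register stall queues and the extra destination check for single-precision operations are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Physical source registers of an instance; -1 marks an unused source (assumption).
struct SketchSources { int rs1; int rs2; int rscc; };

struct SketchDeps {
    bool truedep = false;  // some source register is still being produced
    bool addrdep = false;  // rs2 or rscc is still being produced
};

inline SketchDeps check_busy(const std::vector<bool>& busy, const SketchSources& s) {
    auto is_busy = [&busy](int reg) {
        if (reg < 0) return false;  // unused source
        assert(static_cast<std::size_t>(reg) < busy.size());
        return static_cast<bool>(busy[static_cast<std::size_t>(reg)]);
    };
    SketchDeps d;
    d.truedep = is_busy(s.rs1) || is_busy(s.rs2) || is_busy(s.rscc);
    d.addrdep = is_busy(s.rs2) || is_busy(s.rscc);
    return d;
}
```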

If the instruction is a memory operation, it is now dispatched to the memory unit, if there is space for it. If there is no space, either the operation is attached to a queue of instructions waiting for the memory unit (if the processor has dynamic scheduling and ``-q'' was not used to invoke RSIM), or the processor is stalled until space is available (if the processor has static scheduling, or has dynamic scheduling with the ``-q'' option to RSIM).
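The three possible outcomes for a memory operation reduce to a small decision function. MemDispatch and dispatch_memory_op are hypothetical names; the two boolean parameters stand for static scheduling being selected and for RSIM having been invoked with the ``-q'' option.

```cpp
#include <cassert>

enum class MemDispatch {
    Dispatched,        // entered the memory unit immediately
    QueuedForMemUnit,  // waits in a queue until the memory unit has space
    StallDecode        // processor stalls until space is available
};

inline MemDispatch dispatch_memory_op(bool mem_unit_has_space,
                                      bool static_scheduling,
                                      bool q_option) {
    if (mem_unit_has_space) return MemDispatch::Dispatched;
    // Only a dynamically scheduled processor without -q may queue the operation.
    if (!static_scheduling && !q_option) return MemDispatch::QueuedForMemUnit;
    return MemDispatch::StallDecode;
}
```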

If the instruction has no true dependences, the SendToFU function is called to allow this instruction to issue in the next stage.

decode_cycle continues looping until it decodes all the instructions it can (and is allowed to by the architectural specifications) in a given cycle.
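The overall shape of the per-cycle loop might be sketched as follows. This assumes that the architectural limit is a fixed decode width per cycle, which the text does not state explicitly; sketch_decode_loop is a hypothetical name, and decode_one stands for the whole sequence of decoding one instruction and checking its dependences, returning false when the processor must stop decoding (stall_the_rest set).

```cpp
#include <cassert>
#include <functional>

// Decodes up to `decode_width` instructions in one cycle and returns how
// many were decoded. An instruction that triggers a stall still counts,
// since it has been decoded and is now waiting in the stall queue.
inline int sketch_decode_loop(int decode_width,
                              const std::function<bool()>& decode_one) {
    assert(decode_width >= 0);
    int decoded = 0;
    while (decoded < decode_width) {
        const bool keep_going = decode_one();
        ++decoded;
        if (!keep_going) break;  // stall_the_rest: no more decoding this cycle
    }
    return decoded;
}
```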


Vijay Sadananda Pai
Thu Aug 7 14:18:56 CDT 1997