### **Computer Organization and Architecture** Designing for Performance

### 11<sup>th</sup> Edition



### Chapter 17

Reduced Instruction Set Computers

Copyright © 2019, 2016, 2013 Pearson Education, Inc. All Rights Reserved

### Table 17.1 Characteristics of Some CISCs, RISCs, and Superscalar Processors (1 of 2)

| Characteristic                          | IBM<br>370/168<br>(CISC) | VAX<br>11/780<br>(CISC) | Intel<br>80486<br>(CISC) | SPARC<br>(RISC) | MIPS<br>R4000<br>(RISC) |
|-----------------------------------------|----------------|-----------------------------------|--------------------------------------------|--------|---------------|
| Year developed                          | 1973           | 1978                              | 1989                                       | 1987   | 1991          |
| Number of instructions                  | 208            | 303                               | 235                                        | 69     | 94            |
| Instruction size (bytes)                | 2–6            | 2–57                              | 1–11                                       | 4      | 4             |
| Addressing modes                        | 4              | 22                                | 11                                         | 1      | 1             |
| Number of general-<br>purpose registers | 16             | 16                                | 8                                          | 40–520 | 32            |
| Control memory size<br>(kbits)          | 420            | 480                               | 246                                        | -      | -             |
| Cache size (kB)                         | 64             | 64                                | 8                                          | 32     | 128           |

(Table can be found on page 589 in the textbook.)

### Table 17.1 Characteristics of Some CISCs, RISCs, and Superscalar Processors (2 of 2)

| Characteristic                      | PowerPC<br>(Superscalar) | Ultra<br>SPARC<br>(Superscalar) | MIPS<br>R10000<br>(Superscalar) |
|-------------------------------------|---------|----------------|----------------|
| Year developed                      | 1993    | 1996           | 1996           |
| Number of instructions              | 225     |                |                |
| Instruction size (bytes)            | 4       | 4              | 4              |
| Addressing modes                    | 2       | 1              | 1              |
| Number of general-purpose registers | 32      | 40–520         | 32             |
| Control memory size<br>(kbits)      | -       | -              | -              |
| Cache size (kB)                     | 16–32   | 32             | 64             |

(Table can be found on page 589 in the textbook.)

### Instruction Execution Characteristics

### High-level languages (HLLs)

- Allow the programmer to express algorithms more concisely
- Allow the compiler to take care of details that are not important in the programmer's expression of algorithms
- Often naturally support the use of structured programming and/or object-oriented design

#### **Execution sequencing**

- Determines the control and pipeline organization

### Semantic gap

- The difference between the operations provided in HLLs and those provided in computer architecture

#### **Operands used**

- The types of operands and the frequency of their use determine the memory organization for storing them and the addressing modes for accessing them

#### **Operations performed**

- Determine the functions to be performed by the processor and its interaction with memory

Universidad de Costa Rica

### Table 17.2 Weighted Relative Dynamic Frequency of HLL Operations

|        | Dynamic Occurrence<br>(Pascal) | Dynamic Occurrence<br>(C) | Machine-Instruction Weighted<br>(Pascal) | Machine-Instruction Weighted<br>(C) | Memory-Reference Weighted<br>(Pascal) | Memory-Reference Weighted<br>(C) |
|--------|-----|-----|-----|-----|-----|-----|
| ASSIGN | 45% | 38% | 13% | 13% | 14% | 15% |
| LOOP   | 5%  | 3%  | 42% | 32% | 33% | 26% |
| CALL   | 15% | 12% | 31% | 33% | 44% | 45% |
| IF     | 29% | 43% | 11% | 21% | 7%  | 13% |
| GOTO   | -   | 3%  | -   | -   | -   | -   |
| OTHER  | 6%  | 1%  | 3%  | 1%  | 2%  | 1%  |

(Table can be found on page 591 in the textbook.)

### Table 17.3 Dynamic Percentage of Operands

|                  | Pascal | C   | Average |
|------------------|--------|-----|---------|
| Integer constant | 16%    | 23% | 20%     |
| Scalar variable  | 58%    | 53% | 55%     |
| Array/Structure  | 26%    | 24% | 25%     |

### Table 17.4 Procedure Arguments and Local Scalar Variables

| Percentage of Executed<br>Procedure Calls With | Compiler, Interpreter,<br>and Typesetter | Small Nonnumeric<br>Programs |
|------------------------------------------------|------------------------------------------|------------------------------|
| > 3 arguments                                  | 0–7%                                     | 0–5%                         |
| > 5 arguments                                  | 0–3%                                     | 0%                           |
| > 8 words of arguments<br>and local scalars    | 1–20%                                    | 0–6%                         |
| > 12 words of arguments<br>and local scalars   | 1–6%                                     | 0–3%                         |

(Table can be found on page 592 in the textbook.)

# Implications

- HLLs can best be supported by optimizing performance of the most time-consuming features of typical HLL programs
- Three elements characterize RISC architectures:
  - Use a large number of registers or use a compiler to optimize register usage
  - Careful attention needs to be paid to the design of instruction pipelines
  - Instructions should have predictable costs and be consistent with a high-performance implementation

## The Use of a Large Register File

#### **Software Solution**

- Requires compiler to allocate registers
- Allocates based on most used variables in a given time
- Requires sophisticated program analysis

#### **Hardware Solution**

- More registers
- Thus more variables will be in registers

### Figure 17.1 Overlapping Register Windows






## Global Variables

- Variables declared as global in an HLL can be assigned memory locations by the compiler, and all machine instructions that reference these variables will use memory-reference operands
  - However, for frequently accessed global variables this scheme is inefficient
- Alternative is to incorporate a set of global registers in the processor
  - These registers would be fixed in number and available to all procedures
  - A unified numbering scheme can be used to simplify the instruction format
- There is an increased hardware burden to accommodate the split in register addressing
- In addition, the linker must decide which global variables should be assigned to registers
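
The unified numbering scheme and the overlap between adjacent register windows can be sketched as a mapping from logical to physical registers. This is a minimal illustration, not any processor's actual implementation: the constants (8 globals, 24 windowed registers per procedure, 4 windows), the convention that a call moves to the next higher window, and the function name `physical_register` are all assumptions chosen for readability.

```python
# Assumed, simplified SPARC-like layout; every constant here is illustrative.
N_GLOBALS = 8        # logical r0..r7 are global registers
N_WINDOWS = 4        # hypothetical number of windows in the circular file
WINDOW_STRIDE = 16   # each window owns 8 "in" + 8 "local" physical registers


def physical_register(window: int, r: int) -> tuple:
    """Map logical register r (0..31), seen from `window`, to a physical slot.

    Returns ("global", index) or ("windowed", index).
    Assumption: a procedure call advances from window w to window w + 1.
    """
    if not 0 <= r < 32:
        raise ValueError("logical register must be in 0..31")
    if r < N_GLOBALS:
        # Global registers ignore the window: every procedure sees the same ones.
        return ("global", r)
    file_size = WINDOW_STRIDE * N_WINDOWS
    base = WINDOW_STRIDE * window
    if r < 16:
        # r8..r15 ("outs"): physically the "ins" of the next window.
        offset = 16 + (r - 8)
    elif r < 24:
        # r16..r23 ("locals"): private to this window.
        offset = 8 + (r - 16)
    else:
        # r24..r31 ("ins"): physically the "outs" of the previous window.
        offset = r - 24
    # The modulo makes the register file circular.
    return ("windowed", (base + offset) % file_size)
```

With this mapping, a caller in window 1 that writes its logical register r8 places the value exactly where the callee in window 2 reads its logical register r24, so parameters are passed without copying. The split between the global range and the windowed range is the extra addressing work that the hardware has to perform.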

### Table 17.5 Characteristics of Large-Register-File and Cache Organizations

| Large Register File                                   | Cache                                             |
|-------------------------------------------------------|---------------------------------------------------|
| All local scalars                                     | Recently-used local scalars                       |
| Individual variables                                  | Blocks of memory                                  |
| Compiler-assigned global variables                    | Recently-used global variables                    |
| Save/Restore based on procedure nesting depth         | Save/Restore based on cache replacement algorithm |
| Register addressing                                   | Memory addressing                                 |
| Multiple operands addressed and accessed in one cycle | One operand addressed and accessed per cycle      |

(Table can be found on page 597 in the textbook.)


### Why CISC ?

### (Complex Instruction Set Computer)

- There is a trend to richer instruction sets, which include a larger number of instructions and more complex instructions
- Two principal reasons for this trend:
  - A desire to simplify compilers
  - A desire to improve performance
- There are two advantages to smaller programs:
  - The program takes up less memory
  - Performance should improve
    - Fewer instructions means fewer instruction bytes to be fetched
    - In a paging environment smaller programs occupy fewer pages, reducing page faults
    - More instructions fit in cache(s)

### Table 17.6 Code Size Relative to RISC I

|            | [PATT82a] 11 C<br>Programs | [KATE83] 12 C<br>Programs | [HEAT84] 5 C<br>Programs |
|------------|----------------------------|---------------------------|--------------------------|
| RISC I     | 1.0                        | 1.0                       | 1.0                      |
| VAX-11/780 | 0.8                        | 0.67                      |                          |
| M68000     | 0.9                        |                           | 0.9                      |
| Z8002      | 1.2                        |                           | 1.12                     |
| PDP-11/70  | 0.9                        | 0.71                      |                          |

(Table can be found on page 601 in the textbook.)




### Characteristics of Reduced Instruction Set Architectures (2 of 2)

### "Circumstantial Evidence"

- More effective optimizing compilers can be developed
  - With more primitive instructions, there are more opportunities for moving functions out of loops, reorganizing code for efficiency and maximizing register utilization
  - It is even possible to compute parts of complex instructions at compile time
- Most instructions generated by a compiler are relatively simple anyway
  - It would seem reasonable that a control unit built specifically for those instructions and using little or no microcode could execute them faster than a comparable CISC
- RISC researchers feel that the instruction pipelining technique can be applied much more effectively with a reduced instruction set
- RISC processors are more responsive to interrupts because interrupts are checked between rather elementary operations
  - Architectures with complex instructions either restrict interrupts to instruction boundaries or must define specific interruptible points and implement mechanisms for restarting an instruction



## Table 17.7 Characteristics of Some Processors

| Processor      | Number of<br>instruction<br>sizes | Max<br>instruction<br>size<br>in bytes | Number of<br>addressing<br>Modes | Indirect        | Load/store<br>combined<br>with<br>arithmetic | Max<br>number<br>of<br>memory<br>operands | Unaligned<br>addressing<br>Allowed | Max<br>number<br>of MMU<br>uses | Number of<br>bits for<br>integer<br>register<br>specifier | Number<br>of bits for<br>FP register<br>specifier |
|----------------|-----------------------------------|----------------------------------------|----------------------------------|-----------------|----------------------------------------------|-------------------------------------------|------------------------------------|---------------------------------|-----------------------------------------------------------|---------------------------------------------------|
| AMD29000       | 1                                 | 4                                      | 1                                | no              | no                                           | 1                                         | no                                 | 1                               | 8                                                         | 3<sup>a</sup>                                     |
| MIPS<br>R2000  | 1                                 | 4                                      | 1                                | no              | no                                           | 1                                         | no                                 | 1                               | 5                                                         | 4                                                 |
| SPARC          | 1                                 | 4                                      | 2                                | no              | no                                           | 1                                         | no                                 | 1                               | 5                                                         | 4                                                 |
| MC88000        | 1                                 | 4                                      | 3                                | no              | no                                           | 1                                         | no                                 | 1                               | 5                                                         | 4                                                 |
| HP PA          | 1                                 | 4                                      | 10 <sup>a</sup>                  | no              | no                                           | 1                                         | no                                 | 1                               | 5                                                         | 4                                                 |
| IBM RT/PC      | 2 <sup>a</sup>                    | 4                                      | 1                                | no              | no                                           | 1                                         | no                                 | 1                               | 4 <sup>a</sup>                                            | 3 <sup>a</sup>                                    |
| IBM<br>RS/6000 | 1                                 | 4                                      | 4                                | no              | no                                           | 1                                         | yes                                | 1                               | 5                                                         | 5                                                 |
| Intel i860     | 1                                 | 4                                      | 4                                | no              | no                                           | 1                                         | no                                 | 1                               | 5                                                         | 4                                                 |
| IBM 3090       | 4                                 | 8                                      | 2 <sup>b</sup>                   | no <sup>b</sup> | yes                                          | 2                                         | yes                                | 4                               | 4                                                         | 2                                                 |
| Intel 80486    | 12                                | 12                                     | 15                               | no<sup>b</sup>  | yes                                          | 2                                         | yes                                | 4                               | 3                                                         | 3                                                 |
| NSC 32016      | 21                                | 21                                     | 23                               | yes             | yes                                          | 2                                         | yes                                | 4                               | 3                                                         | 3                                                 |
| MC68040        | 11                                | 22                                     | 44                               | yes             | yes                                          | 2                                         | yes                                | 8                               | 4                                                         | 3                                                 |
| VAX            | 56                                | 56                                     | 22                               | yes             | yes                                          | 6                                         | yes                                | 24                              | 4                                                         | 0                                                 |
| Clipper        | 4 <sup>a</sup>                    | 8 <sup>a</sup>                         | 9 <sup>a</sup>                   | no              | no                                           | 1                                         | 0                                  | 2                               | 4 <sup>a</sup>                                            | 3 <sup>a</sup>                                    |
| Intel 80960    | 2<sup>a</sup>                     | 8<sup>a</sup>                          | 9<sup>a</sup>                    | no              | no                                           | 1                                         | yes<sup>a</sup>                    | -                               | 5                                                         | 3<sup>a</sup>                                     |

a RISC that does not conform to this characteristic

b CISC that does not conform to this characteristic

(Table can be found on page 605 in the textbook.)



## Optimization of Pipelining

- Delayed branch
  - Does not take effect until after execution of following instruction
  - This following instruction is the delay slot
- Delayed Load
  - Register to be the target of the load is locked by the processor
  - Continue execution of instruction stream until register required
  - Idle until load is complete
    - Re-arranging instructions can allow useful work while loading

- Loop Unrolling
  - Replicate body of loop a number of times
  - Iterate loop fewer times
  - Reduces loop overhead
  - Increases instruction parallelism
  - Improved register, data cache, or TLB locality
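
Loop unrolling can be illustrated with a source-level sketch. The function names and the unroll factor of 4 are assumptions made for this example; in practice the transformation is applied by the compiler to machine code, and in Python the sketch only shows the shape of the transformation, not a guaranteed speedup.

```python
def sum_rolled(values):
    """Baseline: one loop test and one index update per element."""
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def sum_unrolled_by_4(values):
    """Same result, with the loop body replicated four times.

    The main loop iterates a quarter as often, so the loop overhead
    (test, increment, branch) is paid once per four elements.
    """
    total = 0
    n = len(values)
    limit = n - n % 4          # largest multiple of 4 not exceeding n
    i = 0
    while i < limit:
        total += values[i]
        total += values[i + 1]
        total += values[i + 2]
        total += values[i + 3]
        i += 4
    # Cleanup loop for the 0 to 3 leftover elements.
    while i < n:
        total += values[i]
        i += 1
    return total
```

The four additions in the unrolled body share no loop-control instruction between them, which is where the extra scheduling freedom comes from; the cleanup loop is needed whenever the trip count is not a multiple of the unroll factor.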

### Table 17.8 Normal and Delayed Branch

| Address | Normal Branch | Delayed Branch | Optimized<br>Delayed Branch |
|---------|---------------|----------------|-----------------------------|
| 100     | LOAD X, rA    | LOAD X, rA     | LOAD X, rA                  |
| 101     | ADD 1, rA     | ADD 1, rA      | JUMP 105                    |
| 102     | JUMP 105      | JUMP 106       | ADD 1, rA                   |
| 103     | ADD rA, rB    | NOOP           | ADD rA, rB                  |
| 104     | SUB rC, rB    | ADD rA, rB     | SUB rC, rB                  |
| 105     | STORE rA, Z   | SUB rC, rB     | STORE rA, Z                 |
| 106     |               | STORE rA, Z    |                             |

(Table can be found on page 608 in the textbook.)
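
The three instruction sequences of Table 17.8 can be checked with a toy interpreter. This is a sketch under stated assumptions: the tuple encoding of instructions, the function `run`, and the initial value of X are invented for illustration, and pipeline timing is not modeled, only the rule that a delayed jump takes effect after the instruction in its delay slot has executed.

```python
NORMAL = {
    100: ("LOAD", "X", "rA"), 101: ("ADD", 1, "rA"), 102: ("JUMP", 105),
    103: ("ADD", "rA", "rB"), 104: ("SUB", "rC", "rB"),
    105: ("STORE", "rA", "Z"),
}
DELAYED = {
    100: ("LOAD", "X", "rA"), 101: ("ADD", 1, "rA"), 102: ("JUMP", 106),
    103: ("NOOP",), 104: ("ADD", "rA", "rB"), 105: ("SUB", "rC", "rB"),
    106: ("STORE", "rA", "Z"),
}
OPTIMIZED = {
    100: ("LOAD", "X", "rA"), 101: ("JUMP", 105), 102: ("ADD", 1, "rA"),
    103: ("ADD", "rA", "rB"), 104: ("SUB", "rC", "rB"),
    105: ("STORE", "rA", "Z"),
}


def run(program, delayed, x=5):
    """Execute a program; return (final value of Z, instructions executed)."""
    regs = {"rA": 0, "rB": 0, "rC": 0}
    mem = {"X": x, "Z": None}
    pc = min(program)
    pending = None   # target of a delayed jump whose delay slot has not run yet
    executed = 0
    while pc in program:
        op, *args = program[pc]
        executed += 1
        # If a delayed jump is pending, this instruction is its delay slot,
        # and control transfers to the jump target right after it.
        next_pc = pc + 1 if pending is None else pending
        pending = None
        if op == "LOAD":
            regs[args[1]] = mem[args[0]]
        elif op == "STORE":
            mem[args[1]] = regs[args[0]]
        elif op == "ADD":
            src = args[0] if isinstance(args[0], int) else regs[args[0]]
            regs[args[1]] += src
        elif op == "SUB":
            regs[args[1]] -= regs[args[0]]
        elif op == "JUMP":
            if delayed:
                pending = args[0]      # takes effect after the next instruction
            else:
                next_pc = args[0]      # takes effect immediately
        # NOOP does nothing.
        pc = next_pc
    return mem["Z"], executed
```

With X = 5, all three programs store 6 in Z. The plain delayed-branch version executes five instructions because of the NOOP, while the optimized version executes four: the ADD that used to precede the jump now fills the delay slot and does useful work.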






### MIPS R4000

One of the first commercially available RISC chip sets was developed by MIPS Technology Inc.

Inspired by an experimental system developed at Stanford

Has substantially the same architecture and instruction set as the earlier MIPS designs (R2000 and R3000)

Uses 64 bits for all internal and external data paths and for addresses, registers, and the ALU

Is partitioned into two sections, one containing the CPU and the other containing a coprocessor for memory management

Supports thirty-two 64-bit registers

Provides for up to 128 Kbytes of high-speed cache, half each for instructions and data




## Instruction Pipeline

- With its simplified instruction architecture, the MIPS can achieve very efficient pipelining
- The initial experimental RISC systems and the first generation of commercial RISC processors achieve execution speeds that approach one instruction per system clock cycle
- To improve on this performance, two classes of processors have evolved to offer execution of multiple instructions per clock cycle
  - Superscalar architecture
    - Replicates each of the pipeline stages so that two or more instructions at the same stage of the pipeline can be processed simultaneously
    - Limitations: dependencies between instructions in different pipelines can slow down the system, and overhead logic is required to coordinate these dependencies
  - Super-pipelined architecture
    - Makes use of more, and more fine-grained, pipeline stages
    - With more stages, more instructions can be in the pipeline at the same time, increasing parallelism
    - Limitation: there is overhead associated with transferring instructions from one stage to the next
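
The trade-offs above can be made concrete with an idealized timing model. The sketch assumes no hazards, no stalls, and no stage-transfer overhead, so it gives a best case rather than a prediction for any real processor; the function names and the example parameters (5 stages, 100 instructions) are illustrative.

```python
import math


def pipeline_cycles(n_instructions, n_stages, issue_width=1):
    """Ideal cycle count: fill the pipeline once, then retire
    `issue_width` instructions per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + math.ceil(n_instructions / issue_width) - 1


def execution_time(n_instructions, n_stages, cycle_time, issue_width=1):
    """Ideal execution time in the same unit as `cycle_time`."""
    return pipeline_cycles(n_instructions, n_stages, issue_width) * cycle_time


def cycles_per_instruction(n_instructions, n_stages):
    """Average CPI of a scalar pipeline; tends to 1 as the program grows."""
    return pipeline_cycles(n_instructions, n_stages) / n_instructions
```

For 100 instructions on a 5-stage pipeline with a cycle time of 1 unit, the scalar pipeline needs 104 units (CPI 1.04). A 2-way superscalar version of the same pipeline needs 54 units in this ideal model, and a superpipelined version with 10 stages at half the cycle time needs 54.5 units. Real designs fall short of both figures for the reasons listed in the bullets.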




## Table 17.9 R3000 Pipeline Stages

| Pipeline<br>Stage | Phase   | Function |
|-------------------|---------|----------|
| IF                | φ1      | Using the TLB, translate an instruction virtual address to a physical address (after a branching decision). |
| IF                | φ2      | Send the physical address to the instruction address. |
| RD                | φ1      | Return instruction from instruction cache.<br>Compare tags and validity of fetched instruction. |
| RD                | φ2      | Decode instruction.<br>Read register file.<br>If branch, calculate branch target address. |
| ALU               | φ1 + φ2 | If register-to-register operation, the arithmetic or logical operation is performed. |
| ALU               | φ1      | If a branch, decide whether the branch is to be taken or not.<br>If a memory reference (load or store), calculate data virtual address. |
| ALU               | φ2      | If a memory reference, translate data virtual address to physical using TLB. |
| MEM               | φ1      | If a memory reference, send physical address to data cache. |
| MEM               | φ2      | If a memory reference, return data from data cache, and check tags. |
| WB                | φ1      | Write to register file. |



- IF = Instruction fetch first half
- IS = Instruction fetch second half
- RF = Fetch operands from register
- EX = Instruction execute
- IC = Instruction cache
- DC = Data cache
- DF = Data cache first half
- DS = Data cache second half
- TC = Tag check
- WB = Write back to register file

### **R4000 Pipeline Stages**

- Instruction fetch first half
  - Virtual address is presented to the instruction cache and the translation lookaside buffer (TLB)
- Instruction fetch second half
  - Instruction cache outputs the instruction and the TLB generates the physical address
- Register file
  - Three activities occur in parallel:
    - Instruction is decoded and check made for interlock conditions
    - Instruction cache tag check is made
    - Operands are fetched from the register file
- Instruction execute
  - One of three activities can occur:
    - If register-to-register operation, the ALU performs the operation
    - If a load or store, the data virtual address is calculated
    - If branch, the branch target virtual address is calculated and branch operations checked
- Data cache first
  - Virtual address is presented to the data cache and TLB
- Data cache second
  - The TLB generates the physical address and the data cache outputs the data
- Tag check
  - Cache tag checks are performed for loads and stores
- Write back
  - Instruction result is written back to register file

### SPARC (Scalable Processor Architecture)

- Architecture defined by Sun Microsystems
- Sun licenses the architecture to other vendors to produce SPARC-compatible machines
- Inspired by the Berkeley RISC I machine; its instruction set and register organization are based closely on the Berkeley RISC model


### Figure 17.12 SPARC Register Window Layout with Three Procedures






### Table 17.10 Synthesizing Other Addressing Modes with SPARC Addressing Modes

| Instruction Type     | Addressing Mode   | Algorithm    | SPARC Equivalent                  |
|----------------------|-------------------|--------------|-----------------------------------|
| Register-to-register | Immediate         | operand = A  | S2                                |
| Load, store          | Direct            | EA = A       | R<sub>0</sub> + S2                |
| Register-to-register | Register          | EA = R       | R <sub>S1</sub> , S <sub>S2</sub> |
| Load, store          | Register Indirect | EA = (R)     | R <sub>S1</sub> + 0               |
| Load, store          | Displacement      | EA = (R) + A | R <sub>S1</sub> + S2              |

*Note*: S2 = either a register operand or a 13-bit immediate operand.

(Table can be found on page 619 in the textbook.)
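
The load/store rows of Table 17.10 all reduce to one computation, EA = R<sub>S1</sub> + S2. The sketch below illustrates this under a few assumptions: a 32-register file held in a Python list, 32-bit address wrap-around, register 0 always reading as zero, and the 13-bit immediate being sign-extended. The function names are hypothetical.

```python
def sign_extend_13(value):
    """Interpret the low 13 bits of `value` as a signed two's-complement number."""
    value &= 0x1FFF
    return value - 0x2000 if value & 0x1000 else value


def effective_address(regs, rs1, s2, s2_is_immediate):
    """Compute EA = R[rs1] + S2 for a load or store.

    regs: list of 32 register values (assumed model of the visible registers).
    s2:   a register number, or a 13-bit immediate when s2_is_immediate is True.
    """
    def read(r):
        # Assumption: register 0 is hardwired to zero, whatever the list holds.
        return 0 if r == 0 else regs[r]

    operand2 = sign_extend_13(s2) if s2_is_immediate else read(s2)
    return (read(rs1) + operand2) & 0xFFFFFFFF   # assume 32-bit addresses


regs = [0] * 32
regs[5] = 0x1000
regs[6] = 0x20

direct = effective_address(regs, 0, 0x100, True)          # R0 + S2
register_indirect = effective_address(regs, 5, 0, True)   # RS1 + 0
displacement = effective_address(regs, 5, 0x1FFC, True)   # RS1 + S2, S2 = -4
indexed = effective_address(regs, 5, 6, False)            # RS1 + register S2
```

Direct addressing works only because register 0 supplies a zero base, which also limits it to addresses that fit in the 13-bit immediate. Register indirect is the same adder with a zero displacement.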




### **Processor Organization for Pipelining**

- Three more features to enhance performance are:
  - Multiple reservation stations
  - Forwarding
  - Reorder buffer
- The process of dispatching an instruction to a functional unit proceeds in two parts:
  - Issue from ID to reservation station
  - Dispatch from reservation station to FU
- The reservation station is also referred to as an *instruction window*
- Data forwarding addresses the problem of read-after-write (RAW) delays due to WB delays
  - As with the store buffer, data forwarding makes data available as soon as it is created
  - The forwarded data becomes input to the reservation stations, going to an operand field
- The reorder buffer supports out-of-order execution (OoOE)
  - OoOE is an approach to processing that allows instructions for high-performance microprocessors to begin execution as soon as their operands are ready
  - The goal of OoO processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable
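
The role of the reorder buffer can be sketched as a queue that accepts results in any order but retires them in program order. This is a minimal illustration, not the organization of any particular processor: the class `ReorderBuffer`, its method names, and the use of string tags to identify instructions are all assumptions.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: out-of-order completion, in-order commit."""

    def __init__(self):
        self._entries = deque()   # oldest instruction at the left

    def issue(self, tag):
        """Allocate an entry in program order when an instruction is issued."""
        self._entries.append({"tag": tag, "done": False, "value": None})

    def complete(self, tag, value):
        """Record a result as soon as a functional unit produces it."""
        for entry in self._entries:
            if entry["tag"] == tag:
                entry["done"] = True
                entry["value"] = value
                return
        raise KeyError(tag)

    def commit(self):
        """Retire finished instructions from the head, stopping at the
        first one that has not completed, so results become architecturally
        visible in program order."""
        retired = []
        while self._entries and self._entries[0]["done"]:
            entry = self._entries.popleft()
            retired.append((entry["tag"], entry["value"]))
        return retired


rob = ReorderBuffer()
for tag in ("i0", "i1", "i2"):
    rob.issue(tag)
rob.complete("i2", 30)           # youngest instruction finishes first
rob.complete("i1", 20)
blocked = rob.commit()           # nothing retires: i0 is still executing
rob.complete("i0", 10)
in_order = rob.commit()          # now all three retire, oldest first
```

Here `i2` and `i1` finish early because their operands were ready, which is the stall avoidance that OoOE aims for, yet nothing is committed until `i0` completes.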



### **Summary**

### Chapter 17

Reduced Instruction Set Computers (RISC)

- Instruction execution characteristics
  - Operations
  - Operands
  - Procedure calls
  - Implications
- The use of a large register file
  - Register windows
  - Global variables
  - Large register file versus cache
- Reduced instruction set architecture
  - Characteristics of RISC
  - CISC versus RISC characteristics
- RISC pipelining
  - Pipelining with regular instructions
  - Optimization of pipelining
- MIPS R4000
  - Instruction set
  - Instruction pipeline
- SPARC
  - SPARC register set
  - Instruction set
  - Instruction format
- Processor Organization for Pipelining
- CISC, RISC, and contemporary