#### Optimal ILP and Register Tiling: Analytical Model and Optimization Framework

Lakshminarayanan Renganarayana, Upadrasta Ramakrishna, Sanjay Rajopadhye

Computer Science Department, Colorado State University

#### Overview

- ILP and register reuse
- Execution time and register pressure functions
- Optimal ILP and register tiling problem
- Optimal tiling problem as convex opt. problem
- Validation
- Related work
- Conclusions & Future work

# **ILP and Register Reuse**

- Loop programs
  - dominate application execution time
  - main sources of ILP and register reuse
- Transformations
  - expose / exploit ILP
  - enable register reuse
- These transformations interact in subtle ways
- ILP - register reuse tradeoff?

## **ILP - Register Reuse Tradeoff**

- Optimal combination of transformations
- Quantification of interactions
- A mathematical model
  - to study the interactions
  - to choose the optimal trans. parameters
- To the best of our knowledge, no such model has been studied

#### Contributions

- Cost model with trans. params. as variables
  - closed forms: execution time & register pressure
- Convex optimization problem formulation
- A globally optimal solution
- First such formulation & optimal solution

# **Exposing and Exploiting ILP**

- Exposing ILP
  - Unroll and Jam
  - Loop permutation or skewing
  - Multi-dimensional scheduling
- Exploiting ILP
  - DAG schedulers
  - Software pipelining

## **Exposing ILP with Unroll and Jam**
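
The figure for this slide is not available. As a rough stand-in, here is a minimal sketch of unroll and jam on a matrix-vector product, assuming the outer loop is parallel and the inner loop carries the accumulation. The function names and the unroll factor of 2 are illustrative choices, not taken from the talk.

```python
def matvec(a, b):
    """Original loop nest: the inner j loop carries a dependence on y[i]."""
    n, m = len(a), len(b)
    y = [0.0] * n
    for i in range(n):
        for j in range(m):
            y[i] += a[i][j] * b[j]
    return y


def matvec_unroll_jam(a, b):
    """Outer loop unrolled by 2, and the two copies of the inner loop jammed."""
    n, m = len(a), len(b)
    y = [0.0] * n
    for i in range(0, n - 1, 2):
        y0 = y1 = 0.0              # accumulators held in scalars
        for j in range(m):
            bj = b[j]              # loaded once, reused by both statements
            y0 += a[i][j] * bj     # these two updates are independent,
            y1 += a[i + 1][j] * bj  # so they can be issued in parallel
        y[i], y[i + 1] = y0, y1
    if n % 2:                      # remainder iteration when n is odd
        i = n - 1
        for j in range(m):
            y[i] += a[i][j] * b[j]
    return y
```

After the transformation the inner loop body contains two independent multiply-adds, and `b[j]` is reused across them.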



#### **Exposing ILP with Permutation**
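
The figure for this slide is not available. A minimal sketch of the idea, assuming a loop nest whose only dependence is carried by the inner loop: interchanging the loops moves the dependence-free loop innermost, so consecutive innermost iterations become independent. The example loop and function names are hypothetical.

```python
def row_prefix_sums(a):
    """Original order: the innermost j loop carries the dependence (0, 1)."""
    n, m = len(a), len(a[0])
    out = [row[:] for row in a]
    for i in range(n):
        for j in range(1, m):
            out[i][j] = out[i][j - 1] + a[i][j]
    return out


def row_prefix_sums_permuted(a):
    """Loops interchanged: the dependence is now carried by the outer j loop,
    so the iterations of the innermost i loop are independent of each other."""
    n, m = len(a), len(a[0])
    out = [row[:] for row in a]
    for j in range(1, m):
        for i in range(n):
            out[i][j] = out[i][j - 1] + a[i][j]
    return out
```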



October 21, 2005

LCPC '05

### **Exposing ILP with Skewing**
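
The figure for this slide is not available. A minimal sketch, using the stencil $A[i_1, i_2] = A[i_1-1, i_2] + A[i_1, i_2-1]$ from the register tiling slide: both loops carry a dependence, so permutation alone cannot expose a parallel loop, but skewing to wavefronts $w = i_1 + i_2$ can. The boundary values (all 1) and the function names are assumptions made for illustration.

```python
def stencil(n):
    """Original order: dependences (1, 0) and (0, 1), so neither loop is parallel."""
    A = [[1] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A


def stencil_skewed(n):
    """Skewed order: the outer loop walks wavefronts w = i + j.
    All points on one wavefront depend only on wavefront w - 1,
    so the iterations of the inner loop are independent."""
    A = [[1] * (n + 1) for _ in range(n + 1)]
    for w in range(2, 2 * n + 1):
        for i in range(max(1, w - n), min(n, w - 1) + 1):
            j = w - i
            A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A
```

Note that the skewed iteration space is no longer rectangular: the inner loop bounds depend on `w`.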




## **Register Reuse**

- Unroll and Jam  $\rightarrow$  Scalar replacement
  - scalar replacement enables register placement
  - classic register allocators are sufficient
- Loop tiling  $\rightarrow$  array register allocation
  - registers allocated to array values
  - no code size increase
  - requires an array register allocator

### **Scalar Replacement**
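
The figure for this slide is not available. A minimal sketch of scalar replacement, assuming a loop that reads `b[i]` and `b[i-1]`: the value read as `b[i]` in one iteration is kept in a scalar and reused as `b[i-1]` in the next, so a classic register allocator can keep it in a register. The loop and names are hypothetical.

```python
def neighbour_sums(b):
    """Original loop: two array reads per iteration."""
    n = len(b)
    a = [0] * n
    for i in range(1, n):
        a[i] = b[i] + b[i - 1]
    return a


def neighbour_sums_scalar_replaced(b):
    """After scalar replacement: one array read per iteration."""
    n = len(b)
    a = [0] * n
    if n == 0:
        return a
    prev = b[0]               # holds b[i-1]
    for i in range(1, n):
        cur = b[i]            # the only array read in the loop body
        a[i] = cur + prev
        prev = cur            # carried to the next iteration in a scalar
    return a
```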



#### Which array references to scalar replace?



# **Register tiling**



for  $i_1 = 1$  to 6 for  $i_2 = 1$  to 6  $A[i_1, i_2] = A[i_1-1, i_2] + A[i_1, i_2-1]$ 

3x3 register tile

Tile sizes:

- affect load/store savings
- are constrained by the number of registers

How to choose the tile sizes?
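
To make the effect of tile sizes concrete, here is a minimal sketch that runs the 6x6 loop above in rectangular tiles and counts how many operand reads come from outside the current tile (values that would have to be loaded rather than found in registers). The boundary values (all 1), the counting scheme, and the function name are assumptions for illustration, not the load/store function used in the talk.

```python
def tiled_stencil(n=6, t1=3, t2=3):
    """Run A[i][j] = A[i-1][j] + A[i][j-1] tile by tile.

    Returns the array and the number of operand reads whose source
    point lies outside the tile being executed.
    """
    A = [[1] * (n + 1) for _ in range(n + 1)]
    outside_reads = 0
    for ii in range(1, n + 1, t1):            # loops over tiles
        for jj in range(1, n + 1, t2):
            for i in range(ii, min(ii + t1, n + 1)):   # loops inside a tile
                for j in range(jj, min(jj + t2, n + 1)):
                    # A[i-1][j] is outside the tile on the tile's first row,
                    # A[i][j-1] is outside the tile on its first column.
                    outside_reads += (i - 1 < ii) + (j - 1 < jj)
                    A[i][j] = A[i - 1][j] + A[i][j - 1]
    return A, outside_reads
```

With 3x3 tiles, 24 of the 72 operand reads cross a tile boundary; with 1x1 tiles (no tiling) all 72 do.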

#### **Traditional vs. Our Approach**



### **Program, Tiling, and Architecture Class**

- Input loops:
  - perfectly nested, rectangular loops
  - uniform dependence bodies
- Rectangular tiling
  - we assume: input loop nest admits rectangular tiling
- ILP exposed by: permutation or skewing
- Architectures: superscalar or VLIW

#### **Execution Time**

#### (When permutation exposes ILP)



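$$
\begin{aligned}
T &= (\mathit{ntiles} \times \mathit{tile\_cost}) + \mathit{loop\_overhead} \\
\mathit{tile\_cost} &= \max(\mathit{comp\_cost},\ \mathit{load\_store\_cost}) \\
\mathit{comp\_cost} &= \alpha \times \mathit{tile\_vol} \\
\mathit{load\_store\_cost} &= \beta \times LS(t, D) \\
\mathit{loop\_overhead} &= \eta \times LO(1, N) \\
\mathit{ntiles} &= \frac{N_1}{t_1} \times \cdots \times \frac{N_n}{t_n}
\end{aligned}
$$

where

- $t$ = vector of tile sizes
- $N$ = vector of iteration space sizes
- $D$ = dependence matrix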
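
The model above can be evaluated directly once the load/store and loop-overhead functions are known. The sketch below is a minimal transcription under stated assumptions: the slides do not give closed forms for LS and LO, so they are passed in as callables `ls` and `lo` (their signatures are hypothetical), and the tile volume is taken to be the product of the tile sizes, which holds for rectangular tiles.

```python
from math import prod


def execution_time(t, N, alpha, beta, eta, ls, lo):
    """Evaluate T = ntiles * tile_cost + loop_overhead.

    t, N   : tile sizes and iteration space sizes, one entry per loop
    ls, lo : caller-supplied stand-ins for LS and LO (assumed signatures)
    """
    ntiles = prod(n / ti for n, ti in zip(N, t))
    tile_vol = prod(t)                      # assumption: rectangular tile
    comp_cost = alpha * tile_vol
    load_store_cost = beta * ls(t)
    tile_cost = max(comp_cost, load_store_cost)
    loop_overhead = eta * lo(t, N)
    return ntiles * tile_cost + loop_overhead


# Hypothetical placeholders, for illustration only.
def example_ls(t):
    return sum(t)                           # values crossing the tile boundary


def example_lo(t, N):
    return prod(n / ti for n, ti in zip(N, t))   # one unit per tile
```

For instance, `execution_time((3, 3), (6, 6), 1.0, 2.0, 0.5, example_ls, example_lo)` evaluates a 6x6 iteration space cut into four 3x3 tiles; with these placeholder costs the load/store term dominates the tile cost.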


#### **Execution Time Model**

#### (When permutation cannot expose ILP: skew)

#### Skewing affects

- iteration space shape: makes counting the partial tiles, the full tiles, and the total number of tiles hard.
- dependence lengths: affect the amount of data loaded / stored in a tile.



#### **Optimal ILP and Register Tiling: Optimization Problem Formulation**

$$\text{minimize } \mathit{TotalExecutionTime}(t, S) \quad \text{subject to } \mathit{LoadStoreVolume}(t, S) \leq \mathit{Registers}$$

For a fixed skew **S** 

- $\checkmark$ $t$ is the only variable
- $\checkmark$ the optimization problem reduces to an integer convex optimization problem
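
For very small instances the fixed-skew problem can be sanity-checked by exhaustive search over integer tile sizes. The sketch below does exactly that; it is a brute-force stand-in, not the convex-optimization approach of the talk. The cost and register-footprint functions, the constants, and all names are hypothetical, and the skew is fixed to the identity.

```python
from itertools import product


def best_tile_sizes(N, registers, cost, footprint):
    """Return (cost, t) minimizing cost(t) over integer tile sizes
    1 <= t[k] <= N[k] that satisfy footprint(t) <= registers."""
    best = None
    for t in product(*(range(1, n + 1) for n in N)):
        if footprint(t) > registers:
            continue                       # violates the register constraint
        c = cost(t)
        if best is None or c < best[0]:
            best = (c, t)
    return best


# Hypothetical 2-D instance, for illustration only.
N = (12, 12)
ALPHA, BETA = 1.0, 2.0


def footprint(t):
    return t[0] + t[1]                     # assumed register pressure


def cost(t):
    ntiles = (N[0] / t[0]) * (N[1] / t[1])
    return ntiles * max(ALPHA * t[0] * t[1], BETA * footprint(t))


result = best_tile_sizes(N, 8, cost, footprint)
```

With 8 registers this toy instance selects 4x4 tiles. The search space grows as the product of the loop bounds, which is why a solver-based formulation is preferable for realistic sizes.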

# **Solution Steps**

#### **Can permutation expose a parallel loop?**

#### Yes!

- No skewing, only tiling
  - Fix S=I in opt. prob.
- Solve for optimal tile sizes
- Single integer convex opt. problem.

#### No!

- Construct the set Γ of valid skews
- For each element in Γ solve the fixed skew optimization problem
- Pick the best
- Only *d(d-1)* problems

## **Solving for Optimal Tile Sizes**

- The optimization problem for tile sizes is an Integer Geometric Program (à la Integer Linear Programs)
- GPs can be transformed into convex opt. probs.
- Standard solvers are available
- Running time:
  - depends on #vars & #constraints
  - a few seconds (< 10 secs.)
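
For reference, the standard transformation of a geometric program into a convex problem can be written out as follows. The notation ($x$, $f_i$, $c_{ik}$, $a_{ik}$, $y$) is generic textbook notation and does not come from the slides; in this setting the variables $x$ would be the tile sizes $t$.

A geometric program in standard form has posynomial objective and constraints:

$$
\min_{x > 0}\ f_0(x) \quad \text{subject to} \quad f_i(x) \leq 1,\ i = 1, \dots, m, \qquad f_i(x) = \sum_k c_{ik} \prod_j x_j^{a_{ikj}},\quad c_{ik} > 0
$$

Substituting $x_j = e^{y_j}$ and taking logarithms of the objective and constraints gives

$$
\min_{y}\ \log \sum_k \exp\!\left(a_{0k}^{T} y + \log c_{0k}\right) \quad \text{subject to} \quad \log \sum_k \exp\!\left(a_{ik}^{T} y + \log c_{ik}\right) \leq 0,\ i = 1, \dots, m
$$

Each function is a log-sum-exp of affine functions of $y$, which is convex, so the relaxed (non-integer) problem can be handed to a convex solver. The integrality requirement on the tile sizes is not handled by this change of variables and has to be treated separately.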

### Validation

- Experimental validation requires
  - array register allocator
  - architectural support (like rotating registers)
- Similar model used for finding optimal unroll factor
  - optimal unroll factors can be found with small tweaks
- In tiling for memory hierarchy
  - we have successfully used a similar model
  - almost all the cost models used by other researchers can be cast into our GP framework [RR-SC04]

### **Related Work**

- Unroll and Jam approach
  - [Callahan et al.-90], [Carr-Kennedy-94], [Sarkar-01]
- Hierarchical tiling
  - [Carter et al.-95], [Mitchell et al.-98]
- Software pipelining of loop nests
  - [Ramanujam-94], [Rong et al.-04], [Rong et al.-05]
- Code generation for register tiling
  - [Jimenez et al.-02], [Sarkar-01]

#### **Conclusions & Future Work**

- A mathematical formulation of the combined ILP and register tiling problem.
- A globally optimal solution.
- Future work:
  - adapting modulo schedulers to pipeline skewed loops
  - developing an array register allocator
  - experimental validation on benchmarks