

# Clock Tree Synthesis Tutorial and iCTS Software

Weiguo Li Minnan Normal University dawnli619215645@gmail.com

May 10-13, 2024 | Xi'an, China



#### **Overview**



#### Where's CTS







Floorplan determining the area of the Die and Core, and placing the macro cell.





**Placement** placing standard cells into legal positions to achieve the optimal design object.





**CTS** designs a clock network, ensuring clock synchronization, achieving low-power and high-performance.





**Routing** materializes all nets to keep the wirelength as short as possible and adhere to the design rules.



#### What's CTS?



- CTS, the bridge between Placement and Routing
- Achieving skew balance and minimize design resource (buffers, wirelength)

EDA

#### What's our concern





- Timing of Clock Tree
  - skew

...

- latency
- worst negative slack (WNS)
- total negative slack (TNS)

- Design Resource
  - num of buffer insertions
  - wirelength
  - capacitance
  - power/area

...

#### What's the operation in CTS

- Routing Topology
  - starting from the driver pin
  - connect all load pins
  - Steiner tree properties

- Buffering
  - decompose the net
  - reduce fanout (4  $\rightarrow$  2)
  - enhanced driving capability

> delay/wirelength/capacitance/...







#### What's we expecting



May 10-13, 2024 | Xi'an, China



#### **Overview**



#### **CTS Formula**

EDA

We define the initial distribution of **flip-flops (sinks)** in the CTS stage as S. Given a clock source (root), we can establish **clock skew** constraint as follow:

$$\Delta delay\left(s_{i}
ight) = \max_{s_{i} \in \mathcal{S}} \left\{ delay\left(s_{i}
ight) 
ight\} - \min_{s_{i} \in \mathcal{S}} \left\{ delay\left(s_{i}
ight) 
ight\} \leq skew\_bound,$$

where  $delay(s_i)$  represents the delay from clock source to sink  $s_i$  (latency). We define the set of load pins for each clock net as a **cluster**, i.e.,  $u_i \in \mathcal{U}$ , and generate a clock net, i.e.,  $n_i \in \mathcal{N}$ .





#### **CTS Formula**

• CTS needs to satisfy additional design constraints, such as the **fanout** constraint:

 $|u_j| \leq max\_fanout,$ 

the maximum **wirelength** constraint:

 $WL(n_j) \leq max\_length,$ 

and the maximum clock net **capacitance** constraint:



#### **CTS Formula**

• CTS problem can be defined as a multi-objective optimization problem:

 $egin{aligned} \min & f(\mathcal{N}, \mathcal{U}, \mathcal{S}) \ s.t. & \Delta delay\left(s_{i}
ight) \leq skew\_bound, \ & |u_{j}| \leq max\_fanout, \ & WL\left(n_{j}
ight) \leq max\_length, \ & \sum_{s_{i} \in u_{i}} Cap_{pin}(s_{i}) + c \cdot WL\left(n_{j}
ight) \leq max\_cap, \end{aligned}$ 

 $f(\mathcal{N}, \mathcal{U}, \mathcal{S})$  represents the objective function, such as **design resources** and **power**.

#### It's a complex multi-objective optimization problem



#### **Load Capacitance**



EDA



## Wire (Interconnect) Delay

• Elmore  $\pi$ -Model

EDA



## **Slew (Transition)**

• Bakoglu Metric<sup>[1]</sup> & PERI Model<sup>[2]</sup>

$$Slew_{ideal}(s,t) = \ln 9 \cdot D_{wire}(s,t),$$

$$Slew_{wire}(t) = \sqrt{Slew_{wire}^2(s) + Slew_{ideal}^2(s,t)}$$
.



## Look-up Table (LUT)

- *Delay/Slew<sub>out</sub>* = *LUT*(*Slew<sub>in</sub>*, *Cap<sub>load</sub>*)
  - upstream input slew
  - downstream load capacitance
  - not reliant on complex computational models

- Model
  - bilinear interpolation
  - ML fitting...



timing() {

"Slew" : [...],

"Cap" : [...],

## **Buffer Delay**

• Linear Fitting<sup>[3]</sup>

EDA



#### This type of modeling is only effective for clock buffers



### **Buffer Slew**

• Linear Fitting<sup>[3]</sup>

$$Slew_{out}(*) = LUT_{slew}(Slew_{in}(*), Cap_{load}(*)),$$

$$Slew_{\scriptscriptstyle out}(*) = lpha \cdot Cap_{\scriptscriptstyle load}(*) + eta.$$



- Benefit
  - sufficient accuracy
  - independence from the upstream information (under certain conditions)

#### **Clock Topology**



Eda

## Buffering

#### • Effect

- reduce fanout
- affect path delay
- enhance driving capability (transition)
- Property
  - size/area
  - power/timing
  - library (for LUT)
- Optimization
  - insertion
  - resizing







#### **Overview**



Skew Constraint Graph (SCG)

#### **CTS Components**



EDA

May 10-13, 2024 | Xi'an, China

#### iCTS Software

api













#### **Overview**





#### Bound Skew Tree (BST/DME)<sup>[9]</sup>

#### Bottom-up Stage

- merge two sub-tree
- determine merge region
- Top-down Stage
  - location embedding
  - nearest endpoint
- Benefit
  - zero/bound skew
  - topology based





**Bottom-up** 

Top-down

#### **Fundamentals of Steiner Trees**

- $PL(s_i)$ , the **path length** from the source
- *MD*(*s*<sub>*i*</sub>), the **Manhattan distance** from the source
- *WL*(*T*), the **total wirelength** of clock tree *T*
- $WL(T_{FLUTE})$ , the wirelength of FLUTE<sup>[10]</sup> tree



#### **SALT**<sup>[11]</sup>

• Shallowness (path length)

$$lpha = \max_{s_i \in \mathcal{S}} \Big\{ rac{PL\left(s_i
ight)}{MD\left(s_i
ight)} \Big\},$$

• Lightness (tree weight)

$$\beta = \frac{WL(T)}{WL(MST(G))} \approx \frac{WL(T)}{WL(T_{FLUTE})}$$



Given  $\epsilon$ , SALT realizes  $\alpha \leq 1 + \epsilon$  and  $\beta \leq 2 + \left[ log \frac{2}{\epsilon} \right]$ 

#### **Metric Mapping**





#### Skewness\*

EDA

• The skewness of the Steiner tree is defined as:



#### Skew-Latency-Load Tree (SLLT)\*

• A rectilinear Steiner tree with  $\alpha \leq \overline{\alpha}$ ,  $\beta \leq \overline{\beta}$ , and  $\gamma \leq \overline{\gamma}$ , is denoted as  $(\overline{\alpha}, \overline{\beta}, \overline{\gamma}) - SLLT$ .



| Algorithm | Max<br>PL | Min<br>PL | Total<br>WL | Mean<br>PL | α    | β    | γ    | Mean | Skew<br>Control |
|-----------|-----------|-----------|-------------|------------|------|------|------|------|-----------------|
| H-tree    | 10        | 9         | 55.5        | 9.75       | 2    | 1.32 | 1.03 | 1.45 | $\checkmark$    |
| GH-tree   | 10        | 7         | 47.5        | 8.5        | 1.6  | 1.13 | 1.18 | 1.3  | $\checkmark$    |
| ZST       | 10.5      | 10.5      | 55.5        | 10.5       | 2.33 | 1.32 | 1.00 | 1.55 | $\checkmark$    |
| BST       | 10        | 8         | 50          | 9.25       | 2.25 | 1.19 | 1.08 | 1.51 | $\checkmark$    |
| FLUTE     | 9         | 5         | 42          | 7.44       | 1.4  | 1.00 | 1.21 | 1.2  | ×               |
| R-SALT    | 9         | 5         | 43          | 7.06       | 1.00 | 1.02 | 1.27 | 1.10 | ×               |
| CBS*      | 9         | 7         | 45          | 8.13       | 1.4  | 1.07 | 1.11 | 1.19 | $\checkmark$    |

## **Concurrent BST and SALT (CBS)**\*

Bound Skew Tree (BST)



Lacks the ability to balance skew

•



#### Comparison

Table 1: Wirelength (um) comparison between R-SALT and CBS.

|               | Ģ                         | GreedyDis | st    | Gr          | reedyMer | ge    | BiPartition |       |       |  |  |
|---------------|---------------------------|-----------|-------|-------------|----------|-------|-------------|-------|-------|--|--|
| Skew (ps)     | 80                        | 10        | 5     | 80          | 10       | 5     | 80          | 10    | 5     |  |  |
| <b>R-SALT</b> | 314.4                     | 314.3     | 315.1 | 312.6       | 313.0    | 315.6 | 312.2       | 312.4 | 312.7 |  |  |
| CBS           | 306.0                     | 307.1     | 316.1 | 305.2 306.3 |          | 314.3 | 305.3       | 305.6 | 312.7 |  |  |
| Reduce        | <b>2.67% 2.29% -0.32%</b> |           | 2.37% | 2.14%       | 0.41%    | 2.21% | 2.18%       | 0.00% |       |  |  |

Table 2: Comparison on wirelength, cap and delay between BST-DME and CBS.

|                | Wir           | elength ( | (um)          |        | Cap $(fF)$ |        | Wire Delay (ps) |        |        |  |  |
|----------------|---------------|-----------|---------------|--------|------------|--------|-----------------|--------|--------|--|--|
| Skew (ps)      | 80            | 10        | 5             | 80     | 10         | 5      | 80              | 10     | 5      |  |  |
| <b>BST-DME</b> | 363.3         | 367.6     | 373.2         | 77.4   | 78.1       | 79.1   | 15.3            | 11.5   | 10.2   |  |  |
| CBS            | 305.6         | 306.1     | 314           | 67.5   | 67.6       | 68.9   | 11.2            | 9.2    | 7.4    |  |  |
| Reduce         | 15.90% 16.70% |           | <b>15.90%</b> | 12.80% | 13.50%     | 12.80% | 26.60%          | 20.50% | 26.80% |  |  |

EDA



#### **Overview**



• Special Net: Buffering between Buffers.

 $T(i,j) = D_{\scriptscriptstyle buf}(i) + D_{\scriptscriptstyle wire}(i,j) + D_{\scriptscriptstyle buf}(j),$ 

 $T'(i,j) = D'_{\mathit{buf}}(i) + D'_{\mathit{wire}}(i,k) + D'_{\mathit{buf}}(k) + D'_{\mathit{wire}}(k,j) + D'_{\mathit{buf}}(j).$ 





• Special Net: Buffering between Buffers.

EDA

$$T(i,j) - T'(i,j) = x \cdot (1-x) \cdot r \cdot c \cdot (\ln 9 \cdot \alpha + 1) \cdot L(i,j)^2 - \beta \cdot Cap_{pin} - \gamma,$$





• General Net: Buffering between Steiner Points.

 $T(i,j) = D_{wire}(i,j),$ 

 $T'(i,j) = D'_{wire}(i,k) + D'_{buf}(k) + D_{wire}(k,j).$ 





• General Net: Buffering between Steiner Points.

$$\Delta T(i,j) = \frac{1}{2} \cdot r \cdot c \cdot [(2 + \alpha \cdot \ln 9) \cdot x^2 - 2 \cdot x] \cdot L(i,j)^2$$

$$+ \{r \cdot x \cdot [(1 + \alpha \cdot \ln 9) \cdot Cap_{pin} - Cap_{load}^*(j)]$$

$$+ \beta \cdot c \cdot (1 - x) \} \cdot L(i,j)$$

$$+ \beta \cdot Cap_{load}^*(j) + \gamma$$

$$-\beta \cdot [c \cdot (1 - x) \cdot L(i,j) + Cap_{load}^*(j)] \leq 0,$$

$$Star i$$

$$Buf k$$

$$(1 - x) \cdot L(i,j)$$

$$Star j$$



• General Net: Buffering between Steiner Points.

$$\widehat{CL}_{stnr}(i,j) = 2 \sqrt{rac{eta \cdot rac{cap_{max}}{2} + \gamma}{r \cdot c \cdot (\ln 9 \cdot lpha + 1)}},$$

#### Table 3: Critical wirelength statistics.

| Cell      | <i>CL<sub>buf</sub></i> | $\widehat{CL}_{stnr}$ | <b>CL</b> <sub>stnr</sub> | x    |
|-----------|-------------------------|-----------------------|---------------------------|------|
| BUFFD-X3  | 157.043                 | 211.842               | 351.185                   | 0.27 |
| BUFFD-X4  | 146.492                 | 206.286               | 275.137                   | 0.31 |
| BUFFD-X6  | 174.212                 | 242.255               | 279.531                   | 0.35 |
| BUFFD-X8  | 187.388                 | 257.016               | 270.459                   | 0.38 |
| BUFFD-X12 | 195.638                 | 269.881               | 254.932                   | 0.41 |
| BUFFD-X16 | 211.98                  | 287.216               | 258.906                   | 0.42 |

#### **Insertion Delay Estimation Model\***

• Estimate the contribution of downstream load capacitance to delay:

IIIT(C

1 1

$$aelay_{est} = LOI(Cap_{load}, Slew_{in}) - \omega_s \cdot Slew_{in}$$

 $\mathbf{C}\mathbf{I}$ 

**C1** 

 $\approx \omega_c \cdot Cap_{load} + \omega_{inherent}$ 





#### **Overview**



#### **Hierarchical Framework\***





#### **Overview**



#### **Comparison of Open Source Test Cases**

Table 4: Comparison between clock tree solutions from ours, commercial tool, and OpenROAD.

| 6        | Latency (ps) |       |       | Skew (ps) |       |       | #Buffers |       |       | Buf Area ( $\mu m^2$ ) |       |       | Clk Cap ( <i>fF</i> ) |       |       | Clk WL (µm) |         |         |
|----------|--------------|-------|-------|-----------|-------|-------|----------|-------|-------|------------------------|-------|-------|-----------------------|-------|-------|-------------|---------|---------|
| Case     | Ours         | Com.  | OR.   | Ours      | Com.  | OR.   | Ours     | Com.  | OR.   | Ours                   | Com.  | OR.   | Ours                  | Com.  | OR.   | Ours        | Com.    | OR.     |
| s38584   | 71           | 76    | 93    | 13        | 8     | 17    | 43       | 43    | 45    | 26.6                   | 26.8  | 39.7  | 904                   | 1019  | 1087  | 3382.0      | 3366.5  | 3478.9  |
| s38417   | 75           | 84    | 94    | 10        | 19    | 16    | 53       | 54    | 55    | 32.4                   | 33.6  | 48.5  | 1083                  | 1235  | 1284  | 3755.6      | 3867.4  | 3870.5  |
| s35932   | 80           | 81    | 100   | 13        | 10    | 17    | 58       | 59    | 64    | 35.5                   | 36.5  | 56.4  | 1217                  | 1380  | 1433  | 4380.0      | 4420.9  | 4449.4  |
| salsa20  | 82           | 87    | 112   | 19        | 21    | 29    | 81       | 83    | 109   | 49.6                   | 51.8  | 96.1  | 1715                  | 2050  | 2160  | 6446.9      | 6580.9  | 6863.5  |
| ethernet | 97           | 104   | 159   | 34        | 31    | 51    | 337      | 352   | 455   | 315.5                  | 320.4 | 408.9 | 7314                  | 8823  | 9210  | 26113.9     | 26105.5 | 27248.8 |
| vga_lcd  | 134          | 146   | 206   | 41        | 49    | 92    | 575      | 597   | 775   | 416.9                  | 451.8 | 812.1 | 12380                 | 14920 | 15815 | 46763.1     | 45969.8 | 47484.1 |
| Avg.     | 1.000        | 1.072 | 1.417 | 1.000     | 1.062 | 1.708 | 1.000    | 1.036 | 1.310 | 1.000                  | 1.051 | 1.668 | 1.000                 | 1.196 | 1.259 | 1.000       | 0.994   | 1.028   |

### **Comparison of ysyx designs**

Table 5: Comparison of ysyx designs among clock tree solutions from ours, commercial tool, and OpenROAD.

| Case   | Latency (ps) |       |       | Skew (ps) |      |       | #Buffers |       |       | Buf Area $(\mu m^2)$ |       |        | Clk Cap (fF) |       |       | Clk WL (µm) |          |          |  |
|--------|--------------|-------|-------|-----------|------|-------|----------|-------|-------|----------------------|-------|--------|--------------|-------|-------|-------------|----------|----------|--|
|        | Ours         | Com.  | OR.   | Ours      | Com. | OR.   | Ours     | Com.  | OR.   | Ours                 | Com.  | OR.    | Ours         | Com.  | OR.   | Ours        | Com.     | OR.      |  |
| ysyx_0 | 113          | 120   | 156   | 31        | 15   | 63    | 639      | 654   | 799   | 550.4                | 561.5 | 1719.0 | 29032        | 31662 | 17130 | 42698.49    | 42935.09 | 45389.69 |  |
| ysyx_1 | 118          | 122   | 206   | 29        | 12   | 110   | 656      | 674   | 863   | 568.4                | 568.3 | 1896.4 | 29455        | 32204 | 17907 | 44538.49    | 45142.38 | 48113.24 |  |
| ysyx_2 | 140          | 143   | 191   | 38        | 16   | 68    | 943      | 958   | 1113  | 801.2                | 814.6 | 2401.7 | 35272        | 39256 | 25491 | 64884.11    | 64901.76 | 69753.65 |  |
| ysyx_3 | 144          | 139   | 193   | 36        | 16   | 59    | 798      | 808   | 914   | 671.8                | 689.0 | 1970.4 | 32483        | 35830 | 21484 | 57229.57    | 56918.98 | 59314.99 |  |
| Avg.   | 1.000        | 1.017 | 1.449 | 1.000     | 0.44 | 2.239 | 1.000    | 1.019 | 1.215 | 1.000                | 1.016 | 3.082  | 1.000        | 1.101 | 0.65  | 1.000       | 1.003    | 1.063    |  |



#### **Topology & Routing**



s38417 Topology

s38417 Routing



vga\_enh\_top Topology



vga\_enh\_top Routing



#### **Overview**





#### Future

#### • Software

- Clock Topology
- Community
- iCTS API
- Flow
  - Timing Representation
  - Clock Routing
  - Buffering Optimization
- Technology
  - DSE
  - CCOpt
  - AI for CTS





#### Reference

[1] Bakoglu H B. Circuits, interconnections, and packaging for VLSI[J]. (No Title), 1990.

[2] Kashyap C V, Alpert C J, Liu F, et al. Closed-form expressions for extending step delay and slew metrics to ramp inputs for RC trees[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2004, 23(4): 509-516.

[3] Sitik C, Lerner S, Taskin B. Timing characterization of clock buffers for clock tree synthesis[C]. 2014 IEEE 32nd International Conference on Computer Design (ICCD), 2014: 230-236.

[4] Bakoglu H B. Circuits, interconnections, and packaging for VLSI[J]. 1990.

[5] Boese K D, Kahng A B. Zero-skew clock routing trees with minimum wirelength[C]. [1992] Proceedings. Fifth Annual IEEE International ASIC Conference and Exhibit, 1992: 17-21.

[6] Andreev A, Nikishin A, Gribok S, et al. Clock network fishbone architecture for a structured ASIC manufactured on a 28 NM CMOS process lithographic node: Google Patents, 2014.

[7] Chakrabarti P, Bhatt V, Hill D, et al. Clock mesh framework[C]. Thirteenth International Symposium on Quality Electronic Design (ISQED), 2012: 424-431.

[8] Abdelhadi A, Ginosar R, Kolodny A, et al. Timing-driven variation-aware synthesis of hybrid mesh/tree clock distribution networks[J]. Integration, 2013, 46(4): 382-391.

[9] Cong J, Kahng A B, Koh C-K, et al. Bounded-skew clock and Steiner routing[J]. ACM Transactions on Design Automation of Electronic Systems (TODAES), 1998, 3(3): 341-388.

[10] Chu C, Wong Y-C. FLUTE: Fast lookup table based rectilinear Steiner minimal tree algorithm for VLSI design[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2007, 27(1): 70-83.

[11] Chen G, Young E F. Salt: provably good routing topology by a novel steiner shallow-light tree algorithm[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2019, 39(6): 1217-1230.



# Thanks

