Timing speculative (TS) architecture is promising for improving the energy efficiency of microprocessors. Error recovery units, designed for tolerating occasional timing errors, have been used to support a wider range of voltage scaling, therefore to achieve a better energy efficiency. More specifically, the timing error rate, influenced mainly by data forwarding, is the bottleneck for voltage down-scaling in TS processors. In this paper, a new Timing Error Aware Register Allocation method is proposed. First, we designed the Dependency aware Interference Graph (DIG) construction to get the information of Read after Write (RAW) in programs. To build the construction, we get the disassemble code as input and suppose that there are unlimited registers, the same way as so-called virtual registers in many compilers. Then we change the disassemble codes to the SSA form for each basic block to make sure the registers are defined only once. Based on the DIG construction, registers were reallocated to eliminate the timing error, by loosening the RAW dependencies. We construct the DIG for each function of the program and sort the edge of DIG by an increasing weight order. Since a smaller weighted-edge value means that its owner nodes have more frequent access in instruction flows, we expect it in different registers with no read-write dependency. At the same time, we make sure that there are no additional new spill codes emerging in our algorithm to minimize the rate of spill code. A high rate of spill code will not only decrease the performance of the system but also increase the unexpected read-write dependency. Next, we reallocate the registers by weight order in turn to loosen the RAW dependencies. Furthermore, we use the NOP operation to pad the instructions with a minimal distance value of 2. Experiment results showed that the average distance of RAW dependencies was increased by over 20%.
Energy efficiency is an important issue. A number of works in different aspects have been completed [
Compiler is an important system software and numerous optimization schemes are employed to generate optimized code for different architectures. In our previous work [
In recent years, an abundant amount of work has been done to improve the performance of TS processors by reducing timing errors in hardware architecture. Ernst et al. [
There is also research from the software side that considers optimization for TS processors such as code transformation at the compiler level. Meixer et al. [
In this paper, we propose a novel timing error aware register allocation method to reduce the energy consummation of TS architecture by reducing the distance of read-write dependencies: First, a heuristic register allocation method was designed to reschedule registers based on the distance of the read-write dependencies in each basic block. Then, an inter-block optimization algorithm was implemented to further increase the distance of read-write dependencies by considering the dependency between basic blocks. Experiment results showed that the proposed technique could increase the distance of read-write dependencies by 20% on average. Timing error could thus be reduced and an average energy saving ofapproximately15% could be achieved.
The primary contributions of the paper are as follows: Analyzes the distribution of RAW dependencies of the whole program and designed the Dependency aware Interference Graph (DIG) construction to express the relationship of RAW. Designs a heuristic register allocation method to break and increase the distance the RAW dependencies. As a result, the average distance of RAW is increased significantly. Implements the algorithm based on the GCC compiler. Experimental results verified that the average distance of RAW can be increased significantly. Error rate reduction leads to significantly energy saving by further voltage scaling.
The remainder of the paper is structured as follows: In Section 2, we introduce background information and a motivation example about the RAW reduction algorithm firstly. Then, the RAW dependencies optimization algorithm is described in detail. In Section 3, we present the evaluation methodology and the experimental results. The related work and conclusion are presented in Section 4 and Section 5, respectively.
Timing Speculation (TS) processor can improve the energy by more voltage scaling supporting. Razor [
In cycle 2 in
If an error occurs in pipeline stage L1 in a particular clock cycle, the data in L2 in the following clock cycle is incorrect and must be flushed from the pipeline using one of the pipeline control methods such as clock gating, counter flow pipelining, and so on. However, since the shadow latch contains the correct output data of pipeline stage L1, the instruction does not need to be re-executed through this failing stage. Thus, a key feature of Razor is that if an instruction fails in a particular pipeline stage it is re-executed through the following pipeline stage, while incurring a one cycle penalty. The most of the proposed approaches therefore guarantee forward progress of a failing instruction, which is essential to avoid the perpetual failure of an instruction at a particular stage in the pipeline.
In addition to invalidating the data in the following pipeline stage, an error must also stall the preceding pipeline stages while the shadow latch data is restored into the main flip-flops. A number of different methods, such as clock gating or flushing the instruction in the preceding stages, were examined to accomplish this.
TS architecture is a promising concept for microprocessor energy efficiency. By using recovery mechanisms to tolerate the occasionally occurring timing errors, it can work in a lower supply voltage to obtain considerable energy saving. Recovery mechanisms are usually energy demanding in TS processors. Minimizing timing errors can significantly improve the energy efficiency of TS processor. As shown in the reference Sartori et al. [
Read-write dependencies area common dependency in programs that break the pipelining and causes data hazard. Let instruction distance
The Read-write dependency will occur if and only if
In order to achieve aggressive optimization for the whole program such as the library of system, we make our optimization process get the disassemble codes and the profiling results as inputs. Then
From this outline, we can see that the kernel of the optimization is
The goal of this work is to loosen the dependency of read-write operations in the TS processor. Therefore, the more frequently-accessed patterns of register pairs there are, the more important the register pairs are. For better illustration of the dependency accesses frequency feature, combined with the register allocation, we attempt to enhance the original Interference Graph that is widely used for register allocation and we construct the new dependency aware interference graph, called
For building the
Input: source disassemble codes, S; | |
Output: The profiling results, |
|
1: | |
2: | for each v ϵ |
3: | Translate v to SSA form |
4: | |
5: | |
6: | |
7: | for each v ϵ |
8: | |
9: | for each |
10: | |
11: | |
12: | |
13: | |
14: | |
15: | |
16: | for each |
17: | |
18: | |
19: | |
20: | |
21: | |
22: | return |
To reduce the cost of a spill node is complex work because it changes the source order of instruction by inserting extra spill codes that will make the
Based on the above
Then, we analyze the ordered edges one by one to finish the register allocation for each node (line 2-36). For each edge
the available registers R={r0,r1, •••,rn} | |
1: | E' := sort EN by decreased order in |
2: | |
3: | e(u,v) :=pop the first element of E' |
4: | |
5: | |
6: | get the register rj ≠ ri |
7: | |
8: | |
9: | |
10: | rj := rk where rk ϵ R ' , e (n,u) |
11: | |
12: | |
13: | assign the two nodes for original regisers. |
14: | |
15: | |
16: | ri:= get the random register that node v can be used. |
17: | |
18: | get the register |
19: | |
20: | |
21: | |
22: | |
23: | assign as line 5 – 7 |
24: | |
25: | assign as line 15 – 19 |
26: | |
27: | |
28: | |
29: |
To further increase the distance of read-write dependency, we use the
Benchmarks from each category in Mibench [
In the experiments, the GCC compiler is modified by first obtaining the profiling information with O2 optimization option. Then, we disassemble and get profile information for the binary codes respectively. After getting the disassemble codes and profile information, we use them as input for the DIGRR processor to get the timing error aware optimized binary codes. Finally, we compare the source binary codes with the optimized binary codes to evaluate the performance of DIGRR from four aspects: the average distance of read-write dependency, the minimal distance of read-write dependency, the cost to get the minimal distance of read-write dependency, and the energy saving values.
To evaluate the effectiveness of the proposed algorithm in overall, we compared the average distance of the read-write dependency before and after
The experiment result shows that we achieve, on average, up to 20% read-write reduction compared to the original
To evaluate the worst situation of our proposed algorithm, we compared the minimal distance value and its occurrence times obtained under
From the table, we can see that the minimum distance value of read-write dependency in
It is obvious that the number of occurrence times for the minimum distance of read-write dependency is significantly reduced (
Benchmarks | Min | Min | CMin | CMin |
---|---|---|---|---|
Basicmatch_large | 3 | 4 | 421 | 241 |
Basicmatch_small | 2 | 5 | 334 | 220 |
Bitcnts_large | 1 | 4 | 507 | 295 |
Bitcnts_small | 2 | 3 | 777 | 59 |
qsort_large | 2 | 5 | 307 | 143 |
qsort_small | 1 | 3 | 648 | 287 |
susan_s_large | 2 | 5 | 588 | 211 |
susan_e_large | 2 | 3 | 427 | 129 |
susan_c_large | 2 | 3 | 143 | 87 |
susan_s_small | 3 | 4 | 861 | 265 |
susan_e_small | 1 | 5 | 980 | 108 |
susan_c_small | 2 | 3 | 371 | 138 |
jpeg-6a/cjpeg_large | 1 | 3 | 786 | 270 |
jpeg-6a/djpeg_large | 3 | 4 | 783 | 219 |
jpeg-6a/cjpeg_small | 2 | 3 | 740 | 76 |
jpeg-6a/djpeg_small | 2 | 5 | 349 | 121 |
lame3.70/lame_large | 2 | 3 | 649 | 174 |
lame3.70/lame_small | 3 | 5 | 99 | 96 |
dijkstra_large | 1 | 4 | 636 | 224 |
dijkstra_small | 1 | 5 | 437 | 299 |
patricia_large | 3 | 4 | 829 | 124 |
patricia_small | 2 | 4 | 439 | 77 |
ispell | 1 | 4 | 919 | 258 |
search_large | 1 | 4 | 208 | 255 |
blowfish_en | 3 | 4 | 879 | 248 |
blowfish_de | 2 | 5 | 901 | 202 |
rijndael_en | 1 | 5 | 196 | 166 |
rijndael_de | 3 | 5 | 314 | 86 |
sha | 1 | 4 | 942 | 251 |
bin/rawcaudio(adpcm_c) | 2 | 3 | 245 | 257 |
bin/rawcaudio(adpcm_d) | 3 | 4 | 433 | 117 |
crc | 3 | 3 | 977 | 269 |
fft_i | 3 | 3 | 294 | 71 |
fft | 1 | 4 | 146 | 296 |
bin/toast | 3 | 4 | 488 | 132 |
bin/ untoast | 3 | 5 | 245 | 167 |
average | 2 | 4 | 536 | 184 |
In order to increase the distance of read-write dependency, the
From the experiment results, we can see that the timing cost and the space cost are both lowered. They are both about 0.5% on average. Therefore, the cost impact of our algorithm on the performance and size of the program is very small.
In order to evaluate the impact of this method on the energy consumption of TS architecture processors, a comparative analysis of the code generated by
The experiment results demonstrate that our new register allocation method effectively increases the average raw dependent distance by making a lot of optimized, speculative executions allowing correct results, and greatly reduces the energy consumption wasted by speculation errors. Therefore, we have a very good auxiliary solution for TS architecture processors, especially for applications attempting a low energy consumption.
TS architecture is a promising trend for microprocessor development to improve the energy efficiency. Reducing the read-write dependencies can reduce timing error which in turn make more voltage down-scaling for TS processors. Optimized voltage scaling is beneficial for reducing the energy of processors. In this paper, we provide a