RADSClassFall06/rampdisk
CS294 RAD Class Project
Disk and Thermal Emulation in Data Center using RAMP
Zhangxi Tan
Final Presentation
Research Abstract
Introduction
As data centers grow in size and popularity, classic performance and power problems become harder and harder to solve. Even worse, as the number of nodes grows, these problems become more difficult to observe in such a large-scale distributed system. In this project, we look at a hardware emulation technique that helps us study computer systems in a data center environment with realistic workloads and well-controlled experiments. Our approach is an FPGA-based hardware solution that exploits the RAMP (Research Accelerator for Multiple Processors) infrastructure. With RAMP, we implement softcore processors on an FPGA and run a real OS. This gives us better timing accuracy and more repeatable results.
Originally, RAMP targeted processor emulation. In our work, we show further features it can offer:
- Local storage emulation
Local storage, or disk, is an essential part of a data center. Disk emulation is mainly composed of two parts: storage emulation and timing emulation. In virtual machine technology, a disk is often emulated as a file or a specially formatted partition on the host system. However, this cannot provide correct timing to the guest OS and does not reflect the disk's physical specifications (e.g. rotation speed, seek time). In our work, we offer both emulation features and allow integration with a real OS workload.
- System temperature emulation
Since power has become a critical issue in data centers, temperatures in the data center greatly affect cooling system design. In addition, how temperature changes under different applications and power-saving techniques is an interesting topic in its own right. Our system emulates temperature by predicting it from system utilization.
- Time dilation
One drawback of implementing softcore processors on an FPGA is that they run far slower than their ASIC counterparts. On our hardware, the processor can only run at 50 MHz, whereas machines in a modern data center usually have 2 GHz processors. The time dilation we developed makes a 50 MHz processor on the FPGA “look like” a 2 GHz processor from the viewpoint of the target system. Emulated software on the target runs as fast as expected in emulated time but slowly in terms of wall clock time.
Methodology
The system architecture is shown in the following figure. We put one LEON3 processor in the FPGA on a Xilinx XUP board running at 50 MHz with a 90 MHz DDR memory interface. LEON3 is a 32-bit SPARC V8 compatible processor that can run full versions of operating systems. In our experiments, we run a Linux 2.6.11 kernel with a couple of applications from the standard Debian distribution.
The hardware system does not come with physical storage, so we emulate local storage by combining Ethernet-attached storage with an analytical disk simulator. More specifically, the Ethernet-attached storage is a modified ATA over Ethernet (AoE) [1] that interfaces with a widely used disk simulator called DiskSim [2]. AoE is implemented as a block device in the OS. It captures all ATA commands and encapsulates them in Ethernet packets. On the remote side, an AoE parser running on a PC services these ATA requests. The parser sends each request to both the physical storage and DiskSim. The former performs the actual access, while the latter supplies the timing information. The result is returned to the emulated target once the access is complete in terms of target execution time. In other words, the emulated time on the PC (DiskSim and storage) is synchronized with the target processor time.
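The two-path servicing described above can be illustrated with a minimal Python sketch. It is not the project's parser: the class name `AoEParserSketch`, the in-memory `store` dictionary standing in for physical storage, and the `timing_model` callable standing in for DiskSim are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

SECTOR_SIZE = 512  # bytes per ATA sector


@dataclass
class Completion:
    data: bytes               # payload read from the backing store (empty for writes)
    ready_at_target_s: float  # target time at which the reply may be delivered


class AoEParserSketch:
    """Hypothetical stand-in for the PC-side AoE parser (illustration only)."""

    def __init__(self, timing_model):
        self.store = {}                   # sector number -> bytes, stands in for physical storage
        self.timing_model = timing_model  # stands in for DiskSim: (lba, count, is_write) -> seconds

    def service(self, issue_target_s, lba, count, write_data=None):
        is_write = write_data is not None
        # Path 1: perform the actual access on the backing store.
        if is_write:
            for i in range(count):
                self.store[lba + i] = write_data[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE]
            data = b""
        else:
            blank = bytes(SECTOR_SIZE)
            data = b"".join(self.store.get(lba + i, blank) for i in range(count))
        # Path 2: ask the timing model how long the emulated disk would take.
        latency_s = self.timing_model(lba, count, is_write)
        # The reply is held until this point in target time.
        return Completion(data, issue_target_s + latency_s)


# Usage with a constant 2.8 ms latency as a placeholder timing model.
parser = AoEParserSketch(lambda lba, count, is_write: 0.0028)
parser.service(1.0, lba=8, count=1, write_data=b"x" * SECTOR_SIZE)
done = parser.service(2.0, lba=8, count=1)
```

The point of the split is that data correctness and timing are decided independently: the store answers what the bytes are, and the timing model answers when the target is allowed to see them.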
Temperature emulation is done by the Mercury suite [3], which takes periodically sampled system utilizations (CPU, disk and network) as input and predicts temperatures in different parts of the system through an analytical physical model (Newton’s law of cooling). The temperature emulation can run either online or offline. In our architecture, the Mercury monitor that collects the utilizations is a user-level application on the target. It sends the standard utilization report to a centralized analytical emulator once every target second.
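To give a feel for utilization-driven temperature prediction, here is a minimal Python sketch of a single component cooled according to Newton's law of cooling. It is not Mercury's model or API: the linear power-versus-utilization assumption and every constant (`ambient_c`, `idle_w`, `busy_w`, `k_w_per_c`, `heat_capacity_j_per_c`) are made-up illustrative values.

```python
def step_temperature(temp_c, utilization, dt_s, *, ambient_c=25.0,
                     idle_w=10.0, busy_w=50.0, k_w_per_c=2.0,
                     heat_capacity_j_per_c=100.0):
    """Advance one component's temperature by dt_s seconds (explicit Euler step)."""
    # Assumed: dissipated power grows linearly with utilization in [0, 1].
    power_w = idle_w + utilization * (busy_w - idle_w)
    # Newton's law of cooling: heat loss is proportional to the temperature gap.
    cooling_w = k_w_per_c * (temp_c - ambient_c)
    return temp_c + (power_w - cooling_w) * dt_s / heat_capacity_j_per_c


def simulate(utilizations, start_c=30.0, dt_s=1.0):
    """Feed one utilization sample per step, as with one report per target second."""
    temps = []
    temp = start_c
    for u in utilizations:
        temp = step_temperature(temp, u, dt_s)
        temps.append(temp)
    return temps


busy_temps = simulate([1.0] * 2000)  # fully busy component
idle_temps = simulate([0.0] * 5)     # idle component starting at its idle equilibrium
```

With these placeholder constants the temperature settles where dissipated power equals heat loss, i.e. at `ambient_c + power_w / k_w_per_c`.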
To make a slow processor appear faster, we apply time dilation to the target processor. The basic timing function in an OS is driven by an external timer interrupt. In most Linux implementations, for instance, this interrupt occurs every 10 ms. On each interrupt the OS updates a kernel variable called “jiffies”. This variable is the heartbeat of the OS and underlies its time functions (e.g. gettimeofday). All software obtains its timing information by converting jiffies to normal time units (i.e. milliseconds, seconds). To perform time dilation, we reprogram the hardware timer on LEON but leave the OS unmodified, so that the interval between two consecutive timer interrupts is longer. From the target's viewpoint, more instructions complete between two jiffies. We define the time dilation factor as the ratio of the modified timer interrupt interval to its original value. If the time dilation factor is 5, we “boost” the target clock frequency by a factor of 5. Consequently, running a 1-second program on this dilated system takes 5 seconds of wall clock time. However, this simple dilation makes every part of the system appear 5X as fast. To offset this effect, we slow down the memory and disk emulator accordingly.
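The arithmetic behind the time dilation factor can be written out as a short Python sketch. The 50 MHz physical clock, 10 ms tick and 2 GHz target come from the text; the function names are hypothetical and the code only illustrates the ratios, not how the LEON timer is actually programmed.

```python
NOMINAL_TICK_S = 0.010  # timer interrupt interval assumed by the unmodified OS (10 ms)
PHYSICAL_HZ = 50e6      # clock of the softcore processor on the FPGA


def dilation_factor(target_hz, physical_hz=PHYSICAL_HZ):
    """Factor needed for the physical core to look like a target_hz processor."""
    return target_hz / physical_hz


def dilated_tick_wall_s(factor, nominal_s=NOMINAL_TICK_S):
    """Wall-clock interval between timer interrupts after reprogramming the timer."""
    return nominal_s * factor


def cycles_per_jiffy(factor, physical_hz=PHYSICAL_HZ, nominal_s=NOMINAL_TICK_S):
    """Physical cycles executed between two jiffies, as perceived by the target."""
    return physical_hz * dilated_tick_wall_s(factor, nominal_s)


def wall_clock_s(target_s, factor):
    """Wall-clock time needed to run target_s seconds of target time."""
    return target_s * factor


factor_2ghz = dilation_factor(2e9)                 # 50 MHz core emulating 2 GHz
cycles_2ghz = cycles_per_jiffy(factor_2ghz)        # cycles the target sees per jiffy
wall_for_one_second = wall_clock_s(1.0, factor=5)  # the factor-5 example from the text
```

Because the OS still believes each jiffy is 10 ms, the cycle count per jiffy at the factor for 2 GHz matches what a real 2 GHz processor would execute in 10 ms.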
Experiments
To illustrate the prototype, we run several CPU- and disk-intensive programs with different time dilation factors that emulate processors from 50 MHz up to 2 GHz. We select the Seagate Cheetah 9LP as the target disk model. This is a 10K RPM disk with a 5 ms average seek time. The target disk has a single partition formatted with the ext3 file system. The accuracy of this disk model in DiskSim has already been validated against equivalent physical measurements. For temperature emulation, we use the physical layout profile of a Dell PowerEdge 2850 server (3 GHz Xeon with a 10K RPM SCSI disk). This profile is also pre-calibrated in the Mercury suite. We run the programs in the following order:
- Dhrystone: CPU intensive benchmark
- Postmark [4]: A disk intensive filesystem benchmark
- A standard Unix command with pipe, both CPU and disk intensive:
cat alargefile | grep 'a search pattern' > resultfile
The Postmark benchmark is configured to perform 1000 transactions with a read/write block size of 512 bytes. The large file used in the Unix command is a randomly generated test input for Penny sort. Its size is 100 Mbytes.
Results
The following figure shows the Dhrystone MIPS and the time per Dhrystone loop for different target processor frequencies. Here, we do not include memory time dilation. The Dhrystone result scales linearly with the time dilation factor. The emulated 2 GHz processor achieves 546 MIPS, while a modern 3 GHz x86 processor achieves over 8000 Dhrystone MIPS. This is because LEON has a CPI greater than 2, while most modern out-of-order processors have a CPI below 1. Simply calculating the time dilation factor from frequency is therefore not enough.
If we add memory dilation and assume each memory access takes 10 cycles of the 50 MHz processor, the new Dhrystone result is as follows. The figure shows that the performance of the time-dilated system is bounded by memory performance. Since the memory access latency is only a rough estimate, performance is slightly over-throttled when memory dilation is applied.
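One way to read the memory dilation assumption is sketched below in Python. The 10-cycle access cost and 50 MHz clock are taken from the text; interpreting memory dilation as stretching each access to `10 * factor` physical cycles is an assumption made here for illustration, and the function names are hypothetical.

```python
PHYSICAL_HZ = 50e6         # softcore clock on the FPGA
BASE_ACCESS_CYCLES = 10    # assumed cost of one memory access in 50 MHz cycles


def target_latency_ns(physical_cycles, factor, physical_hz=PHYSICAL_HZ):
    """Latency of a memory access as perceived in target time."""
    wall_ns = physical_cycles / physical_hz * 1e9
    return wall_ns / factor


def dilated_access_cycles(factor, base_cycles=BASE_ACCESS_CYCLES):
    """Physical cycles per access needed to keep the target-time latency constant
    (assumed interpretation of memory dilation)."""
    return base_cycles * factor


undilated_ns = target_latency_ns(BASE_ACCESS_CYCLES, factor=40)       # memory looks 40X faster
dilated_ns = target_latency_ns(dilated_access_cycles(40), factor=40)  # latency restored
```

Under this reading, an emulated 2 GHz processor pays 400 of its own cycles per access, which is consistent with the dilated system being bounded by memory performance.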
The Postmark benchmark result is given in the following figure. Due to a hardware timeout issue in the Ethernet core, we disabled memory dilation in the rest of the experiments. We noticed that the disk performance at the 2 GHz target frequency is only half that of a modern SATA disk on the same benchmark (20-22 MB/s).
One interesting point is that disk performance improves faster than processor performance as the time dilation factor increases. To investigate this, we plot the average time spent in DiskSim and storage on the PC as well as the average round trip time per disk access. The figure makes it clear that, without time dilation, most of the time is spent transmitting packets back and forth over the network, as observed from the target’s viewpoint. As the time dilation factor increases, the time spent on the network appears shorter and shorter, as expected. When the time dilation factor is 40, i.e. a 2 GHz target processor frequency, over 99% of the time is spent on the actual disk “read/write” in the remote simulator. In contrast, the time spent in the remote simulator increases gradually. This is because the actual execution time spent on the remote side appears shorter and shorter as seen by the target. For example, the average disk access time we simulated in DiskSim is around 2.8 ms, and the operation completes within 7 ms of wall clock time. If the dilation factor is 40, the disk simulator completes within 7 ms / 40 = 175 microseconds of target time. Therefore, we need to “delay” the disk simulator to prevent it from running too far ahead, and so more system calls are made on the remote PC to set a delay timer. In addition, once the round trip time drops below 10 ms, the smallest time unit in the OS, measurement in software becomes less accurate. Overall, the total emulated time is a little longer than it should be.
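The delay calculation in this example can be sketched in a few lines of Python. The 2.8 ms simulated access time, 7 ms wall-clock cost and dilation factor of 40 are the figures quoted above; the function names are hypothetical, and the sketch ignores timer granularity and system call overhead.

```python
def target_elapsed_s(wall_s, factor):
    """Target time that passes while wall_s seconds of wall-clock time elapse."""
    return wall_s / factor


def extra_wall_delay_s(simulated_latency_s, wall_spent_s, factor):
    """Extra wall-clock delay before replying so that the target observes the
    simulated latency. Zero means the emulation is already late."""
    needed_wall_s = simulated_latency_s * factor
    return max(0.0, needed_wall_s - wall_spent_s)


seen_by_target_s = target_elapsed_s(0.007, factor=40)         # 7 ms of work in target time
delay_at_40_s = extra_wall_delay_s(0.0028, 0.007, factor=40)  # simulator must wait
delay_at_1_s = extra_wall_delay_s(0.0028, 0.007, factor=1)    # no dilation: already late
```

At a factor of 40 the reply has to be held back for roughly 105 ms of wall-clock time, whereas without dilation the 7 ms of real work already exceeds the 2.8 ms simulated access, so the overhead is directly visible to the target.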
The following figure shows the emulated total disk read/write time in our experiment. With the exception of the run without time dilation, the emulation results are quite deterministic. That run took longer because we performed some extra disk operations during it, such as mounting file systems and editing experiment scripts.
The temperature emulation results are as follows. The emulator outputs CPU and disk utilizations in addition to the emulated chip and disk temperatures. The upper graph is the CPU temperature emulation and the lower one is the disk temperature emulation. From the graph, CPU temperature increases dramatically with utilization. CPU utilization goes down and disk utilization goes up as the time dilation factor grows. This further indicates that the emulation overhead is offset and the accuracy improves. The graph also shows that the disk temperature remains quite stable throughout the experiment. We also observed a shorter completion time, measured in target time, when emulating higher-frequency processors.
Limitations and future work
In this work, we show the basic emulation system and its basic functionality. However, several limitations still keep us from getting more faithful results. The first problem comes from the AoE protocol, which is limited by the size of an Ethernet packet: the maximum disk access is 2 sectors. Another problem is the one we ran into earlier in the Dhrystone result: naïve time dilation is less accurate. Even so, time dilation still seems a promising technique. The emulation overhead can be hidden by a higher dilation factor. Moreover, it removes the need for fine-grained clock control and avoids deadlock when communicating across different clock domains.
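To make the AoE packet limit concrete, the following Python sketch counts how many requests a transfer needs when each request carries at most 2 sectors. The 512-byte sector size is an assumption, the function name is hypothetical, and the count ignores file system metadata and any request merging, so it is only a lower-bound style estimate.

```python
SECTOR_BYTES = 512           # assumed ATA sector size
MAX_SECTORS_PER_REQUEST = 2  # limit stated in the text


def aoe_requests_needed(transfer_bytes):
    """Number of AoE requests needed to move transfer_bytes of data."""
    sectors = -(-transfer_bytes // SECTOR_BYTES)       # integer ceiling division
    return -(-sectors // MAX_SECTORS_PER_REQUEST)


one_block = aoe_requests_needed(512)           # a single Postmark-sized block
large_file = aoe_requests_needed(100 * 10**6)  # 100 Mbytes, read here as 10^8 bytes
```

Every one of these requests pays a network round trip, which is why the per-request overhead matters so much when no time dilation is applied.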
References
1. ATA over Ethernet, http://www.coraid.com/documents/AoEr8.txt
2. The DiskSim simulation environment, http://www.pdl.cmu.edu/DiskSim/
3. T. Heath, A. P. Centeno, P. George, L. Ramos, Y. Jaluria, and R. Bianchini. "Mercury and Freon: Temperature Emulation and Management in Server Systems". Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 2006
4. Postmark benchmark, http://www.devone.org/linux/postmark.html

