RADSClassFall06/projects/ideas

From RAD Lab

Project ideas for cs294-1

#1: Power consumption of servers and their components

How does power utilization vary with workload for a web-like service on a single server? Compare with Partha Ranganathan's results for batch benchmarks.

What's the power overhead of deciding to use virtualization? (running same app on virtualized vs nonvirtualized OS)

How do you compose profiles of apps (when you decide to put multiple apps on one server)? Can you compose power profiles of individual apps in such a way that when the apps are put on the same server, the composed profiles are a good predictor of actual power consumption? This is a key ingredient for formulating migration policy.
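
One naive starting point for composition is an additive model: subtract the server's idle power from each app's standalone profile and sum what remains. The sketch below is only a baseline to compare measurements against, not a validated model. The function names, the linear profiles, and all wattages are hypothetical.

```python
def compose_power(idle_watts, app_profiles, utilizations):
    """Predict server power for co-located apps with a naive additive model.

    app_profiles maps an app name to a function utilization -> watts, measured
    with that app running alone on the server (idle power included).
    """
    dynamic = sum(app_profiles[name](u) - idle_watts
                  for name, u in utilizations.items())
    return idle_watts + dynamic


def relative_error(predicted_watts, measured_watts):
    """How far the composed prediction is from a meter reading."""
    return abs(predicted_watts - measured_watts) / measured_watts


# Hypothetical standalone profiles (assumed linear in utilization).
IDLE_W = 200.0
profiles = {
    "web": lambda u: IDLE_W + 100.0 * u,
    "db": lambda u: IDLE_W + 60.0 * u,
}
predicted = compose_power(IDLE_W, profiles, {"web": 0.5, "db": 0.25})
print(predicted)
```

If the measured power of the co-located apps differs noticeably from the prediction, the gap itself indicates how much the apps interfere (shared caches, disk contention), which is the effect the project would need to model.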

Besides access to more machines, we will have meters and various fun electrical equipment you can use for measurements.


  • measure blade 40 vs cluster
  • performance vs. power (vs space, vs cost of purchase)
  • use Deter machines for testbed

#1a: Disk-related power bottlenecks in software

Is filesystem caching a net power saver or net power waster? How about filesystem prefetching?

The argument for caching is avoiding re-accessing the disk. Disk caching happens at multiple levels, including DRAM in the disk assembly and system DRAM managed by the OS. The question is, is this really more power efficient? What is the right measure of "power efficiency" for accessing persistent data?

A related question is whether filesystem prefetching is a net power saver or not.

Previous work has looked at aggressive prefetching and caching to allow the disk to be put into idle mode more frequently, but did not account for the power needed by the extra DRAM this requires. In particular, it claimed 60-80% disk power savings by using 250MB of system DRAM for caching/prefetching.

Assuming that in a server environment the disks can never really be spun down, a good rule of thumb is disks consume 10W while spinning idle and 20W if reading/writing.
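
With this rule of thumb, the trade-off can be put into numbers. The sketch below compares the disk energy saved by a cache against the energy of the DRAM that holds it. The 10W and 20W figures come from the text above; the busy fractions and the 2W DRAM figure are made-up inputs that real measurements would replace.

```python
DISK_IDLE_W = 10.0    # spinning but idle (rule of thumb above)
DISK_ACTIVE_W = 20.0  # reading or writing (rule of thumb above)


def disk_energy_joules(duration_s, busy_fraction):
    """Energy of a disk that never spins down and is busy part of the time."""
    busy_s = duration_s * busy_fraction
    return DISK_ACTIVE_W * busy_s + DISK_IDLE_W * (duration_s - busy_s)


def net_saving_joules(duration_s, busy_without_cache, busy_with_cache, cache_dram_watts):
    """Disk energy saved by caching minus the energy of the DRAM used as cache."""
    saved = (disk_energy_joules(duration_s, busy_without_cache)
             - disk_energy_joules(duration_s, busy_with_cache))
    return saved - cache_dram_watts * duration_s


# Hypothetical hour-long run: caching cuts disk busy time from 50% to 20%,
# and the cache DRAM is assumed to draw 2 W.
print(net_saving_joules(3600, 0.5, 0.2, cache_dram_watts=2.0))
```

Under these assumptions the cache is a net win only while its DRAM draws less than 10W times the reduction in busy fraction (here 10 x 0.3 = 3W), which is why the DRAM power omitted by previous work matters.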


Possible results could include:

  • Run Andrew benchmark (which stresses both block and file caching) and compare power efficiency with different caching policies (including no-cache), compare both absolute power difference and power/performance difference
  • Compare with a Web-server-like workload (which is read-only and tends to benefit more from whole-file caching)
  • How should future systems be designed to facilitate answering these questions?

#2: Power consumption of network equipment

  • switches & routers
  • ports vs link speed
  • big switch: nortel 96 100Mbit (vs small switch)
  • turn off pieces
  • replace switches with PCs (servers)


#3: cost over time & link power

  • "Katz law"
  • optical vs. copper wire (multimode, distance, ...)


#4: ML (RL) on laptops

  • model of power management of a data center


#5: cost of dependability by sending messages in parallel

#6: estimate of potential power savings

Estimate how much power/dollars/CO2 we could save by turning servers (and/or their components) off. How much could we save by executing requests in a different data center (with cheaper/cleaner power)? How would you handle peaks in traffic? (How long does it take to turn on a server?)

As a realistic workload, use:

  • web server access logs from Ebates.com
    • 5 traces, about 2 months total
    • data from 3 servers (they used total 8 servers)

Approach #1:

  • simply look at the graph of requests per second and estimate when we could turn off a server
  • to estimate the optimal resource use, assume that you can accurately predict the workload (all the peaks, ...) and turn servers on/off based on the predictions. We can't save more power than in this scenario.
  • to be more realistic, at all times keep some spare capacity (like 20%) and turn servers on when the spare capacity drops below the threshold. Depending on the threshold, the web site might slow down a lot when we hit the peaks.
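
The oracle and spare-capacity estimates above can be sketched as follows. This is a simplification: it assumes every server handles the same fixed request rate and that servers switch on instantly, so it ignores boot time. The per-server capacity and the load values are hypothetical; real values would come from the access logs.

```python
def servers_needed(load_rps, capacity_rps, spare_percent=0):
    """Fewest servers that carry load_rps while keeping spare_percent of capacity unused.

    Uses integer arithmetic so exact multiples of the capacity round correctly.
    """
    usable = capacity_rps * (100 - spare_percent)
    return max(1, -(-(load_rps * 100) // usable))  # ceiling division, at least one server


def fraction_saved(loads_rps, capacity_rps, total_servers, spare_percent=0):
    """Fraction of server-intervals that could be switched off versus leaving all servers on."""
    used = sum(min(total_servers, servers_needed(load, capacity_rps, spare_percent))
               for load in loads_rps)
    return 1 - used / (total_servers * len(loads_rps))


# Hypothetical request rates per interval on an 8-server site.
loads = [100, 400, 600, 300]
oracle = fraction_saved(loads, capacity_rps=100, total_servers=8)
with_spare = fraction_saved(loads, capacity_rps=100, total_servers=8, spare_percent=20)
print(oracle, with_spare)
```

With `spare_percent=0` the function gives the perfect-prediction upper bound; raising it shows how much of that saving the safety margin costs.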

Approach #2:

  • the france98.com web site served mostly static files and from the log files we know the list of all files and size of each file. This means that we can reconstruct a working copy of the web site: create all the files with the correct sizes and put Apache in front of it.
  • we can now run realistic experiments on the web site
  • the logs also contain clientID for each request; we can reconstruct the sequence of clicks from individual users. We can now artificially slow down the web site (and delay the request from a given user depending on how much slower the web site is).
  • how much could we save by running the web site 10%, 20%, 50%, or 100% slower?
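
One way to derive the delayed request schedule is to keep each user's think time fixed and push every later click back by the extra time the slower site spent on that user's earlier requests. The sketch below assumes each log entry carries a service time in addition to the clientID and timestamp; that field, the tuple layout, and the function name are assumptions, not something the logs are known to provide.

```python
from collections import defaultdict


def replay_schedule(requests, slowdown):
    """Shift each client's requests to model a site that is `slowdown` times slower.

    requests: (client_id, timestamp_s, service_time_s) tuples sorted by timestamp.
    Returns (client_id, new_timestamp_s) tuples sorted by the new timestamp.
    """
    by_client = defaultdict(list)
    for client_id, timestamp_s, service_s in requests:
        by_client[client_id].append((timestamp_s, service_s))

    schedule = []
    for client_id, clicks in by_client.items():
        shift = 0.0  # extra delay accumulated by this client so far
        for timestamp_s, service_s in clicks:
            schedule.append((client_id, timestamp_s + shift))
            # The response arrives later, so the user's next click moves too.
            shift += service_s * (slowdown - 1.0)
    schedule.sort(key=lambda item: item[1])
    return schedule


# Hypothetical trace with two clients; 1.5 means the site runs 50% slower.
trace = [("a", 0.0, 1.0), ("b", 1.0, 2.0), ("a", 5.0, 1.0)]
print(replay_schedule(trace, 1.5))
```

Replaying the shifted schedule against the reconstructed site at each slowdown level would then give the load each configuration has to sustain.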

#7: Flexible Rails infrastructure

To maximize resource utilization and minimize power consumption, we need an infrastructure that lets us easily add, remove, and redistribute resources. A typical Rails application deployed using VMs would use three types of VMs: 1) web server VMs, 2) dispatcher VMs, and 3) MySQL VMs (plus there might be other VMs for software load balancing and handling cache/sessions using memcached). Once you set up the application using VMs, VMWare (or Xen) lets you change the allocation of CPU/memory and live-migrate each VM. This means that when the workload increases, we can migrate each VM onto a dedicated physical server; when the workload decreases, we can use fewer machines.

This sounds great, but it still won't let us achieve the optimal use of available resources. Here's an example: imagine that you have a Rails application with 3 web server VMs, 10 dispatcher VMs (each running 5 dispatchers), and 1 MySQL VM and assume that during peak load we need to put each VM on a dedicated server to get the performance we need. When the load at night drops to 1% of the peak load, we can handle the requests on a single server. We might be able to migrate all the VMs to a single server, but that has a huge overhead -- we're running 14 copies of an OS while we really just need 3 (1 web server VM, 1 dispatcher VM, and 1 MySQL). Even if we run just 1 dispatcher VM, 1 dispatcher (instead of 5) might be enough -- killing 4 dispatchers would free up memory that could be used for something else. If the load increases above the previous peak, we might need to add another web server or dispatcher VM.

The goal of this project is to add the above-mentioned capabilities to a Rails-based web app and ensure that they have minimal impact on performance of the application.

  • turning VMs on/off: This can be done using VMWare, but the mapping between web servers, dispatchers, and MySQL is hardcoded in the config files and turning a VM off might have negative consequences. When you turn a (web server or dispatcher) VM off, you have to make sure that the previous tier (load balancer, web server) stops sending requests to it; when you turn a VM on, you have to make sure that the previous tier will detect it and start sending requests to it. Also, if you turn off a web server VM, the dispatchers that were used by that web server need to be mapped to one of the remaining web servers; this would probably require changing the config file and restarting the web server. It might be easier to add a software load balancer (such as HAProxy) between web server and dispatchers. With multiple web servers, there should be another load balancer in front of them.
  • changing the number of dispatchers in a dispatcher VM: if you have one dispatcher VM with 5 dispatchers with 3GHz allocated to it and another with just 500MHz allocated to it, it will take much longer to execute requests in the second VM. One of the solutions is to simply reduce the number of dispatchers in that VM. Another is to change the weights in the dispatcher load balancer to send fewer requests to the second VM.
  • adding a new dispatcher/web server VM: if you have 3 web server VMs and you need to add another one, you need to clone one of the existing web server VMs and change the configuration to add it to your application. If you do the mapping between web servers, dispatchers, and database through load balancers, the web server VMs should be identical (and the dispatcher VMs as well).
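
Both remedies for unevenly provisioned dispatcher VMs can be sketched with a simple proportional rule. The assumption that request throughput scales linearly with allocated CPU is ours and would need to be checked by measurement; the function names and the weight scale of 100 are hypothetical.

```python
def dispatcher_weights(cpu_mhz_by_vm, scale=100):
    """Load-balancer weights proportional to each dispatcher VM's CPU allocation."""
    total = sum(cpu_mhz_by_vm.values())
    return {vm: max(1, round(scale * mhz / total))
            for vm, mhz in cpu_mhz_by_vm.items()}


def dispatchers_for(cpu_mhz, reference_mhz=3000, reference_dispatchers=5):
    """Scale the dispatcher count down from a reference VM (5 dispatchers at 3GHz)."""
    return max(1, round(reference_dispatchers * cpu_mhz / reference_mhz))


# The two VMs from the example above: 3GHz and 500MHz.
print(dispatcher_weights({"vm1": 3000, "vm2": 500}))
print(dispatchers_for(500))
```

The computed weights would still have to be written into whatever load balancer sits in front of the dispatchers, and the project's performance measurements would show whether linear scaling is a reasonable approximation.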

Machine Learning twist: depending on the type of the application/VM and the workload, each action could take different time to execute and have different impact on the other VMs (and itself). Create statistical models that will let us predict the duration and impact based on any relevant metrics.

#8: Automate Lab 3

This goes with #7. In Lab 3 you had to manually reconfigure/reallocate system components to find a good operating point for a multi-machine ResearchIndex. This project would try to build a machine learning model -- possibly using reinforcement learning, artificial neural nets, or decision trees -- that would automate some of the tuning done manually in Lab 3. The model would be given, e.g., the various parameters and actions to tweak them (e.g., how to deploy or undeploy an extra dispatcher, how to get a slave DB running to service additional reads, etc.) and would be able to measure metrics of interest like latency vs. offered load, throughput vs. offered load, etc. The results of Lab 4 (costs of taking reconfiguration actions to reallocate resources) would figure into this as well.

#9: path-based analysis using *trace

#10: log/data aggregation using Splunk

#11: Combining visualization with machine learning - clustering

Blaine Nelson and some folks at HP Labs have researched how to include prior constraints provided by human operators when clustering possible causes of system performance failures. For example, an operator might label two data points collected at different times to be symptomatic of the same problem.

What is lacking is a good way to present data to operators for labeling. This project would prototype one or more visualization-based interfaces that are operator-friendly and facilitate labeling of large datasets, allowing different clustering algorithms to be rapidly tried.

#12: What is the power overhead of virtualization?

Server migration facilitated by VMs is an important potential ingredient in future datacenters. But what is the power consumption overhead introduced by virtualizing an application, compared to running that application on the native OS and hardware? How does the overhead vary with workload, i.e., is the "fixed cost" (in power) substantially higher with virtualization than without it?

#13: Advanced queries for load-balancing based on distributed triggers

Recent research has proposed efficient protocols for distributed triggers, which can be used in monitoring infrastructures to maintain system-wide invariants and detect abnormal events with minimal communication overhead. Triggering protocols can be generalized to support advanced queries for hot spot detection in distributed systems. A few examples follow. Relative triggers fire an alarm if the total workload of servers in set A is more than b times that of set B; any-set triggers fire an alarm if the total workload of any a% of servers is more than a given threshold; composite triggers fire an alarm if the total workload of any a% of servers is more than a fraction b of the total system workload. A more detailed description is in Section 4 of the paper http://www.cs.berkeley.edu/~hling/research/minenet.pdf .
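
To pin down what each trigger tests, the sketch below evaluates the three conditions centrally over a full snapshot of per-server workloads. It is not the communication-efficient distributed protocol from the paper; it only states the predicates such a protocol would have to detect. All names and values are hypothetical.

```python
def relative_trigger(load, set_a, set_b, b):
    """Fire if the total workload of set A exceeds b times that of set B."""
    return sum(load[s] for s in set_a) > b * sum(load[s] for s in set_b)


def any_set_trigger(load, a_percent, threshold):
    """Fire if some a% of the servers together exceed the threshold.

    Checking the heaviest servers suffices: they form the subset of that
    size with the largest total.
    """
    k = max(1, len(load) * a_percent // 100)
    heaviest = sorted(load.values(), reverse=True)[:k]
    return sum(heaviest) > threshold


def composite_trigger(load, a_percent, b_fraction):
    """Fire if some a% of the servers carry more than b_fraction of the total workload."""
    return any_set_trigger(load, a_percent, b_fraction * sum(load.values()))


# Hypothetical workload snapshot for four servers.
load = {"s1": 50, "s2": 30, "s3": 10, "s4": 10}
print(relative_trigger(load, ["s1"], ["s3", "s4"], 2))
print(any_set_trigger(load, 50, 75))
print(composite_trigger(load, 25, 0.6))
```

A distributed implementation would be judged against this centralized answer when measuring triggering accuracy.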

This project would prototype one or more triggers and evaluate their performance (communication overhead, triggering accuracy) in a distributed environment. If you are interested in this project, please contact Ling Huang at hling AT cs.berkeley.edu for more details.

#14: RAMP related projects

RAMP related projects are listed here.

#15: Networking power

One of the interesting areas for power management in the data center is the networking component. (See the recent SIGCOMM paper on green Internet.) At one level, networking equipment is very similar to servers, and many optimizations applied to servers can also be applied to networking equipment. However, at another level, there are many interesting differences in the nature of power bottlenecks at individual systems, the choices provided in topology rerouting, and the implications of various power control choices. A good project would be to evaluate specific ideas that we have been considering in terms of load balancing and traffic consolidation to aid in networking power management, as well as to consider trading off redundancy in topology for more power efficiency. (Talk to me for more details.)

#16: "No power struggles"

Power management at various levels of the solution, from chips to data centers, often operates in an uncoordinated manner. To make matters worse, there is interference between solutions designed for peak power throttling and those designed for average power improvement. Finally, there are complexities introduced in combining power management with cooling management and performance management, each of which has its own version of intersecting controllers. There are many opportunities for interesting research in combining and coordinating these individual control loops for a multi-level TCO-aware power management architecture.