Adaptive Power Management/Vision
A Case for Adaptive Power Management within and across Data Centers
Peter Bodik, Armando Fox, Michael Jordan, Randy Katz, David Patterson, plus cs294-1 students who choose to help
RAD Lab, UC Berkeley
Motivation
Data centers are maxed out; utilities are very concerned about power efficiency at data centers; CO2 emissions from computing
Examples:
- Google in East Washington (cite NYT)
- Pixar data center limited by cooling (cite)
- Despite well-known variation in demand by hour and day of week [...], data center power consumption does not vary
- Cite House Bill directing EPA Leader to study power efficiency of data centers
- Old conventional wisdom: data centers limited by space
- New conventional wisdom: data centers limited by power or cooling capacity as power density per rack increases
- [Figure? Power density per rack over time?]
- [Citation to % power used in IT???]
- [Dave send to ACM Washington guy]
- Figure: power per component over time, power per rack over time?
- Table of measurements of Berkeley owned computers from last 7 years
Why: unpredictability of effect of reducing power
- Turn off machines => what if load spike
- Run CPUs slower => bad performance?
- lack of good models mapping workload to performance as fn. of power
- lack of good models mapping workload to power consumption
- concerned that power cycling reduces HW reliability (wear and tear)
- even less knowledge about network switches than about computers
Power savings graph: 3 axes
- time: never, weekend/holiday, daily, hourly, minute, seconds, ms
- granularity: ...
use a standard performance benchmark and add power:
- transactions per watt
- power consumed to run the whole benchmark
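A minimal sketch of how these two metrics could be computed from a single benchmark run. It assumes "transactions per watt" means throughput (transactions per second) divided by average power, which is the same as transactions per joule. The `BenchmarkRun` record and its field names are hypothetical, and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkRun:
    transactions: int    # transactions completed during the run
    duration_s: float    # wall-clock length of the run, in seconds
    avg_power_w: float   # average power drawn during the run, in watts


def energy_joules(run: BenchmarkRun) -> float:
    """Energy consumed to run the whole benchmark (watts x seconds)."""
    return run.avg_power_w * run.duration_s


def energy_kwh(run: BenchmarkRun) -> float:
    """Same energy in kilowatt-hours (1 kWh = 3.6e6 J)."""
    return energy_joules(run) / 3.6e6


def transactions_per_watt(run: BenchmarkRun) -> float:
    """Assumed definition: throughput (transactions/s) per watt of average power."""
    throughput = run.transactions / run.duration_s
    return throughput / run.avg_power_w


# Made-up run: 1.2M transactions in 10 minutes at an average of 250 W.
run = BenchmarkRun(transactions=1_200_000, duration_s=600.0, avg_power_w=250.0)
print(transactions_per_watt(run), energy_kwh(run))
```

Reporting both numbers matters because a slower configuration can improve transactions per watt while still failing the benchmark's performance target.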
Current state of the art of power management
- Nothing; cite personal communication
- HP data center guys
- HP product people (limit per rack)
- IBM adjustable power of computer by peak and by average
Power consumption of data center components
Power consumption of all typical data center components:
- server (as a whole)
- server components:
- CPU (individual cores, DVS)
- [Figure of Watts per CPU over time, Watts per SPEC int over time]
- [MPU do new power control features]
- disk (different modes)
- [Basically flat over time, but GB capacity/watt improves]
- memory
- [Per DIMM over time?]
- network cards
- switches/routers
- A/C
- Table showing idle, peak, avg power of data center components: switches, raid boxes, computers
- Figure(s): for some components, showing power vs. activity (like AMD); need a workload generator per component
- Figures showing where the power goes per tier for some app; lots of front ends; where to look to save
- Laptops demonstrate the viability of power aware design from standard components, but servers and data centers have not yet taken advantage of the power saving abilities of these power saving components
- What's the state-of-the-art in server power management (what components can be turned off or have low-power modes)? John Shalf mentioned that in some of the new servers you can turn off individual cores.
Future trends
Companies working on servers where you can turn off cores, memory banks, network cards
Price of electricity
- In California, all bigger data centers are on E-19 (less than 1MW) or E-20 (more than 1MW). Describe E-20.
- Demand response programs.
- Table: examples of current electricity rate schedules (including voluntary options)
- Example fictitious graphs: total amount of power is the same, but the charging schemes vary. Best case, worst case, flat case.
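The fictitious best/worst/flat cases can be sketched numerically. The example below uses three hourly load profiles with identical total energy and prices them under three hypothetical charging schemes (flat rate, time-of-use, and energy plus a demand charge on the peak). All rates, the on-peak window, and the function names are illustrative assumptions; they are not the actual E-19 or E-20 schedules.

```python
PEAK_HOURS = range(12, 18)  # hypothetical on-peak window: noon to 6pm


def flat_cost(load_kw, rate=0.10):
    """Flat $/kWh rate. Each entry is one hour, so kW equals kWh."""
    return sum(load_kw) * rate


def tou_cost(load_kw, peak_rate=0.20, offpeak_rate=0.06):
    """Time-of-use: a higher $/kWh rate during the on-peak window."""
    return sum(
        kw * (peak_rate if hour in PEAK_HOURS else offpeak_rate)
        for hour, kw in enumerate(load_kw)
    )


def demand_cost(load_kw, energy_rate=0.08, demand_rate=0.50):
    """Energy charge plus a charge on the highest hourly demand (kW)."""
    return sum(load_kw) * energy_rate + max(load_kw) * demand_rate


# Three 24-hour profiles, each totalling 2400 kWh.
flat_load = [100] * 24
day_heavy = [250 if h in PEAK_HOURS else 50 for h in range(24)]    # worst case
night_heavy = [250 if h < 6 else 50 for h in range(24)]            # best case

for name, load in [("flat", flat_load), ("day", day_heavy), ("night", night_heavy)]:
    print(name, sum(load), flat_cost(load), tou_cost(load), demand_cost(load))
```

Under the flat rate all three profiles cost the same; the time-of-use scheme rewards shifting load out of the on-peak window, and the demand charge penalizes any peak regardless of when it occurs.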
Future trends
- Price of electricity changes every hour (Canada)
- PG&E might soon require all customers to participate in demand response programs
How to save power
- Turning on and off: how long does it take? what is effect on reliability of component?
- Use lower power modes: how long does it take? how much savings over regular mode?
- Main idea: for each application in a DC, use only the resources we need to keep the performance at a certain level (other resources: put into low-power mode or turn off)
- You can also save power by using more efficient equipment, but that's probably not what we want to do. We should mention it though. -Peter Bodik 9/19/06, 10:29am
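The main idea above can be sketched as a simple provisioning rule: given the current request rate and an estimate of what one server can handle at the target performance level, keep just enough servers active (plus some headroom for load spikes) and put the rest into a low-power mode. The linear capacity model, the headroom factor, and all names here are assumptions made for illustration; a real application resource model would replace `per_server_capacity`.

```python
import math


def servers_needed(request_rate, per_server_capacity, headroom=0.25):
    """Servers required to serve request_rate with a safety margin.

    per_server_capacity is the (assumed) request rate one server can
    sustain while still meeting the performance target.
    """
    return math.ceil(request_rate * (1 + headroom) / per_server_capacity)


def plan(request_rate, total_servers, per_server_capacity,
         headroom=0.25, min_active=1):
    """Split the pool into active servers and servers in low-power mode."""
    wanted = servers_needed(request_rate, per_server_capacity, headroom)
    active = min(total_servers, max(min_active, wanted))
    return {"active": active, "low_power": total_servers - active}


# Hypothetical pool of 30 servers, each good for 100 requests/s.
print(plan(request_rate=1000, total_servers=30, per_server_capacity=100))
```

The headroom term is a crude stand-in for the load-spike concern raised earlier; the time needed to bring a server back from low-power mode would determine how large it has to be.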
Challenges
- Application resource model
For each application we need to understand what resources are needed to keep performance at a certain level.
- Applications and workloads change continuously
We need to adapt our models/algorithms accordingly.
- VMs and live-migration
What's the overhead of VMs and migration? Which applications can coexist on the same server?
How to save the planet
- different parts of the country have cleaner power
How to save money on power
Once we can't save power any more, we can save money by:
- running background jobs in the off-peak hours to avoid excessive demand charges
- moving requests/jobs to data centers with lower price of electricity
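Both options amount to choosing where and when to run deferrable work. A minimal sketch, assuming a table of hourly electricity prices per data center and a job with a known energy need and a deadline; the site names, prices, and function name are hypothetical, and the sketch ignores the cost of moving data and any latency constraints.

```python
def cheapest_slot(prices, job_kwh, deadline_hour):
    """Return (site, start_hour, cost) of the cheapest one-hour slot.

    prices: {site: [price in $/kWh for hour 0, 1, 2, ...]}
    Only hours strictly before deadline_hour are considered.
    """
    best = None
    for site, hourly in prices.items():
        for hour, price in enumerate(hourly[:deadline_hour]):
            cost = price * job_kwh
            if best is None or cost < best[2]:
                best = (site, hour, cost)
    return best


# Made-up prices for two sites over four hours.
prices = {
    "west": [0.12, 0.10, 0.08, 0.15],
    "east": [0.09, 0.11, 0.07, 0.20],
}
print(cheapest_slot(prices, job_kwh=50, deadline_hour=4))
```

A tighter deadline shrinks the set of candidate slots, so the savings depend on how much slack the background jobs actually have.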
Moving workload between data centers
- could save energy: A/C units have to work less in colder weather (i.e., night/winter)
- could save money: energy is cheaper at night/winter
Experiments in Power/Performance Tradeoffs
- HP Soccer 1998 World Cup web traces (~ Ebates); project description already in class wiki
- 5 months, showing all traffic
- Known to have had 30 processors (memory, disks, and networks unknown)
- Run it on VMware on n processors of X cluster, show power/CO2/$ consumption under different policies for a given set of equipment, and report the latency and bandwidth
- Only turn large components on and off, at increasingly finer granularity of time
- Run same experiment, this time using Ebates data (plus there are other traces available: see the project description in the wiki)
- Best-case savings using traces: see impact on latency, or vary latency requirements (typical: within a window of N minutes, must have >X% of requests meet the latency requirement); how is that a function of spin-up and spin-down time?
- Allows suggestions of the benefits of better HW and SW as well as better ML for control
- Other modes than on and off from earlier experiments
- Statistical model resource requirements for applications per tier to reach necessary performance
- Try moving VMs: cost of migration in time and power; do the migrations ourselves; who knows what the impact is
- What about multiple apps on multiple VMs: replicate during the day, then move to a subset of machines at night; do a best-case scenario, say every minute
- What is the impact of multiple simultaneous migrations on the sender and receiver
- What is the impact on business income of saving electricity as we vary the time granularity
- use statistical models to gauge impact of these changes
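The typical latency requirement mentioned above (within a window of N minutes, more than X% of requests must meet the latency limit) can be checked against a trace with a few lines of code. This is a sketch under stated assumptions: requests arrive as `(timestamp_s, latency_s)` pairs, windows are non-overlapping and aligned to time zero, and windows with no requests are ignored. The function name and the sample trace are hypothetical.

```python
from collections import defaultdict


def sla_violations(requests, window_s, latency_limit_s, min_fraction):
    """Return start times (s) of windows that fail the latency requirement.

    A window passes only if strictly more than min_fraction of its
    requests have latency <= latency_limit_s.
    """
    total = defaultdict(int)
    ok = defaultdict(int)
    for timestamp_s, latency_s in requests:
        window = int(timestamp_s // window_s)
        total[window] += 1
        if latency_s <= latency_limit_s:
            ok[window] += 1
    return [
        window * window_s
        for window in sorted(total)
        if ok[window] / total[window] <= min_fraction
    ]


# Made-up trace: (timestamp in seconds, latency in seconds).
trace = [(0, 0.1), (10, 0.2), (50, 2.0), (70, 0.1), (80, 3.0), (110, 4.0)]
print(sla_violations(trace, window_s=60, latency_limit_s=1.0, min_fraction=0.5))
```

Running the same check over the output of each power policy gives a common yardstick for comparing savings against the latency cost of spin-up and spin-down delays.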
Fallacies and pitfalls
CW: Fallacy: It will take years before OSes support low-power operation (cite manpages)
