Adaptive Power Management/Vision

From RAD Lab

A Case for Adaptive Power Management within and across Data Centers

Peter Bodik, Armando Fox, Michael Jordan, Randy Katz, David Patterson, plus cs294-1 students who choose to help

RAD Lab, UC Berkeley

Motivation

Data centers are maxed out; utilities are very concerned about power efficiency at data centers; CO2 emissions from computing

Examples:

  • Google in eastern Washington (cite NYT)
  • Pixar data center limited by cooling (cite)
  • Despite well-known variation in demand by hour and day of week [...], data center power consumption does not vary
  • Cite House bill directing the EPA to study power efficiency of data centers
  • Old conventional wisdom: data centers limited by space
  • New conventional wisdom: data centers limited by power or cooling capacity as power density per rack increases
  • [Figure? Power density per rack over time?]


  • [Citation to % power used in IT???]
  • [Dave send to ACM Washington guy]
  • Figure: power per component over time, power per rack over time?
  • Table of measurements of Berkeley-owned computers from last 7 years


Why: unpredictability of effect of reducing power

  • Turn off machines => what if the load spikes?
  • Run CPUs slower => bad performance?
  • lack of good models mapping workload to performance as a function of power
  • lack of good models mapping workload to power consumption
  • concerned that power cycling reduces HW reliability (wear and tear)
  • even less knowledge about network switches than about computers


Power savings graph: 3 axes

  • time: never, weekend/holiday, daily, hourly, minute, seconds, ms
  • granularity: ...


use a standard performance benchmark and add power:

  • transactions per watt
  • power consumed to run the whole benchmark
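
A minimal sketch of how the two metrics above could be computed from a benchmark run, assuming power is sampled at a fixed interval during the run. The function name, the sampling setup, and the sample numbers are hypothetical. Note that throughput (transactions per second) divided by average power (watts) is the same quantity as transactions per joule.

```python
def benchmark_power_metrics(transactions, power_samples_w, interval_s):
    """Summarize a benchmark run.

    transactions    -- total transactions completed during the run
    power_samples_w -- power readings in watts, one per sampling interval (assumed)
    interval_s      -- seconds between readings (assumed constant)
    """
    duration_s = len(power_samples_w) * interval_s
    # Energy is approximated as the sum of (power * interval) over all samples.
    energy_j = sum(power_samples_w) * interval_s
    avg_power_w = energy_j / duration_s
    throughput_tps = transactions / duration_s
    return {
        "energy_wh": energy_j / 3600.0,          # energy for the whole benchmark
        "avg_power_w": avg_power_w,
        "tps_per_watt": throughput_tps / avg_power_w,  # equals transactions per joule
    }


# Hypothetical run: 60,000 transactions in 60 s, four power samples 15 s apart.
metrics = benchmark_power_metrics(
    transactions=60000,
    power_samples_w=[200.0, 250.0, 300.0, 250.0],
    interval_s=15.0,
)
print(metrics)
```

Reporting both numbers is useful because a system can score well on transactions per watt while still consuming more total energy if the run takes longer.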


Current state of the art of power management

  • Nothing; cite personal communication
  • HP data center guys
  • HP product people (limit per rack)
  • IBM adjustable power of computer by peak and by average

Power consumption of data center components

Power consumption of all typical data center components:

  • server (as a whole)
  • server components:
    • CPU (individual cores, DVS)
    • [Figure of Watts per CPU over time, Watts per SPEC int over time]
    • [MPU do new power control features]
    • disk (different modes)
    • [Basically flat over time, but GB capacity/watt improves]
    • memory
    • [Per DIMM over time?]
  • network cards
  • switches/routers
  • A/C
  • Table showing idle, peak, avg power of data center components: switches, raid boxes, computers
  • Figure(s): for some components, showing power vs. activity (like AMD); need a workload generator per component
  • Figures showing where the power goes per tier for some app (lots of front ends); where to look for savings
  • Laptops demonstrate the viability of power-aware design from standard components, but servers and data centers have not yet taken advantage of the power-saving abilities of these components
  • What's the state of the art in server power management (what components can be turned off or have low-power modes)? John Shalf mentioned that in some of the new servers you can turn off individual cores.

Future trends

Companies are working on servers where you can turn off cores, memory banks, and network cards


Price of electricity

  • In California, all bigger data centers are on E-19 (less than 1MW) or E-20 (more than 1MW). Describe E-20.
  • Demand response programs.
  • Table: examples of current electricity rate schedules (including voluntary options)
  • Example fictitious graphs: total amount of power is the same, but the charging scheme varies. Best case, worst case, flat case.
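
The best/worst/flat comparison above can be illustrated with a small sketch. Everything numeric here is an assumption made for illustration: the rates, the noon-to-6pm peak window, and the per-day demand charge are hypothetical and are not taken from E-19, E-20, or any real schedule. The three load profiles use the same total energy per day.

```python
PEAK_HOURS = range(12, 18)  # assumed peak window: noon to 6 pm


def daily_cost(load_kw, peak_rate=0.15, offpeak_rate=0.08, demand_charge_per_kw=0.50):
    """Cost of one day given 24 hourly average loads in kW.

    All rates are hypothetical. The bill has two parts:
    a time-of-use energy charge and a demand charge on the
    highest load seen during the peak window.
    """
    energy_charge = sum(
        kw * (peak_rate if hour in PEAK_HOURS else offpeak_rate)
        for hour, kw in enumerate(load_kw)
    )
    demand_charge = demand_charge_per_kw * max(load_kw[hour] for hour in PEAK_HOURS)
    return energy_charge + demand_charge


# Three fictitious profiles, each totalling 2400 kWh per day.
flat = [100.0] * 24
worst = [220.0 if hour in PEAK_HOURS else 60.0 for hour in range(24)]
best = [40.0 if hour in PEAK_HOURS else 120.0 for hour in range(24)]

for name, profile in [("flat", flat), ("worst", worst), ("best", best)]:
    print(name, sum(profile), round(daily_cost(profile), 2))
```

Under these assumed rates the flat profile costs about 284, the worst case about 394, and the best case about 229, even though all three consume the same energy.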


Future trends

  • Price of electricity changes every hour (Canada)
  • PG&E might soon require all customers to participate in demand response programs


How to save power

  • Turning on and off: how long does it take? what is effect on reliability of component?
  • Use lower power modes: how long does it take? how much savings over regular mode?
  • Main idea: for each application in a DC, use only the resources we need to keep the performance at a certain level (other resources: put into low-power mode or turn off)
  • You can also save power by using more efficient equipment, but that's probably not what we want to do. We should mention it though. -Peter Bodik 9/19/06, 10:29am
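
A minimal sketch of the main idea above, assuming a deliberately simple capacity model in which each server handles a fixed request rate while meeting the performance target. The function names, the linear model, and the 20% headroom are hypothetical; the resource models discussed under Challenges would replace them.

```python
import math


def servers_needed(request_rate, capacity_per_server, headroom=0.2):
    """Servers required to serve request_rate with spare headroom.

    capacity_per_server is the assumed request rate one server can
    sustain while still meeting the performance target.
    """
    needed = math.ceil(request_rate * (1.0 + headroom) / capacity_per_server)
    return max(1, needed)  # always keep at least one server on


def plan_power_states(total_servers, request_rate, capacity_per_server, headroom=0.2):
    """Return (active, idle): idle servers are candidates for low-power mode or power-off."""
    active = min(total_servers, servers_needed(request_rate, capacity_per_server, headroom))
    return active, total_servers - active


# Hypothetical example: 30 servers, each good for 100 requests/s.
print(plan_power_states(30, 950, 100))    # light load
print(plan_power_states(30, 5000, 100))   # load beyond the cluster's capacity
```

The headroom parameter is where the fear of load spikes shows up: a larger value keeps more machines on and gives up some savings in exchange for safety.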

Challenges

  1. Application resource model

For each application we need to understand what resources are needed to keep performance at a certain level.

  2. Applications and workloads change continuously

We need to adapt our models/algorithms accordingly.

  3. VMs and live migration

What's the overhead of VMs and migration? Which applications can coexist on the same server?

How to save the planet

  • different parts of the country have cleaner power

How to save money on power

Once we can't save power any more, we can save money by:

  • running background jobs in the off-peak hours to avoid excessive demand charges
  • moving requests/jobs to data centers with lower price of electricity
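
As an illustration of the first point, here is a sketch of a greedy placement of deferrable background work into the hours with the lowest foreground load, so that the combined load never exceeds a chosen cap. The function name, the cap, and the load numbers are hypothetical, and hours are treated as one-hour slots so that kW and kWh are interchangeable per slot.

```python
def schedule_background(foreground_kw, job_kwh, cap_kw):
    """Spread job_kwh of deferrable work over one-hour slots.

    Fills the least-loaded hours first and never pushes
    foreground + background above cap_kw, so the job does not
    raise the peak that a demand charge would be based on.
    """
    background = [0.0] * len(foreground_kw)
    remaining = job_kwh
    # Visit hours from lowest to highest foreground load.
    for hour in sorted(range(len(foreground_kw)), key=lambda h: foreground_kw[h]):
        if remaining <= 0:
            break
        spare = max(0.0, cap_kw - foreground_kw[hour])
        used = min(spare, remaining)
        background[hour] = used
        remaining -= used
    if remaining > 1e-9:
        raise ValueError("not enough spare capacity under the cap")
    return background


# Hypothetical 8-hour window; the evening hours are already near the cap.
foreground = [40.0, 30.0, 30.0, 50.0, 80.0, 100.0, 100.0, 90.0]
plan = schedule_background(foreground, job_kwh=100.0, cap_kw=100.0)
print(plan)
```

A price-aware variant would sort hours by electricity price instead of by load; the same loop applies.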

Moving workload between data centers

  • could save energy: A/C units have to work less in colder weather (i.e., night/winter)
  • could save money: energy is cheaper at night/winter

Experiments in Power/Performance Tradeoffs

  1. HP Soccer 1998 World Cup web traces (~ Ebates); project description already in class wiki
    • 5 months, showing all traffic
    • Known to have had 30 processors (memory, disks, and networks unknown)
    • Run it on VMware on n processors of X cluster, show power/CO2/$ consumption under different policies for a given set of equipment, and report the latency and bandwidth
    • Only turn large components on and off, at increasingly fine time granularity
  2. Run same experiment, this time using Ebates data (plus there are other traces available: see the project description in the wiki)
  3. Best-case savings using traces: see impact on latency, or vary latency requirements (typical: within a window of N minutes, >X% of requests must meet the latency requirement); how is that a function of spin-up and spin-down time?
  4. Allows suggestions of the benefits of better HW and SW as well as better ML for control
    • Other modes than on and off from earlier experiments
  5. Statistically model the resource requirements of applications per tier to reach the necessary performance
  6. Try moving VMs: cost of migration in time and power; do the migrations ourselves; who knows what the impact is
  7. What about multiple apps on multiple VMs: replicate during the day, then move to a subset of machines at night; do a best-case scenario, say every minute
    • What is the impact of multiple simultaneous migrations on the sender and receiver?
  8. What is the impact on business income of saving electricity as we vary the time granularity?
    • use statistical models to gauge impact of these changes
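
The trace-driven experiments above could start from a toy simulation like the following sketch. It assumes the trace has already been reduced to the number of servers needed per time interval, and it uses a purely reactive policy in which capacity follows demand with a fixed spin-up delay (spin-down is treated as instant). The function name, the policy, and the trace values are hypothetical; a delay of zero corresponds to the best case.

```python
def simulate_reactive(demand, total_servers, spinup_delay):
    """Replay a demand trace against a reactive on/off policy.

    demand[t]    -- servers needed in interval t (assumed precomputed from a trace)
    spinup_delay -- intervals between deciding to power on and being usable
    All servers are assumed to be on during the first spinup_delay intervals.
    """
    active = []
    for t in range(len(demand)):
        if t < spinup_delay:
            active.append(total_servers)
        else:
            # Capacity at t reflects the demand observed spinup_delay intervals ago.
            active.append(min(total_servers, demand[t - spinup_delay]))
    used = sum(active)
    violations = sum(1 for a, d in zip(active, demand) if a < d)
    savings = 1.0 - used / (total_servers * len(demand))
    return {"server_intervals": used, "violations": violations, "savings": savings}


# Hypothetical demand trace for a 30-server cluster.
trace = [10, 10, 12, 20, 28, 30, 22, 12, 10, 10]
best_case = simulate_reactive(trace, total_servers=30, spinup_delay=0)
delayed = simulate_reactive(trace, total_servers=30, spinup_delay=2)
print(best_case)
print(delayed)
```

Comparing the two runs shows the shape of the tradeoff: a longer spin-up time both reduces the savings and increases the number of intervals in which capacity falls short of demand. A latency requirement of the form "more than X% of requests within a window of N minutes" would replace the simple violation count.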


Fallacies and pitfalls

CW: Fallacy: It will take years before OSes support low-power operation (cite manpages)