Cs294-spring08/Discussion
From RAD Lab
May 7: Postmortem/Reflections
Software stack for 21st century:
- Program = Web 2.0 app
- Client OS = browser
- Server = virtualized cloud computing
- Browser is new client OS: memory management for JS, scheduling/async threads, security liability
- Rob claims no 80/20 rule, same as OS claim that led to Microkernels => what will "browser microkernel" be?
- What about cross-browser incompatibilities and poorly-defined "corner cases" where different specs interact (eg CSS + Javascript)?
- What about cases like the Javascript bugs survey (timers, bad assumptions about DOM not changing over time, etc) - how do we create dependable apps under these circumstances?
- What's left for traditional client OS? Local storage (but Google Gears allows disconnected Web 2.0 apps), native apps
HaaS/Utility computing:
- Can buy latency as well as thruput: cost of 1000 machines x 1 hr = cost of 1 machine for 1000 hrs
- Run batch jobs with max parallelism, since cost is same as running longer with less parallelism? If you really have truly idle machines, un-lease them.
Networking:
- Why build intra-DC network with outside-DC technology?
- + handles bandwidth
- - extra cost, unneeded functionality can get in the way
- In DC, you control topology => design for specific assumptions, eg 64-node destination-addressed hypercube vs generalized N^2 routing
- May even want to tweak/replace protocols (beyond just tuning TCP/IP)
Server software:
- Goal: solve any SW problem by adding or dropping machines
- "Fire" bad servers daily (instead of trouble tickets)
- "Lay off" servers when load drops, "rehire" when rises
- Implication: can do cost/performance for reliabliity
- Cheap hardware which can fail; clever software compensates
- Cost of programmers vs. cost of machines - compared to programmers, HW is "free"
Storage as a Service (persistent and time-limited persistent):
- Today: lots of point solutions (S3, Bigtable/Python, SimpleDB, Salesforce/Apex...)
- "S3 means no VC" - pay as you go, even in small quantity
- How bad is it?
- IMDB runs nightly batch job rather than try to process updates; MySQL cluster (handles votes) only 92% available, master-slave replication is buggy
- Facebook has ~2k MySQL servers, 25TB RAM cache, hacks to keep consistency
- StaaS must deal with typical failures (eg yearly from Google):
- ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)
- ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)
- ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)
- ~1 network rewiring (rolling ~5% of machines down over 2-day span)
- ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
- ~5 racks go wonky (40-80 machines see 50% packetloss)
- ~8 network maintenances (4 might cause ~30-minute random connectivity losses)
- ~12 router reloads (takes out DNS and external vips for a couple minutes)
- ~3 router failures (have to immediately pull traffic for an hour)
- ~dozens of minor 30-second blips for DNS
- ~1000 individual machine failures
- ~thousands of hard drive failures
- Debugging may be harder because problems not noticed til much later: Data loss never discovered immediately -> Need to reconstruct state days later
- Masking failures still noticeable if using probes of known performance to test system continuously
Tracing in the datacenter:
- SaaS: you're "always" debugging, so tracing must be on in production. So tracing must be effective and economical.
- Challenge: potentially expensive to collect; where and how long to keep trace data; how to query it.
Apr 30: Poster Sessions
The class CS294-23, Software as a Service [taught by David A. Patterson and Armando Fox] will have their student presentations on Wednesday 4/30 from 1:30-4pm in the RAD Lab (465 Soda Hall).
Michael Armbrust and Gunho Lee - Title: Analyzing the Performance and Variability of the EC2 Platform
Abstract: Amazon's Elastic Compute Cloud (EC2) opens new possibilities for systems researchers to run their experiments at scales that were never before possible. The idea of being able to request 1000s of machines for only $100 is very enticing. There are however many valid concerns about running sensitive experiments on unknown hardware under virtualization. For our project we characterize the performance and variability of EC2 and make recommendations to researchers hoping to use it.
Andy Konwinski and Matei Zaharia - Title: An Application Level Analysis of Large Virtualized Datacenter Environments
Abstract: In the Internet datacenter of today, distributed applications manage thousands of commodity machines and deal with node heterogeneity, load balancing, and node failures. This complexity makes data center applications difficult to debug, optimize and monitor. In the internet datacenter of tomorrow, OS level virtualization will be used to address some of these complexities. However, the trade-offs of this added level of abstraction in the data center are not yet well known. We discuss our work on using a heavily instrumented version of the Hadoop MapReduce platform running in one such virtual data center environment: Amazon's Elastic Compute Cloud (EC2). Through application level benchmarking using MapReduce jobs in Hadoop on EC2, we identify several performance considerations of the virtualized datacenter, such as artifacts of the EC2 environment and the serious implications of variance due to virtualized resource sharing on Hadoop performance.
Ryan Waliany - Title: Node Allocation using Amazon's Elastic Computing Cloud (EC2)
Abstract: We analyze Amazon's Computing Cloud (EC2) and question whether differences exist between allocated machines. Primarily, the goal of the project is to characterize the difference between the machines that are allocated. It also questions whether there exist "bad" servers which can potentially lead to stragglers and negatively influence performance. Once a server is characterized, we analyze the effectiveness of a replacement strategy to allocate a set of "optimal" machines. The project then extrapolates the projected performance of using such a strategy.
Jingtao Wang - Title: A Taxonomy of Bugs in Open Source JavaScript Projects
Abstract: With the popularization of AJAX enabled websites in the past a few years, JavaScript has become the dominant client-side language for enabling effective in-browser interactions. In this talk, we present a quantitative study of nearly two hundred bugs in two open source AJAX projects. We find that most bugs are strikingly different in nature from those in traditional programming languages such as C/C++ or Java. We categorize JavaScript bugs into four categories based on the causes and the severity and propose nine programming principles for writing correct and secure JavaScript codes.
Raluca Sauciuc - Title: Testing JavaScript
Abstract: Developing JavaScript for web applications is hard, historical reasons including browser incompatibilities as well as bugs in the implementations. But even assuming an ideal platform, fully implementing the HTML/DOM/CSS specifications, the event model and the higher-order features of the language make writing JavaScript code a difficult task for the typical web developer. Testing the functionality is tedious and corner cases are often overlooked, leading to bugs ranging from UI inconsistencies to security vulnerabilities. In this talk we look at possible ways of overcoming this complexity and automating the testing process. We propose symbolic execution and model checking as two techniques for automated test case generation.
Apr 23: Another Project Checkpoint, Scribed by Kuang Chen
- Schedule
- 4/23 update
- 4/28 no class
- 4/30 talks
- 5/5 paper
- 5/7 reflections
- send talk title/abstract to kattt, sammon@eecs
EC2 Performance
- (Ryan) Summary: when requesting EC2 servers, might get some bad ones. What does bad mean? How to measure? How many to replace?
- New measurements w/ Matei and Michael
- Take aways:
- Cold start writes
- take 3-4 x time
- have significant variance
- Warm start write times comparable between 2.0 and 2.6 ghz machines
- Don't rely on cold start nodes
- seems, if 1st write 5gb - writes of 5gb afterwards considered "warm"
- investigate how repeatable is bad performance?
- Cold start writes
- (Michael)
- Larger shared hosts have higher variance in "fair sharing"
- Coarse granularity resource allocation: ghz of cpu
- In measurementts, we see bimodal bandwidth at 500 and 200 MB/s, indicative of type of machine we're on.
- Detail: in a given availability zone, ~400 machines free?
Javascript Testing
- Concolic Testing for JavaScript
- Currently working on executing all JS paths on a page and enable search over the paths, and see what other paths are possible.
- Problems
- For a call like DOM.write, it's hard to see what the browser actually does with the DOM object -- so ignoring DOM functions for now.
- Assumption that "this" references in JS binds to current object is not safe! In case when events are bubbled up.
- How often does this happen? Probably not often, so ignoring for now
- Javascript bugs
- Last week, summarized 14 principles about JS bugs. Condensed to 9 JS specific catagories.
- 3 most dominant catagories account for 60% of all bugs:
- Incorrect assumptions
- Browser specific behavior
- Behavior of a node in DOM tree
- e.g. relative and absolute positioning between parent and children
- Checked correlation between time-to-fix with type category: found none.
- Next: see correlation between time-to-fix and severity.
Apr 14: Midterm Project Checkpoint Scribed by Jingtao Wang
- EC2 Performance(work-in-progress)
- Wrote some very convenient scripts to upload tasks to different VMs and capturing the data automatically.
- Now running tasks on 20 nodes, will scale the tests to 900 nodes in the end (maximum possible nodes for each user allowed is 1000)
- EC2 Scalability
- 950 maximal nodes tested. Amazon won't allow a user to allocate more than 1000 nodes.
- Only around 450 physical machines were identified when using 950 nodes
- There are 1 - 7 VMs on each physical machine
- It seems that the VM manager doesn't partition IO bandwidth wisely(or fairly), once a VM receives a low bandwidth, it will be screwed for ever.
- Planning to try more configurations on the VMs.
- Planning to figure out some guidelines on whether one should pay more for more VM instances.
- Concolic Testing for JavaScript
- Successfully hacked the JavaScript VM in firefox. Can get all the events from javascript or webpage to the VM.
- The number of tests needed is bigger than expected.
- Defining a long term goal for this project. At the current stage, planning to use coverage rate as the major indicator of effectiveness.
- Suggestions: Use some existing JavaScript Libraries as testing data. Depending on the progress, here are a few recommended test methods. First step, run the tests on some code with known bugs to figure out the detection rate, second step, inject some bugs on purposely and then try to detect them. third step, use this test to detect real bugs in open source projects.
- Suggestions: Have a discussion with the the other JavaScript project in this class, try to leverage some of the knowledge on bugs in open source projects.
- Taxonomy of bugs in open source JavaScript projects
- Grouped existing bugs into four categories (Logical Bugs 6% Behavior Bugs 57% Robustness Bugs 36% Security Bugs %1)
- Summarized Severity of bugs a. Leaking session keys, password hash and private profile b. Freeze the browser c. Unexpected termination of program (Unexpected exception dialog box) d. Memory leaking e. Program not doing what’s expected to do 6 Visual glitches
- Came up with 13 preliminary principles for effective JavaScript programming (the top 3 account for 60% of the bugs in open source projects investigated)
- Todo: Charts of bug distribution based on severity, how long does it take to fix each type of bugs. Cross validate the principles on other JavaScript projects.
- Suggestions: Rephrase some of the principles (e.g. principle 1), isolate JavaScript specific principles from generic programming principles.
- Machine Learning for SaaS (special thanks to Ryan)
- Identify some properties/measurements that can be used to rate machines (good vs. bad) for certain tasks.
- Allocating a set of optimal machines using such kind of distributions for either a short-term process or long-term.
- Now trying to identify what is the most efficient way to allocate machines, or how should we do this?
- The potential ramifications of allocating machines from a server pool with regard to the failure rate, bottlenecks due to virtualization, and network failures and potential techniques that can be used to select an "optimal set" of machines.
Apr 2: Project discussions Scribed by Jingtao Wang
- Concolic Testing for JavaScript
- Working on the source code of Firefox, trying to figure out how to inject testing code to the JavaScript engine
- Hacking Firefox is harder than expected - need to figure out when and how the JavaScript VM is launched, what's the behavior of multiple JavaScript VMs.
- Suggestions: Try to leverage the JavaScript debug interface for firefox. Look at open source projects such as Fireclipse and Firebugs.
- Concerns: the debug interface may not expose all the information needed for Concolic Testing, may still need to hack the firefox source code anyway.
- EC2 testing
- Figuring out how the real time clock in VMs work is important in benchmarking.
- Measurements of the gettimeofday in virtual machines.
- The timer in VM Can achieve second level accuracy, not millisecond level accuracy
- Benchmarking results indicates that a. the accumulative error on time seems to be predictable (the slop is linear), maybe it's possible to create an algorithm to compensate that b. the timer is being calibrated at fixed time intervals.
- Suggestion: there is a document that describes how the time in a VM is calibrated from VMware. The timer calibration behavior might be deterministic (e.g. caliberate when certain API is invoked, after certain time has passed)
- EC2 Scalability
- Ruby scripts with 100 nodes. The goal is to scale to 500 - 1000 nodes
- Images of VM vs Images of hard drives
- The type of testing workloads are very important for benchmarking projects
- Current: Hadoop examples (pi,write big files, other examples in the MapReduce paper etc)
- Future: more realistic jobs that can represent real-world workload. (web indexing in Nutch, waiting for Facebook's Hadoop based database.
- Suggestions: share at least part of the data on the web
- Taxonomy of bugs in open source JavaScript projects
- Bugs from subversion logs in two open source projects (around 190 totally at this time)
- Trying to categorize them into 10 - 15 groups and come up with corresponding principles
- Quite different from existing programming principles for C/C++/Java
- Will cross validate these principles by checking additional open source projects (manual at the beginning, might be semi-automatic depending on the progress)
- create charts about the history and distribution of bugs
- Suggestion: categorize bugs by severity - all bugs are not equal. Try to identify when a bug was reported and when a bug was fixed (may need to look at logs of the bug-tracking software if there is one for any project investigated.
- Machine Learning in SaaS
- Track disk latency (monitoring latency graphs), and then predict latency changes and allocate more machines to provide better disk latency when necessary. When the workload becomes low, deallocate spare machines
- Need to figure out how to restart the running process when adding new machines
- Need to distinguish whether we need to add new machines or replace broken machines.
- Might be interesting to investigate some MapReduce workloads.
Apr 2: Project discussions (Scribe: Sameer Iyengar)
Concolic Testing
- Instrumenting Firefox
- Larger trace than expected because of internal Javascript to run browser
- Suggestion: Possible to debug browser itself?
- Distinguish between JS runtimes
- Suggestion: Possible to look for onLoad handler?
- Suggestion: Contact Rob O'Callahan or Firefox developer list
- Suggestion: Look at Firebug or other tools that use the FF debugging API
EC2 Performance (Gunho/Michael)
- What does time mean on EC2?
- clock: number of CPU ticks
- gettimeofday: supposed to be actual time of day
- Issue: VM not always on the run queue --> time not accurate
- Can trust time on the minute-level, not on second-level
Scaling on EC2 (Matei/Andy)
- Reduce startup time
- Goal: 1000 node run
- Plan to run VM containing Ruckus
- Pushing scaling of Hadoop and measuring using xtrace
- Using Hadoop examples: computing pi, writing large files, sort
Bugs in JS projects
- Identified 10-12 common categories of bugs
- Most common relating to DOM operations
- Some may not be problematic, but some may cause security vulnerabilities
- Suggestion: identify by severity
- Suggestion: more structured bug tracker to eliminate minor bugs or classify severity
- Possibly create a debugging tools to pinpoint problem areas
- Analyze severity and amount of time to fix
- SVN reports when bug was introduced and when it was fixed
- Suggestion: mine bug tracker information to find when bug was discovered.
Using ML to dynamically allocate machines (Ryan)
- Track disk latency over time, predict latency changes and allocate new machines to alleviate issues
- Question: How to restart process when an overloaded cluster is detected. Processes that are already running would need to be restarted.
- Possibly use live VM migration (not supported on EC2)
- Distinguish between cases where a machine must be added vs. replaced
- Hadoop has heuristics to improve performance
Mar 19: Project Discussions
Kurtis's
- Michael/Gunho: EC2 performance metrics
- time of day
- stability
- benchmarks
- CPU (performance of machines they give)
- mbench
- Memory (contention causes problems?)
- mbench
- Disks (contention is well known issue here)
- What are the disks?
- SAN/NAS? Local?
- Warm/cold differences? Caching?
- Bonett? (Michael needs to clean up handwriting)
- Network
- Latency
- Bisection bandwidth between nodes?
- Topology (run tracert)
- 500 mbps locally, 100-200 across datacenter
- CPU (performance of machines they give)
- problems?
- clock lies, how do we accurately measure?
- this seemingly is not a problem for disk
- clock lies, how do we accurately measure?
- questions
- colocation effects, gaming the system
- processor classes
- scalability of ec2?
- mostly a non-issue, as the system is way too large
- cost of virtualization is what we are measuring
- Andy/Matei: Application performance metrics on EC2
- using above work to modify code to run better on EC2
- Focus on hadoop
- instrumented with tracing and telemetry data
- compare Berkeley traces (on our HW) to EC2 traces
- use Michael's work to explain differences
- need to accumulate use cases for hadoop on EC2
- NYT does crunching one 100 nodes
- lawyering startup structures unstructured data with 8 nodes
- optimal parameters for EC2 hadoop setup
- current testbed
- sort
- pi (and other default tests shipped with hadoop)
- scaling of hadoop (>80)
- only yahoo does it. Plus they don't publish
- moderately well known it doesn't scale all that well
- questions
- scaling of hadoop
- Concolic Testing of Javascript
- Add tracing to firefox JS engine
- use Selenium to automate testing
- tracing informs testing, which informs tracing...
- alternate method: fork each testing when a decision is madex
- Trick is the dynamic type, applicable to python as well
- questions:
- proper testing case?
- test common libraries (or at least apps using them)
- Make me feel bad about scribing.
- Properties of javscript bugs
- literature search.
- bugs in other systems, javscript unrepresented
- examples:
- script.ackulous (sic) 110 bugs
- spry 80 bugs
- 70% of total bugs are "special cases", abnormal uses
- 10-15% browser specific features (memory leak, transparancy)
- not logical errors, look and feel errors
- discussion regarding effectiveness of this vs selenium
- literature search.
- Machine learning on EC2 data
- modifying model to alleviate the concerns discovered
- try to optimize resource usage
- how to discover best nodes
- feature set difficult to discover
- have type, time, utility
- have option of staying, or moving
- use monte carlo methods to optimize
- boot vs active
- suggestions that application space is important
Jeremy's
- Michael/Gunho: EC2 performance metrics
- What is EC2, how good is it?
- Professors say it’s not predictable, Thorsten says it’s great.
- Does contention become an issue? People in shared clusters avoid running things on the same servers.
- Benchmarks:
- Is the CPU actually proportional?
- Is there contention for memory? What about l2 caching issues?
- Disk – everyone consistently says this is a significant problem
- What are the disks – SAN / NAS?
- Are they local? Current hypothesis is that it is
- Warm / Cold re-reading on disk is much faster
- The network – running huduup
- Latency
- Bisection bandwidth
- Within a datacenter and across datacenters
- Topology
- Did traceroute, already found that there are 2 datacenters… always goes through 2 routers.
- The clocks on EC2 lie to you due to virtualization
- Suggestions
- 2 models: suggestions for amazon about changes to their model (in addition to suggestions for those using EC2)
- Play around for “fast machines” if you have a long job.
- Differences of large vs small is interesting
- Benchmarks
- NBench – CPU benchmark and memory benchmark
- Stream – credible one from academia
- Patterson: C compiling is generally a good benchmark
- George said that C compiling is very disk intensive
- Patterson: Network card is goinig to cause contention.
- Claim is only way to see effects is to observe degraded performance
- Nbench is seeming to give unreliable results
- Andy/Matei: Application performance metrics on EC2
- Michael does Low level benchmarking. They are looking to do application-level benchmarking.
- They know how Hodoop runs in normal computing clusters
- They have modified Hodoop to trace and log information. They also have telemetry information
- They will do 500 or 1000 node traces.
- They will use Michael’s benchmarking to help explain differences in trace graphs.
- Patterson: People do map-reduce on 5 machines?
- Answer: They have a friend with an 8 node hodoop application.
- NY Times example: spawned 100 nodes on EC2 to do their number crunching.
- How do you set EC2 parameters to optimally work.
- In one test, they saw one machine doing disk writes 5 times slower. How dramatic is that effect?
- They will be looking at jobs like sort, and ones that come with hodoop.
- Tests that do both CPU and disk bounds
- CPU: generate pi
- Disk: write random data
- Tests that do both CPU and disk bounds
- They will rigorously watch at how hodoop scales beyond 80 nodes.
- Hard scaling – same data, 1000x computers
- Weak-scaling – 1000x data with 1000x computers
- Concolic Testing of Javascript
- Symbolic analysis to generate test cases
- Exploring the idea of dynamic languages like javascript
- Not only user inputs but also events
- She is just doing Firefox – the event engines differ which makes things difficult
- Currently just trying to get coverage
- Question:
- How do you generate test cases? The goal is to cover as much of the codebase as possible right?
- Do 1 test at a time, going round the loop.
- The other idea is to randomly fork new processes
- Javascript bugs in common opensource applications
- Lots of previous work on common C bugs in Linux
- How does this contrast with Javascript?
- Currently automated analysis of javascript bugs is underexplored. People have moreso looked at security issues (like server-side attacks)
- Double equals vs single equals are similar.
- Looked at libraries: scriptaculous and spry (from Adobe)
- Looked at the bugs between them. Most of them were not those suggested by the bugs.
- Biggest problem seemed to be when in a specific context (or Dom object) and you are accessing data that is not what you thought it was.
- Different behaviors would lead to memoryleaks in IE, when fine in Firefox.
- Goal is to quantitatively find the common bugs in javascript to allow for fine-grained coding that will directly apply to this. Most of these errors are not logical errors.
- Drag and drop and cursor movement causes many bugs because it changes context.
- Machine learning on EC2 data
- Taking the benchmarks from Michael. They will determine things like “bottlenecks from the disk”.
- Given the benchmarks, how can you make things better.
- If it’s a certain time of day, and you have load, go to a new server.
- Using monte-carlo ES, a form of reinforcement learning.
- Run a benchmark to determine utility, and you have the option to go to another server.
- Inputs are type, time, and utility, and have controls of stay on server, or switch.
- Goal is to maximize utility (pricing will be modeled).
- Question: When will you change it?
- Answer: We can model it as a function of our monitored load
Mar 10: Talk Recap, Project discussions (Scribe: Sameer Iyengar)
Discussion on Ari Steinberg Facebook
- Interesting common architecture patterns over all talks (build on top of something simple)
- Data API is a primary key store
- memcache
- How common is the 25TB memcache solution?
- How long to repopulate cache if data center goes down?
- Can't operate without cache
- What would a similar system look like if it didn't use any memcache?
- Better use of disk
- Use it more like a buffer pool than a cache
- Developer manages the namespace for memcache
- FB runs memcache on separate servers
- Distributing it to their machines to separate workload would be better
- Easier to maintain multiple images
- Does memcache have a master node?
- Keeps track of multiple servers automatically
- Scale argument not convincing
- Google infrastructure tightly engineered, FB not as well designed
- When will the current approach require fundamental rearchitecting?
- Google can't predict data relationships leads to BigTable architecture
- Current FB architecture limits data queries than can exist
- Build specialized architecture for queries we know, disallow the rest
- Might require reconstruction of table indexes to add a new data relationship
- Other systems move new relationships into the background instead of disallowing
- Big idea: how hard is it to add a new data relationship
- FB API doesn't allow arbitrary joins
- What are the steps necessary to make a new index?
- FB started with per-user partitioning, but had to move to a new structure to allow new interactions like "mini-feed"
- Relational databases combine two abstractions: transactions and queries
- Social networking inherently memory intensive
- Far degrees of relationships become increaingly more computationally complex
- How often need to go out many levels of friends?
- Maybe can bound at a certain level
Discussion on Andy Gordon
- Hard to understand interaction relationships in server clusters
- F# was hard to follow
- Static typing can help validate configurations
- Current formalism doesn't allow replication of certain nodes
- Base unit is a VM image
- Seems to be general trend
- Good match for EC2
- Old model: (Salesforce) Track platform and application logic separately
- In Andy's system VM image is opaque
- Base unit is a VM image
- Leverage new language instead of creating libraries
- Analogous to how Pig describes map reduce operations
- Can reason about declarative languages
Project Discussion
- EC2 characterization
- Try to characterize the variance of different machines
- Reverse engineer topology
- Develop recommendations for deployment
- Concolic testing for Javascript code
- Improve code coverage
- Think about specific javascript libraries (Prototype, Scriptaculous, Adobe Air)
- Maximizing performance of EC2 while minimizing machine allocation
Mar 5: Ari Steinberg – Facebook Architecture and Abstractions (Scribe: Kurtis Heimerl)
Talk
- Personal Info
- Engineer at facebook for 2 years
- Stanford alum
- Works on "newsfeed"
- Facebook background
- 66 Million users
- 1.5 Billion page views a day
- ~500 employees
- Armando/Ari discuss AJAX as a way to avoid page views to save computation
- Architecture
- PHP frontend, many servers
- 25 TB Memcached
- 1800 MySQL databases.
- thrift as RPC framework
- Open Sourced now
- Communicates between services
- many other backend services (eg search)
- architecture replicated in many datacenters (numbers given are total)
- Two in bay area, act as one as there is low latency
- East coast centers are autonomous
- DB
- Each user randomly assigned to a "local database"
- A new table is generated by creating a script that just does it
- Getting is simple (this assumed we have id information of what we're looking up)
- Create cache key
- lookup in memcached
- Unserialize if found
- If missed, run a dbget then set in memcached
- Multiget (function in memcached)
- allows you to get a bunch of items in parallel
- dbget what you miss,as before
- allows you to get a bunch of items in parallel
- set
- update db first
- dirty memcached (lazy caching)
- Propagation is west coast first, then bleeds to other centers
- Cookie set on user to force it to use new copy
- East coast visitors (not user) may see old data
- Armando clarifies race condition: visitor dirties cache, which is cleaned. Then the DB is updated, so we have different data in clean memcache and DB
- So DB replication must dirty cache (cache keys sent with replication data)
- Question regarding the rationale for master status of west coast center
- Master/slave is easier to write, as speed of development is a priority for business and technical reasons
- What about west coast failure?
- Lower QoS concerns. Before, the whole site would drop, so slow faulty redundancy is better
- Also, rush was because of the webserver keeling over due to scaling. Cross DC architecture designed in less than a month.
- PHP for data storage
- API for all of the above, per table
- Problems
- Slow development process
- Schema change causes code change, custom wrapper
- Error prone
- More code is harder to cover
- Miss cases
- Small number of people working on a project
- No common code paths, each is unique
- Lots of custom code
- Data tied to user
- Slow development process
- How do we fix these? Two types of data
- Associations - Links to objects
- Objects - Data about an object
- Both - People associated with objects
- Create generic object storage
- FBTypes
- Equivalent to table
- List of attributes (columns)
- Type information
- FBIDs
- Identifiers
- Stored in FBTypes
- FBObjects
- Row in database (sparse, defaults given)
- has an FBID and FBType
- Lookup
- Get DB for FBID
- Lookup FBType from FBID table in DB
- Get list of fields
- Different priority Memcached instances, tag on data to send it there
- Associations
- Way of looking up FBIDs
- Typed [a]symmetric
- Allows for one way associations
- Applications are asymmetric
- Groups are tricky, could scale to destroy DBs. Special cased
- Armando: Why not use join semantics, in some nice way (as these are joins). Punted
- Has a get(id1,id2,association) which can cause problems. Pushed to O(n^2) due to symmetry. I don't buy it.
- FBTypes
- FQL (Facebook query language)
- Queries for application developers
- why?
- Reduce total number of requests
- reduce unnecessary data being tranferred
- Intuitive
- How?
- Get SELECT FROM WHERE
- all queries must be indexable, meaning that you can't just get everything
- Recursion (inner queries) allowed
- Applications can be throttled for eating bandwidth
- Thrashing can be a problem
- Field evaluation can be done in parallel
- Questions
- Malicious Facbeook apps
- Not really malicious, but rather want facebook up. It gets them money.
- Andy discusses the FQL model. There are not "variables" each subquery is autonomous. Will there be an expansion of the language?
- Not likely, the background model doesn't fit any great set of operations
- Driven by product demands, which this fits.
- Smaller changes possible, as consumers demand. One instance is allowing the user to select the database for optimization.
- Another lets you set the DB to be able handle range queries
- Joins are just a bad idea, they don't scale. Gives examples of websites where using them limited their scaling.
- Internal DB APIs for more complicated DB access?
- Yes, hadoop is uses that DB to do large batch jobs.
- Search likewise uses a different API.
- Is not "eating own dog food"
- Custom DBs/APIs or one generic solution.
- Custom is the systems solution.
- Why MySQL vs Hadoop's BigTable implementation
- Built a few internally, not good enough.
- This abstraction is highly simplified, don't need big table.
- If we wanted to massage constraints, it could be nice.
- MySQL is another problem. They are running out of memory.
- used solely as a backing for the hash table. Working to remove it and build a simpler system.
- Should be replaceable though.
- Malicious Facbeook apps
Mar 3: Andy Gordon – Baltic: Service Combinators for Farming Virtual Machines (Scribe: Jeff Tang)
Talk
- Various trends are turning data center management into a programming project
- Business logic is well understood, core tasks of the business
- Operations logic is a hot topic, performing tasks that would be done by human resources, run book automation
- Provisioning - how many instances of each role and where
- Deployment – install disk images, start instances
- Interconnection – connect instances together
- Monitoring – monitor instances for failure, load events
- Evolution – respond to events, change deployment
- Typed approach to operations logic within a functional language (F#)
- Managing operations of service-oriented virtual machines on a single host
- Design, implement, and formulate semantics for a combinatory-based management API
- Assign function types to whole disk images, catch interconnection errors statically
- “A virtual machine is an efficient, isolated duplicate of the realm machine”
- Virtual machine monitors contain one or more VMs (Xen, Virtual Server, VMWare)
- Trend towards automation of operations logic to improve failure and recovery rates, reduce need for actual operator involvement
- Operator errors are the leading causes of errors in web properties
- Want to automate, improve reliability, decrease HR cost
- Component models for server farms
- Md 90s – binaries - Architecture description languages (Darwin, Ponder)
- Late 90s – JAR files – distributed component models (SmartFrog)
- Mid 00s – Machine images – Cloud infrastructure (Google, Microsoft Autopilot)
- Late 00s – VM images – multi-vm virtual appliances (VMware, 3Tera)
- A program to run on a server farm consists of
- A set of executable components (business logic)
- Sets of endpoints imported and exported (interfaces)
- Script for tasks to provision, deploy, interconnect, monitor, and evolve the program (operations logic)
- Server roles are increasingly service-oriented, they import and exported typed endpoints
- Service implements a function accessible by RPC
- WSDL is a common metadata format, based on SOAP
- The Baltic component model
- Set of disk images provided
- URIs and WSDL for services imported and exported
- Typed F# script as the operations logic
- Like EC2, writes to the local disk image are discarded on shutdown
- Baltic server acts as a proxy between services, clients, and VMM
- Baltic Client manages the Baltic Server
- Baltic Server manages the VMM
- Scripting examples consist 3 web services used as the backend for “Adventure Works”. Each one is a disk image and provides a WSDL for the functions it provides
- OrderEntry – checkout
- Payment – authorize payments
- OrderProcessing – fufill the other
- Metadata for disks and I/O
- VM
- Disk
- Inputs
- Outputs
- Import – external services
- Export – url
- VM
- API is generated from the metadata
- Typed interfaces to resources, generated from metadata, are more productive and less error—prone than conventional device-oriented APIs
- Use WSDL to describe a cooperating set of VMs and services provided
- Baltic – construct graphs out of API calls in F#
- Endpoints are (a, b), input a, output b
- A service is a tuple of endpoints
- When you call a function, it will boot the corresponding VM, takes the endpoints and returns the endpoint it exports
- Ideally the OS can be anything, but slight modifications are required to make an RPC call to the VM to learn necessary information
- Exporting or importing an endpoint creates a forwarder to a physical server
- Addressing details (URLs) are hidden by the API
- Create intermediary points for load balancing (type: Or)
- Intermediary multiplexes requests (round robin)
- Utilize multiple processors by creating multiple instances
- Ref nodes can be implemented as an intermediary for dynamic update
- Handlers are called in the event that a crash occurs
- In the event of a crash, the ref node will change to a working VM
- Replicate VMs
- Formal semantics
- Assume models for each disk image and imported endpoint
- F# expressions represent behavior abstractly
- Can be written by hand
- F# program based on API in a concurrent, partition lambda calculus
- Executed symbolically for debugging
- Enables static analyses as type checking
- Type safety – allows finding connection errors, typical operations logic error
- No migration since there’s only 1 physical machine
- Device combinators
- Existing API for VMMs manipulate a wiring graph
- Each node is a device (disk, machine, network adaptor)
- Each edge is a virtual wiring
- Hypothesis: Service-oriented API and fine-grained static typing is feasible and more productive
- Combinators – manipulated call graph
- Virtualization technology makes possible the automatic assembly of big components like DBs and OSes into systems
- API like Baltic is the way to exploit this possibility in a typeful and declarative style
Feb 27: Project Ideas Discussion (Scribe: Jeff Tang)
- Evaluation of usability on Hadoop on EC2
- Measuring performance of EC2 to see if it’s a useful tool for research
- What should I know about EC2?
- What should I expect?
- What can I quantify on EC2?
- Measure variability & Speculate about good & bad ways to deploy SaaS on EC2/S3
- Good and bad things to do when deploying SaaS
- Experiences as a developer in terms of variability
- How do you design applications given variability
- What would you do differently if there was no variability
- What characteristics of hardware as a service do we want?
- Amazon is the first data point
- Joyent is a second data point
- Google isn’t public
- Sun service (Caroline) is too small scale
- Gridhosting providers (MediaTemple)
- Comparing Joyent vs EC2
- Find consistent problems across hardware as a service providers
- Compare “buy vs. build” (EC2/S3)
- How does black box vs gray box matter for SaaS?
- What API calls would you want to build to hook into the system?
- Reverse engineering Amazon EC2/S3
- How does black box vs gray box matter for SaaS?
- Extend Concolic testing for client SaaS apps (Javascript)
- Better testing
- Discover bugs on client
- Automatic test generation for code coverage
- Selenium testing framework
- Where does the time go in “booting” a web 2.0 app
- Time to launch application
- What’s the bottleneck?
- Rendering complexity
- Concurrent network connecitons
- Javascript interpretation
- People reuse old javascript libraries
- Track changes of a standard library
- Do people update their libraries?
- Copy and paste of inefficient Javascript and bugs
- Use cheating programs to detect copy and paste
- Browsers have different Javascript security models
- AJAXScope, Greasemonkey, Chickenfoot
- Visualizing information on the webpage on the fly
- Machine learning & SaaS
- Load balancing on EC2, allocating servers on the fly
- Automatically dynamically minimize cost while meeting SLA using SML
- Talk to Peter Bodik, Charles Sutton
- Application development frameworks to express service level parallelism (John Osterkhout, Will Sobel)
- What concepts do you need to be able to express?
- What current application frameworks allow this?
- Can we extend existing frameworks to add this functionality?
- Web is instant deployment enviroment
- Is traditional software better at updating libraries at SaaS
- Push vs pull update model
- Do SaaS providers update their libraries?
- What is the time for adoption?
- Empower the browser to detect old libraries and replace broken code
- If it isn’t broke, don’t touch it
- Only use stable libraries that don’t change
- Everyone always has the latest version of SaaS, avoid DLL Hell
- Expectation of new features from users, less time for testing
- Number of different browser platforms vs number of different operating systems
- Brower statistics
- Firefox 37%
- IE7 21%
- IE6 32%
- Safari 2%
- IE5 2%
- Libraries exist for javascript to code around different browser issues
- Update libraries to provide higher efficiency, better user experience, security fixes
- Side effects of upgrading libraries in SaaS vs traditional shrinkwrap
- Brower statistics
- Andrew Fikes - BigTable
- Google copes with MySQL deficiencies to create scability
- Internal version of SaaS provided by core team
- Lots of user complaints
- Very loose internal structure
- Continuous service testing
- Detailed logging of users to determine issues
- Unit test of SaaS
- Can’t unit test application hotspots
- Write unit tests of traces in SaaS execution
- Live time performance over correctness
Feb 25: Bob Felderman, Google Platforms Networking (Scriber: Andy Konwinski)
Talk
- about this talk / the presenter
- has been at google for almost 2 years
- he's stealing slides adam bechtel (yahoo) and chuck thacker (microsoft) since google doesn't talk much about their datacenters
- A gazillion machines...
- is (probably) much more than a million
- Template slide on the evolution of the machines in their datacenters
- You should care because
- datacenter in a box (think ibm blue gene) is coming SOON
- General SaaS design philosphy
- even Symmetric Multiprocessor machines require software level redundency, etc., so why not code for it throughout and buy commodity hw
- Getting back to Networking
- getting gigabit ethernet to...
- a single server: ~$6
- a server rack (40 port switch): ~$50/port
- a cluster (thousands): priceless
- It sucks to be forced to make programmers understand and code for locality
- but getting all servers (n to n) connected with gigabit is really really hard
- getting gigabit ethernet to...
- it's not just hardware
- a polished toolbox is crucial
- on web search touches a 1000+ machines
- an interesting visualization of a map-reduce-like job! (looks very familiar :)
- they notice some tasks taking longer than nearly all the others (by some multiple of 50ms)
- this looks a lot like a tcp timeout, and this turns out to be true
- this is the fan-out/in problem that too many packets come back form the fan out for them all to be buffered...
- low variance response time
- response times are large
- ...
- things to think about
- a hodge-podge mix of a diverse apps running in the same cloud
- layer2 (ethernet w/ vlans, not thinking much about iprouting) vs layer3 (or iprouting)
- layer2 for a small broadcast domains (like within a rack?)
- layer3 is very valuable at higher levels (across smaller domains) (avoid "broadcast storms")
- QOS and flow control
- this might be well explored, but IMPLEMENTING it is tricky nonetheless
- LOTS of control knobs, like...
- having app teams change behavior of app
- tcp level (windowing policies)
- traffic shaping in linux
- nics have their own traffic shaping
- PAUSE frames (feedback loop for slowing/pausing specific tcp connections)
- Can we just use the solutions that the HPC community has provided for these same datacenter problems?
- maybe the problems they have solved are not our problems??
- a lot of the HPC supercomputers are focused on smaller numbers of larger codes
- they focus on low latency (but this is not as important)
- blah blah blah, chuck thacker is sooo cool so let's repeat his talk... again (boo, hiss), the Sun black box is sooo cool, blah blah blah
- Trends from "a Web Company" (yahoo)
- bandwidth growing faster than moores law (doubles every 12 mo)
- NAS vs San
- metro's are ethernet
- ...
- big picture architecture: (from yahoo)
- backbone
- metros
- data centers
- metros
- backbone
- Beyond 10GigE (Still yahoo)
- Can't we just use Link AGgregation (LAG) to make several 10GigE connections look like a 100gigE connection
- power of 2 problem
- how about same idea at layer3, i.e. Equal Cost Multi Path (ECMP)
- finite FIB space on the chips (filling "router tables")
- where we're goin: bigger pipes:
- he (i.e. yahoo) uses something different than hypercube, but similar, rat's nest looking thing, and then another, even more convoluted rat's nest, but what we really want is...
- fewer, bigger pipes between l3 switches (at least 80GigE)
- "We are all doing the same thing (GOOG, YHOO, MSFT, EBAY, ...)
- Big & getting bigger
- Fast & getting faster
- Many (diverse) & getting manyier
- Radical changes in network API not feasible (too much "legacy code" using current API's), even if we go to something new, we need backwards compatibility.
- There are 'no answers, i.e. the problem is unsolved! "We're all fumbling in the dark, trying to figure out the best way to do this."
Q&A
- Q: are there any hw solutions, e.g. FPGAs, fiber
- A: thacker's hypercubes were made of FPGAs. A more open core/high-end router (zorb ?) project, maybe even wavelength division multiplexing
- Q: go to udp?
- A: QOS bits in the network protocol (or at the application level)?
- Q: what about inter-datacenter comm (speeds)?
- A: they do replication between datacenters enough that they own LOTS of fiber, for inter-datacenter networking, they have many hundreds of gigabits per second and growing as fast as possible. But, when they get into the real, capital "I" Internet, it now makes sense to go to Cisco.
- Q: don't you use some network virualization layer? or do you actually have c code using sockets straight up?
- A: in short, yes they use abstrations (i.e. libraries) and they're thinking about it, but it is still a hard problem
- Semantics in RPC library (when things wake up or don't, etc) or other libraries. There are, sadly, semantic remnants (bugs in the virtualization semantics) which means that things are not actually properly abstracted, they have a team currently working on it "we're going to architect a new RPC layer".
- A: in short, yes they use abstrations (i.e. libraries) and they're thinking about it, but it is still a hard problem
Feb 20: Andrew Fikes with Google on BigTable (Scriber: Andy Konwinski)
Slides
Image:Andrew-Fikes Google Big-table slides.pdf
Background on BigTable
- they use mapreduce, multiple petabytes
- now the problem is shifting as they move towards applications (google docs, orkut)
- green engineers are a problem
- running software as a service:
- big-table, 2 years now
- efficient use of shared resources is very important to them
- "not as good as they like" with utilization - boo! lame/vague language
- educating users about how to use the services inside
- hardware philosophy: "truckloads of low-cost machines"
- coark-board machines
- no special machines (just 2-3 flavors)
- current hw: 8GB+, multichip/core
- all machines 1 Gig ethernet
- typical failure for new cluster:
- .5 overheating
- 1 pdu failure (~500-1000 machines disappear)
- 20 rakc failures
- 8 network maintenances
- machine sharing
- problems from resource usage bleeding are hardest to debug
Big Table Details
- Row & col abstraction for storing data
- Engineers love it because:
- create a cluster in < 1min
- scales to Order 1000+ nodes
- millions ops/sec
- terabytes of data in memory, petabytes on disk
- does data to load balancing (e.g. handles partitioning keyspace automatically)
- node addition, failure, removal handled automatically
- user facing apps use it
- tablets
- chunks of tables on row boundaries
- typically 500M-1GB
- tablet size is important (affects # files in sys, # servers touched, load on one machine)
- splits/merged done auto based on size
- Locality groups: data also split along columns
- column families live together within a tablet
- each local group separate group of files
- usage
- 500+ bigtable cells
- largest cell has 68 internal numbers
- 10-12 full admins maintaining big table service
- hard problems:
- isolation
- capacity planning
- software upgrades
- common problems/hiccups
- cross-user isolation (usually file layer)
- same user isolation (competing jobs)
- loss/corruption of file data chunks
- corrupt memory
- unplanned growth
- network saturation
- big table bugs
- Service Isolation
- 1) priorities, user mark with 1 of 3 priorities (e.g. mr default to lowest)
- 2) scheduling, ops have a cost (CPU+network+...)
- 3) partitions, new direction (exclusive use of resources, but highly dynamic)
- Service/Software Upgrades
- clients advised to test against nightly builds
- regression tests
- cells earmarked as "first-upgrade"
- staged rollout
- Debugging bigtable
- hard because tablets move around, load is temporal
- customers give *terrible* information
- data loss never discovered immediately, so state needs to be reconstructed later: info/logs squirreld away all over!
- Monitoring
- servers export monitoring interface
- data is aggregated, and visualized
- Probing
- Inject requests with known performance
- gmail doesn't use bigtable
- some apps that don't fit well are those that are mysql-centric, with complex inter-table relationships, no good support for adhoc queries
Feb 13: "Chris Olston" (Simon Tan)
- It's all about DATA
- A lot of work at Yahoo! is about data
- Extracting structured data
- Understanding complex large-scale data
- Example: Detecting faces
- Function detectFace() over n images, where n is very large
- Example: Studying web usage
- You have a web crawl and click log
- Need to find sessions that end with the "best" page
- Existing work
- Parallel architectures
- Cluster computing
- Multi-core processors
- Data-parallel software
- Parallel DBMS
- Map-Reduce, Dryad
- Data-parallel languages
- SQL
- NESL
- Parallel architectures
- The Pig Project
- Good with distributed filesystems
- A data-parallel language: "Pig Latin"
- Combines relational data manipulation primitives
- Filter, join, etc
- Imperative programming style, not declarative
- Easy to plug-in custom code to customize processing
- Combines relational data manipulation primitives
- Data-parallel software: "Pig"
- Cross-platform optimizations
- The Language
- I = load '/mydata/images' using ImageParser() as (id,image);
- load: What is the structure of the data we need?
- using: How do I interpret this file/directory? Are there delimiters?
- as: What do I do with it?
- F = foreach I generate id, detectFaces(image);
- foreach: for each item, do something
- generate: Output the probability that it is a face
- store F into '/mydata/faces'
- F: Store this back into the filesystem
- I = load '/mydata/images' using ImageParser() as (id,image);
- Notion of Sessions
- Big log of visits in triples (who, where, when)
- i.e. (Amy, www.cnn.com, 8:00)
- Quality of pages in pairs (url, pagerank)
- i.e. (www.cnn.com, 0.9)
- Solution: Distribute Pages into many disks, same with Visits
- Associate the two data sets, i.e. with JOIN
- Allocate a thousand compute nodes
- Combine Visits set with Pages set, according to url, send to nodes
- "Amy's" records could get scattered
- But we can join again and send to new nodes
- Local processing after retrieval from the data store
- Pig Latin solution:
- CODE:
- Associate the two data sets, i.e. with JOIN
- Big log of visits in triples (who, where, when)
Visits = load '/data/visits' as (user,url,time); Visits = foreach Visits generate user, Canonicalize(url), time; Pages = load '/data/pages' as (url, pagerank); VP = join Visits by url, Pages by url; UserVisits = group VP by user; Sessions = foreach UserVisits generate flatten(FindSessions(*)); HappyEndings = filter Sessions by BestIsLast(*); store HappyEndings into j'/data/happy_endings';
- Each Session has some number of clicks
- BestIsLast(*) is a boolean function
- We've basically built up a simple language that captures the top 10 things people want to do with data
- Easy to code up from high-level primitives (easy for the users)
- Easy to parallelize the execution (easy for the system)
- Operators include:
- FILTER
- FOREACH ... GENERATE
- GROUP
- Binary operators include:
- JOIN
- COGROUP
- UNION
- How it compares to SQL
- The user doesn't have to fight the query optimizer
- SQL is only as good as the Query Optimizer
- SQL is declarative (what, not how)
- SQL bundles many aspects into one statement
- Pig Latin is a sequence of simple steps that each do one thing
- Makes this closer to imperative programming
- Semantic order of operations is obvious
- Incremental construction
- Debug by viewing intermediate results
- Jasmine Novak, Engineer at Yahoo!: "I much prefer writing in Pig [Latin] versus SQL."
- May not apply to everyone; programmers tend to want to just program
- The user doesn't have to fight the query optimizer
- How it compares to Map-Reduce
- Map-reduce welds together three primitives
- Process Records > Create groups > Process Groups
- Sometimes people want Map-Reduce, without the Reduce
- People doing JOINs in Map-Reduce; do them differently, have to think about query optimization with varying data sizes
- People want SORT, but others don't
- Map-reduce welds together three primitives
- In the end, Pig Latin is a more natural programming model
- Plus, optimization is given to you
- SORT - you can parameterize it
- Can support an epsilon sort
- Have a way to group or join by multiple values (matching on any of several fields is okay)
- People can extend the language, make macros for the language, etc.
- How does it compare to PLINK at Microsoft?
- Embedding it is harder with PigLatin
- Microsoft is more focused on SQL-related programming
- Pig Latin is closer to Dryad's model
- Like to use Map-Reduce as a backend (Hadoop)
- User can choose to write SQL/Pig/Map-Reduce, which points to the same cluster
- SQL -> Pig -> Map-Reduce -> the data cluster, where Pig automatically rewrites and optimizes
- This stack is almost like an entire RDBMS
- Pig is open-source!
- Is Pig a DBMS?
| DBMS | PIG | |
|---|---|---|
| Workload | Bulk and random reads & writes; indexes, transactions | Bulk reads & writes only |
| Data representation | System controls data format; must declare schema | Pigs eat anything |
| Programming style | A system of constraints | Sequence of steps |
| Customizable processing | Custom functions second-class to logic expressions | Easy to incorporate custom functions |
- Ways to run Pig
- Interactive Shell
- Script file
- Embed it in a host language like Java with JDBC
- Useful if you need language constructs (like loops) that aren't available
- Few people use this; would rather make Pig Latin scripts
- Work with PageRank, for example, may need this
- SOON: Graphical editor
- Pig Pen Environment
- Want a data catalog to see what data is available
- Ned a library of processing elements
- Boxes-and-arrows dataflow
- Automatic example data
- Boxes-And-Arrows
- A graphical representation (a flowchart) representing a Pig Latin program
- An example of a tool that shows you the semantics of your program before you run it on real data
- The flowchart is augmented with sample data at each phase of the processing
- Helps people understand the movement of the data
- Generating Example Data
- Would like it to at least look like real data
- Concise
- Completeness (no empty boxes)
- Challenges
- Large original data
- Selective operators
- Noninvertible operators (user-code that is not easily undone)
- The Pig System
- Parser -> Pig Compiler -> MR Compiler -> Map-Reduce -> Cluster
- Would like to work on optimizing at the Compiler level
- Redundant work
- Small number of popular tables (i.e. web crawl, search logs)
- Most of this processing is I/O bound, not CPU bound; just counting
- Popular transformations (i.e. eliminate spam, group pages by host, join web crawl with search log)
- Our goal: minimize redundant work
- Work-Sharing Techniques
- Execute similar jobs together
- Cache data transformations, if they are used often (materialized views)
- Cache data moves - if joins between A and B happen often, they should be stored together
- Executing Similar Jobs Together
- In a job queue, what is the optimal queue ordering policy? (Optimizing throughput)
- If we have two jobs, Job 1 taking half the time of Job 2
- Can schedule Job 1, then Job 2
- If we have two jobs, Job 1 taking half the time of Job 2
- Record frequencies (λ1, λ2, etc. ) of jobs
- If λ1 >> λ2, schedule job 1 first?
- If λ1 = λ2, tradeoff between sharing opportunities and getting jobs done
- You may not know the length of the jobs in the first place
- In a job queue, what is the optimal queue ordering policy? (Optimizing throughput)
- Caching Data Transformations
- What do you cache? Operations 1, 2, and 3 in sequence
- Cache Op2 output?
- Cache Op3 output?
- Cache both?
- "Selecting materialized views"
- Each operation has a cost, and a selectivity
- Model it as a search problem
- Our approach says: It is very difficult to get perfect information
- Sometimes machines will be down, network saturated, etc
- Do something more adaptive
- Could cache subsets of data
- If you have 1000 choices, you can try sampling 1% of each one for a "diversified portfolio"
- Can do adaptation - once you know how much time you're saving for each set of cached data
- Models can be very brittle
- What do you cache? Operations 1, 2, and 3 in sequence
Question and Answer Section
- Used at Yahoo! - Has a few 100s of users; not as wide-used as Map-Reduce
- A lot of startups apparently are using Pig
- Looking to do more collaboration with outside people; model is gaining momentum
- Currently, people are manually doing these things (i.e. optimization, caching)
- Would like to automate this
Feb 11: "Charles Gordon" (Simon Tan)
- Charles Gordon, with Amazon since Jan. 2002 - "Search Inside the Book"
- Moved to IMDb in 2006
- About IMDb (since 1989), one of the oldest website on the web
- Serves Movie data
- Started as a way for people on newsgroups to track hot actresses
- Correct and up-to-date information on movies
- Have added voting, community features, etc.
- Integrated with Amazon to sell DVDs
- Share data centers, but not technology/software stacks
- Export data to Amazon to use
- Uses S3/EC2/SQS, but not Dynamo or SimpleDB
- Browse/Search Data, localized information (movie theatres)
- A Pro Site for Industry Professionals, i.e. for fledgling actors
- Has an XML Web Services, licensed or for free
- IMDb's Data
- Where does it come from?
- User submissions
- External data feeds
- Wiki
- Editors weed out questionable materials, sometimes filter out with machines
- Stop the spam
- Not user-oriented (i.e. does not revolve around userIDs)
- Data is many many JOINs, making partitioning difficult
- Much hierarchy in data (seasons, episodes, etc.)
- Where does it come from?
- IMDb's traffic
- Around 100 million requests per day
- Thousands of hits per day
- How we store the data
- List -style (flat text file) organizations
- In the Beginning
- Just plain list files
- Mostly Static content
- Some CGI
- First "Search" - use a grep across all the files
- Business Requirements/Desires
- Less competitive on data quality
- Focus on features
- Looking for a flexible, fast data organization
- Near real-time updates for all data
- High availability for writes from the website
- Some data can wait a few minutes to be updated (i.e. 10 minutes)
- Need option for read-after-write for some writes
- User Registration
- User voting
- If page view go down, we know latency is increasing
- Users are impatient, will leave
- Data Master
- Has, in list format, all the data
- Daily Build
- Look at all the front-end lists, pull out in templates
- Compile a list of de-normalized views
- 10s of GB goes to 100s of GBs
- MySQL cluster
- Hate it
- Keeps a days worth of voting
- User Registration
- Rapid Mirror Masters
- Pushes data in XML
- Submissions
- A server that takes in e-mails formatted in special ways
- Like 40000 scripts managing submissions
- Problesm with Data Master
- Flat text file - historical
- No schema - got to use sed/awk
- Keys are split
- Not that many problems with the filesystem
- Daily rebuild, only when 0.5% of the data changes
- Pushes everything over the network
- Automatic generation of view can go haywire
- When templates are messed up, data can blow to terabytes
- MySQL architecture
- Nothing but trouble
- Availability is fairly low (92-93%)
- Replication has bugs
- Master / Slaves not working together
- Sometimes Slaves are ahead of the Master
- A bottleneck
- A single point of failure
- Technical Wants
- Need ad-hoc querying on normalized data
- Need de-normalized views for templates
- High traffic
- Need a good query language for everything
- Simply scalable
- Low maintenance (we only have ~12 developers)
- Remote serving capabilities
- Redundancy
- Why not caching? Yes!
- LAMP stack is not good - harmful
- You end up with a denormalized system
- Requires partitioning
- Availability is low
- Does not degrade gracefully
- RDBMSs are overly general
- Databases are trying to so many things - nothing you need
- Shared resource requires planning
- IMDb doesn't want to invest in more DBAs
- No quota enforcement
- Rapid Mirror everything
- Can't get read-after-write
- Space on web servers at a premium
- No plan for storing normalized data
- Need a way to store writes
- Difficulty: Trading availability for performance in the face of network difficulties
- We don't want 502s, we don't want some parts of the page loading but others not
- Hybrid Approach
- Inserts onto the Master, copied to standby
- Using triggers
- Not obvious how obvious you make your triggers
- More granular than normalized data, less granular than normalized views
- Normalized view on the backend
- Backend database has a queue, with an API
- Worker process reads events from the queue
- Concurrency issues
- Solution is to sequentially read the queue
- Want to poll some of the data to the web servers
- Concept: Use a version control system
- Software that can merge de-normalized data
- Still need a way to scale writes
Question and Answer Section
- Triggers happen in order
- Have a history table in the database, so we can do undos
- Makes inserts slower, code more complex
- EC2/Virtualization
- Does I/O terribly
- Moved all systems to no I/O - all memory
- Everything served by RAM
- One problem users experience ("wavering reads")
- A user updates
- When refreshing, user sometimes sees old value, sometimes new
- Result of the various inconsistencies across mirrors
- Agrees with the EC2 philosophy
- If you can solve my problem by giving me a new server, so be it
- "Share-nothing" is ideal
- Costs of cabling, cooling, etc. a detriment
- Hardest thing about building an architecture is teaching your engineers about it
- Want a language to specify the views
- Not the content, because then you'd need an API for your templates
- Data should not be in view
- Want to signify that one set of data is more important than others
- Glue we really need - code to auto-generate those triggers
- Would rather specify at the level of triggers
- Figures out how things will change in the database, executes
- Amazon had a DataStore (??) solution that never made out of design phase
Feb 6: "Class discussion of last 4 speakers" (Bryce Lee)
- Administrivia
- Laptop use (other than the scribe) is no longer allowed in class
- Chuck Thacker discussion – what data centers should look like and how would I design one
- Comment: Talk has element of “let’s do something we’ve done 20 years ago”. Didn’t understand reason FPGA. Why not use a pure software router?
- FPGA is a lot faster. There is a loss from full hardware to FPGA in performance, but not as great as going to pure software.
- Widely used technology, anything in the last 5 years is FPA based.
- Cisco has a model for low-end competitive hardware and one for high end with lots of features. Some customers want it but not all
- Data center routers don’t need the same functionality of gateway routers
- Thousands of dollars per port
- Google complains about Cisco routers, wishing there was an alternative. Since they design their own PCs, probably designing their own routers to meet price per scale.
- Still debatable that this datacenter design would meet needs for more than a handful of companies
- Why isn’t Cisco doing this?
- Cisco doesn’t feel the high-end pressure. How big is the market?
- Google and Microsoft trying to drive out all the profitability out of the product.
- Others entering the market impeded by the number of customers in the market.
- Google has kept companies in business
- ForceTen – high performance routing kept in business with orders.
- Looking inside FPGA, most functionality is for routing. High speed links on the outside+moore’s law.
- Comment: don’t know if the mobility has to be strapped down to proposed dimensions – but Internet is a limited, scarce connection in some locations.
- The containers allows for incremental buildup rather than build up of an entire data center.
- Container is a modular building block.
- Self-contained nature takes away a lot of the prefabrication assembly effort. When buying from Sun, it will just show up
- Like building RV datacenters
- People at Microsoft feel like they’re wasting money building datacenters like they do today. Want to be able to deploy the latest in cooling at the same time.
- What are the downsides?
- Datacenter people are relatively conservative in the company. Not the first on the block to try a new idea.
- We think of them as machine rooms, other people think of them as construction locations. Is it a building or a place for computers
- What about security applications? Can be shipped away very easily
- Price point at a half million dollars maybe. Probably value based pricing
- Why isn’t the uptake as high right now?
- Sun’s not going away. However, production guarantee is interesting.
- Do Microsoft and Google really need JIT provisioning?
- Black box simpler for deployment
- Black box has standard racks. Allows people to put their own equipment in their
- If you’re a computing company, you’d rather not have to worry about the cooling
- What does Amazon use to power S3 and EC2?
- A few years ago it was a standard setup
- Amazon is pretty secretive – since lots of competition, they will probably not tell people what they do
- Comment: Talk has element of “let’s do something we’ve done 20 years ago”. Didn’t understand reason FPGA. Why not use a pure software router?
- Thorston von Eicken
- How is EC2 competitive
- Can scale dynamically, but if you are just a small company you wouldn’t go there.
- It’s fungible to handle variable load
- He made claimed that if used straight for a year, still pretty cheap
- Problem is matching current demand
- In this model, can scale up as long as you get money
- You can get cheaper services, but there’s managed costs instead
- Also data replication
- He tries to show alternative view – infinite machine at hand
- Unit per month
- How much value do you get from handling the burst?
- The advantage of automation through a web interface
- The red question: why does this feel like a new thing?
- You can get as many machines without talking to any human beings
- Pay as you go with fine granularity
- Very large scale – ask and provide with no questions
- Must think about software differently in this environment
- No idea of a ticket – abandon things instead of trying to fix it
- In order for this to work, must not worry about limits of scalability
- Opens up possibilities such as staged rollouts
- Amazon solves Colocation problem of automation and granularity. Collocation sites don’t want you to scale down
- Some colocations going this way (softlayer)
- A lot of other people have tried and failed at utility computing such as HP and IBM
- Amazon seems to be the first to gather real customers
- Built out of their own pains
- Popularity not just within startups – like virtual machines, everyone is finding a use for it.
- Why isn’t EC2 making money?
- Some believe EC2 was only created so Amazon could handle the holiday spike. However, the services are probably ran on different servers
- Amazon already had to expose API for 1000 merchants to run on their platform. Amazon thinking of long term bet.
- EC2 is different
- Really cheap, infinite scalability, no guarantees
- Security
- Medical companies pay $100 per GB per year to secure data
- Multiple keys/ threshold/joins
- Computer clouds could offer overflow for financial companies, but these companies are too paranoid to accept idea
- Malicious activity handled either with a human monitor or passively through complaints
- Availability similar to Google
- Redundancy provided on top of unreliable stuff
- Hope that enough people use it to cover the costs
- Right now pried in way that it is working.
- May move to a pricing which charges more during different times of the day to help calm concentrated data
- Thorston’s model – if something’s wrong, get another
- Amazon does not make any guarantees, but how often do you have a problem
- Company is fanatical about security, not going to lose credit card information. Therefore will not make false guarantees
- Building up an entire community around software increases switching costs
- What do you do when you want something more sophisticated with the data?
- Going to be a balance
- What are the licensing issues with it?
- Opportunity for standardization of instruction set, open source
- Competitors in UK with a SAN for storage.
- How is EC2 competitive
Feb 4: "Thorsten von Eicken - The Future of Software: In the Cloud" (Bryce Lee)
- Biography
- Did a PhD on active messages with David Culler as his advisor
- Afterwards, joined Cornell University and earned tenure there.
- Started a company in Santa Barbara which was acquired in 2004
- Taught a course at UCSB built around developing web apps with EC2, S3
- Software in the cloud: Software delivered as a service and deployed with EC2
- Cloud computing: endless computers on demand, commodity priced
- Pay-as-you-go pricing model
- Do not have choice of location (“somewhere on the internet”) and not much control.
- Amazon pay-based web service
- EC2
- comes in 3 flavors
- Small($876/year): 1.75GHz, 2GB RAM, one spindle 160 GB storage, ¼Gb Ethernet
- ½ core in reality
- Large($3300/year): dual core, 8GB RAM, 2 spindle 420 GB storage, 16GB RAM, Gb Ethernet
- An interesting configuration for an I/O database
- Small($876/year): 1.75GHz, 2GB RAM, one spindle 160 GB storage, ¼Gb Ethernet
- Extra Large($6600/year): Large with doubled specifications
- Everything is virtualized using virtual Zen drivers
- Amazon doesn’t say much about exact specifications to prevent people from optimizing for particular configurations
- For example, ½ core may move to 1/3 core
- Promotes attitude of not caring, more power should just lead to ordering more hardware.
- comes in 3 flavors
- S3 – simple storage
- Not integrated with EC2
- Very simple database, more like a hash table
- Pay by volume of transfers
- Software cannot run on S3, interact with HTTP Put/Get interface
- Simply a bucket hash table. Can list contents and put object keys with 5GB of data.
- Good place to store stuff and backups, but not designed as a file system.
- Von Eicken’s discovery of S3 and EC2
- Got preview from Amazon, jumped on the public beta
- Taught class with students who wrote Ruby on Rails application on top of services
- Would never go back to data centers
- Q&A
- Q: Is Amazon cutting into other departments’ budgets to subsidize the service?
- Amazon doesn’t do anything for a loss.
- Q: What type of reliability guarantees is there? How many 9s? Can people call someone if there is a problem?
- No. Services are built on a model of using the lowest, cheapest motherboards – thousands by the rack and minimal human labor
- Q: Are these the same machines Amazon uses for their retail services?
- Amazon uses EC2 internally, but it’s unknown if they share the same public servers.
- Q: Is Amazon cutting into other departments’ budgets to subsidize the service?
- Business Standpoint
- No upfront costs. Good for campaigns and services whose costs fluctuate seasonally.
- There’s always a battle between demand and resources
- EC2 allows you to dynamically scale on demand
- Fluctuations can be seen on a daily basis with diurnal peak cycles
- Also aids testing with easy, quick deployment of resources
- Case study: Animoto
- Creates videos from photos and music
- Ran by two guys using 50 servers on EC2 with no system administrator
- Two front-end web servers with load balancing going through Rails, upload servers through S3, and other server analyze and push data back to S3. Last set of servers bring everything together
- No hardware load balancing in EC2.
- Services can be partitioned by user if there is a need to build out
- Case study: TC3 story
- Processes health care claims for fraud
- Daily processing
- Reprocesses last year’s claims from competitors’ customers to show savings.
- Would take 1000 servers to processes 20 million claims in 24-48 hours (hardware would cost $1 million).
- Done at most once a week.
- With EC2, 1000 servers over a few days cost $3000
- No upfront costs. Good for campaigns and services whose costs fluctuate seasonally.
- Q&A
- Q: What about sensitive information? Do you scrub the disks?
- Doesn’t really matter, Amazon doesn’t provide any audits or security validations.
- Cannot upload data that falls under HEPA to EC2.
- However, for TC3, the only identification was at the top of forms, which was sanitized before moved through.
- Amazon actually scrubs the disks
- Q: What happens if you are at peak load all the time; is there still cost savings?
- Once you factor in all the machines, power, case, cooling, and facility, it is.
- Just hiring people and the startup time, along with expertise, commitments, and leases, is a massive headache.
- Q: Has Amazon said how many CPUs they have?
- No
- What if you want 10,000 Machines?
- Doesn’t think they would have a problem, but only has experience with up to 1,000
- Do they charge by the hour based on machine or clock?
- Charged by the wall hour. You get real core, but no virtual machine
- Q: What about sensitive information? Do you scrub the disks?
- The pain of EC2 and S3– how do I.. question
- Systems are vanilla boxes – everything goes down with hardware failure.
- While must design database services around failure possibility, even the best hardware fails sometimes.
- No static IP addresses
- No load balancing or built in scaling tools
- Can get new boxes quickly, but still need to configure for service.
- Without automation, cost savings lost to necessary system administrator to do install and setup.
- S3 Consistency issue
- Store something in S3, get something else back immediately
- Only eventually consistent
- Systems are vanilla boxes – everything goes down with hardware failure.
- Cloud computing mind set: disposable computers
- Design for failure, assume machines will fail and services will need to cope with it
- Need to automate, taking things from boot to production on autopilot
- A lot need to take advantage of EC2.
- S3 SLA states 99.9% of requests are answered. EC2 SLA: 99.9% of ? are ? – big questions.
- Speaker wants the guarantee to be that he can get one more machine 99.9% of the time. Can solve a lot of issues
- Mentality should be move to another server, then diagnose problem and complaining to Amazon.
- Design for failure, assume machines will fail and services will need to cope with it
- The cost equation
- A server costs $876, programmer costs $400/day
- Start out with large number of servers and bad performance.
- Things can improve single server performance, but make it hard to go to multiple servers.
- Comments
- Much like Google. Buy the cheapest hardware
- Should design for scalability, not necessarily performance
- Ruby on Rails
- Optimizes for servers
- Database becomes the bottleneck
- HTML fragment caching
- UCSB CS290F – Fall 2006
- Students learned Ruby on Rails and built ecommerce site on EC2, scaling up to 10 application servers
- What worked: having a SaaS control panel to manage EC2
- What didn’t work: usage and server retention were much higher than expected
- Web framework standardization is saving grace
- EC2 is internet hardware commodity
- Ruby on Rails acts as a software container. Very little and standardized to provide the underneath tools to take care of everything
- Generic EC2 Website configuration
- 2 Front end boxes for apache and then followed by HAProxy (load balancing).
- Application server running a combination of Mongrel and Tomcat
- Round Robin DNS pointing to the tool
- MySQL masters with slave duplication
- Application servers on the side tied to a load balancer for scaling.
- Q&A
- How do you handle dynamic IP?
- IP is static for lifetime of the box. Keep DNS as stable as you can
- How do you handle dynamic IP?
- Demo
- Automated setup does not install from an image. Instead installs stuff on first boot.
- Installed from scripts using yum
- Conclusion
- There is a software gap
- Before EC2, big guys knew it would cost tons of money and hardware (millions of dollars towards building infrastructure)
- New picture: can get rid of costs, but still need to figure out how to manage things
- Small guys have big dream but simple hosting. Now can get scalable hardware platform
- Ahead
- Businesses betting on EC2
- Need for standardized stacks
- EC2
- Part 2: Q&A Discussion
- RAD Lab goal
- What are you going to do with a thousand servers?
- Build Ruby on Rails application that can scale up that high. Can Ruby scale? Another idea is to run a lot of small applications.
- Is it more effective to break up programs into replicable small clusters? May be more interesting to create tools to load balance and split customers that way.
- SCADS follows a similar mentality. Different kind of querying that can better partition a database.
- We are interested in the limits of scaling and the numbers we can scale up to or show the optimal size and how it compares.
- You can view the thousand servers as a cluster of machines, rather than a single clusters
- What are you going to do with a thousand servers?
- S3 vs. MySQL
- Why are you using MySQL instead of S3
- High Latency of S3. Need transactional data on something else. Ruby on Rails also currently needs an SQL database.
- What about Simple DB?
- Cannot really plug Simple DB into Ruby on Rails at this stage. However, it is great for social networking (everything relates to everything else), but not for regular tables. Eventually consistent.
- Have you considered MySQL 5.0 clustering?
- Clustering limited to memory tables, places bounds on how you can use it.
- Why are you using MySQL instead of S3
- SaaS components – non core, but lots of effort
- Users and authentication – open id may be solution
- Accounts, plans, pricing, and upgrades – messy
- Administration interface – how do you allow customer support to get components working.
- Marketing – banner ads, emails. Really statistics driven.
- Reporting – must have way to show all data, such as acquired customer value and customer activity.
- Measurements
- Amazon looks at 99th percentile
- Would be nice to have measurement tools
- EC2 is located on the east cost in different data centers
- Users spread out as much as possible to isolate failure instances.
- However, it is possible assigned live and backup server are actually on the same physical computer
- Virtual Google?
- What is the mental model of using these services?
- We can solve reliability at the system layer. We can use crappy machines.
- Few qualitative guarantees would help build reliable services
- What is the mental model of using these services?
- Reseller model?
- People are used to signing up for services. For those starting out really small, are there any EC2 reseller accounts?
- No
- People are used to signing up for services. For those starting out really small, are there any EC2 reseller accounts?
- Deployment Life Cycle
- Can do rolling upgrades, taking down one server so that you never reduce capacity beyond 1/N
- No need to upgrade server, just launch new one
- Can do rolling upgrades, taking down one server so that you never reduce capacity beyond 1/N
- Service partitioning
- Some large companies create databases for each customers since there are no database limits in MySQL
- Geographic Distribution
- Mold Ruby on Rails app for geographic distribution
- RAD Lab goal
Jan 30: "Rethinking Data Centers - Chuck Thacker" (David Poll)
Presentation (Scribe 1)
- The problem is that we don't treat Data Centers as systems, but rather as components.
- Current technology:
- Build a big building with lots of expensive power and cooling, THEN start installing computers
- If we're lucky, we use all the same type of computer, but that's not always what happens.
- Need to design for the peak load. After a year of operation, though, we're usually only at half-capacity.
- Build in remote areas, making maintainence more difficult. Near hydro dams, not near construction workers.
- Built so they're human friendly, not machine friendly
- Suggested future technology:
- Sun's Black Box
- Don't need to be human friendly
- Can be assembled at one location, computers and all
- Requires only networking, power, and cooled water, so build near a river. Need a pump, but no chiller. Might have environmental problems, though.
- Expand as needed in sensible increments.
- Only replace when enough computers rot.
- Black box
- Air circulates with chillers between the racks.
- 600 Amps of 480V power
- Problems:
- Servers are off-the-shelf Sun boxes with fans
- They go into the racks sideways, since they're designed for front-to-back airflow
- The servers have lots of packaging
- Cables exit front AND back
- Rack must be extracted to replace a server
- Rack is supported on springs, can withstand 6" drop, but maybe can do better.
- Proposal
- A 40' container holds two rows of 16 racks
- Each rack holds 40 "1U" servers, plus a network switch. Total container: 1280 servers.
- Could use 747 jet engines to power the data center
- Each container has independent fire suppression.
- Power Distribution
- Current picture:
- Servers run on 110/220V 60Hz 1 Phase AC
- Grid->Transformer->Transfer switch->Servers
- Transfer switch connects batteries->UPS/Diesel generator
- Lots and lots of batteries
- Only 40% efficient... not green.
- Another way:
- Servers run on 12V 400 Hz 3 Phase AC
- Synchronous rectifiers on motherboard make DC (with little filtering)
- Distribution chain:
- Grid->60Hz motor->400Hz Generator->Transformer->Servers
- Much more efficient (probably > 85%)
- Not a new idea -- The Cray 1 was powered this way in 1971
- Extremely high reliability
- Servers run on 12V 400 Hz 3 Phase AC
- Current picture:
- The Computers
- We currently use commodity servers designed by HP, Rackable, etc.
- Problem: low profit margin
- Let's design our own
- Minimize SKUs
- One for computes. Lots of CPU, lots of memory, relatively few disks.
- One for storage. Modest CPU, memory, lots of disks, like Petabyte?
- Maybe one for DB apps.
- Maybe Flash memory has a home here.
- Use custom motherboards
- All cabling exits at the front of the rack
- Use commodity disks
- Use no power supplies other than on-board inverters
- Error correct everything possible, and check that we can't correct.
- Minimize SKUs
- Management
- We're doing pretty well here. Automatic management is getting some traction.
- Could probably do better.
- Need finer-grained control, such as sensors to diagnose and predict failures.
- Measure more temperatures than just the CPU
- A lot's going on at MSFT to mine data for modeling
- Networking
- A large part of the total cost, and is a large fraction of the system.
- It's one of the last parts of the system that's not a commodity product.
- Router software is large, old, and incomprehensible, which is often the cause of problems.
- Serviced by the manufacturer, who's never on site.
- By designing our own, we save money AND improve reliability.
- Avoid problems of the internet in a datacenter, so different challenges.
- Differences:
- We know and define the topology
- The number of nodes is limited
- Broadcast/multicast is NOT required
- Security is simpler.
- No malicious attacks, just buggy software and misconfiguration.
- We can distribute applications to distribute load and minimize hot spots.
- We don't know quite how to do this yet, but it seems a reasonable goal.
- Design goals
- Reduce the need for large switches in the core
- Simplify the software
- One approach:
- Build a 64-node destination-addressed hypercube at the core.
- In the containers, use source routed trees.
- Use standard link technology, but don't need standard protocols.
- Don't use TCP, just use custom addressing with verification on the software-end, making the switch logic simpler.
- Mix of optical and copper cables. Short runs are copper, long runs are optical.
- Basic switches
- Type 1: Forty 1 Gb/sec ports, plus 2 10 Gb/sec ports.
- These are the leaf switches. Center uses 2048 switches, one per rack.
- Not replicated (if a switch fails, lose only 1 rack)
- Type 2: Ten 10 Gb/sec ports
- These are the interior switches
- There are two variants, one for destination routing, one for tree routing. Hardware is identical, only FPGA bit streams differ.
- Hypercube uses 64 switches, containers use 10 each. 704 switches total.
- ...
- Type 1: Forty 1 Gb/sec ports, plus 2 10 Gb/sec ports.
- Data center has 64-switch hypercube (4 4x4 sub-cubes with 16 links between them).
- 3 levels of distribution within each container
- Hypercube properties
- Minimum hop count
- Even load distribution for all-all communication
- Reasonable bisection bandwidth
- Can route around switch/link failures by reconfiguring FPGAs
- Simple routing:
- Outport = f(Dest xor NodeNum)
- No routing tables
- Switches are identical.
- Use destination-based input buffering, since a given link carries packets for a limited number of destinations.
- Link-by-link flow control to eliminate drops.
- Links are short enough that this doesn't cause a buffering problem.
- This is very similar to a Stanford proposal: Open-flow switch
- Minimize routing tables
- Routing within a container is fairly easy in the hypercube, since failures can be routed around easily
- Bandwidths
- Servers->network: 82Tb/s
- Containers->Cube: 2.5Tb/s. Not great, but considering right now we only have gigabit ethernet.
- Averages probably lower
- Objections
- "Commodity hardware is cheaper"
- This IS commodity hardware
- "Standards are better"
- "All other things being equal, standards are better", but specialization reduces cost.
- "It requires too many different skills"
- Not as many as you might think, and the deficient skills can be purchased.
- "If this stuff is so great, why aren't others doing it?"
- They are.
- This is not uncharted territory, it's something we could do, and do now.
- "Commodity hardware is cheaper"
- Conclusion
- By treating data centers as systems, and doing full-system optimizations, we can achieve good results.
- How much lower cost would this be?
- $10m to $2m
- Closed networks are popular these days, and work very well.
- Closed networks are things that can achieve full connectivity.
- Revisions could happen on roughly 3-year intervals after deployment.
- FPGAs can be used in production without much risk, and are used by folks like Cisco.
- Datacenters are copies of each other within the 3-year cycle
- Size can be scaled by changing the number of containers.
- Future: consolidation in the works
- Couple the data center with VM, allowing lots of consolidation
- Allows computing as a utility
- We now have the bandwidth and storage for this.
- EC2 and S3 are just the tip of the iceberg
Discussion (Scribe 2)
//Note: got timed out of my login on the wiki and lost some of this section... Trying to recreate some of the major points
- Datacenter + Cellphone is the SaaS future
- Hundred-dollar laptops have a particular problem with heat dissipation, making them less feasible
- Cost is still too high for 3rd world companies
- Cell phones are the future for 3rd world countries, although the most popular ones today are the inexpensive, simple (not computing-worthy) phones
- There exists a dearth of applications for these devices in the 3rd world
- One possible application: education (such as educational video games that would pull kids away from the TV)
