Projects/Monitoring Hadoop through Tracing/Project Journal

From RAD Lab

Jump to: navigation, search

Contents

Project Page Link

You're reading our project journal, also check out our Project Homepage

October 3, 2007

Enlarge
9:00p - 11:00p, Partner Programming session

We have more or less finished adding X-trace to Hadoop and are now able to view large beautiful graphs of a simple map reduce job (a word counting job that came with Hadoop as an example).

See a big bushy trace in action here: http://viz.x-trace.net/cgi-bin/graph_report.pl?&id=BADC0DE0ABB425AB

October 7, 2007

10:00p - 10:30a, Chatted with Kuang about his project

Kuang is doing a project with P2, an overlay network. He wants to create a distributed X-trace query engine where we take the computation involved in querying to the nodes, very much in the spirit of hadoop. He is hopefully going to be meeting with the SIG Trace group tomorrow to pitch the preliminary ideas for his project.

October 8, 2007

10:00 - 10:30a, Meeting with Ion Stoica about progress and next steps

Tasks:

  • Get our hadoop running on the i3 cluster
  • make a list of obvious failure models to run
    • complete node failure (single and multiple)
    • packet losses or packet delays
    • ...
  • get a real workload running on it nutch
  • measure tracing overhead (time to run job with tracing turned on, then with tracing turned off (does it scale?)

October 13, 2007

Enlarge

We set up Hadoop on the RAD Lab cluster!

Command to start X-Trace on Hadoop slaves and master:

for i in `cat conf/slaves conf/masters`; do ssh r$i 'nohup /work/matei/xtrace-trunk/sbin/FrontendDaemon >& /scratch/matei/xtrace-trunk/fe.log &'; done

October 14, 2007

First correct multinode DFS write X-Trace: http://viz.x-trace.net/cgi-bin/graph_report.pl?&id=DF5000006D3F89D0.

And a multi-block write (200 MB file): http://viz.x-trace.net/cgi-bin/graph_report.pl?&id=DF5000009FD7493C.

Multi-node MapReduce with 4 mappers and 4 reducers: http://viz.x-trace.net/cgi-bin/graph_report.pl?&id=BADC0DE0B8E52103.

October 17, 2007

Enlarge

We have set up Nutch to use our Hadoop running on the 40 node RAD Lab cluster! We are crawling the RAD Lab website.

We also have the first trace of a failed job (MapReduce with 1 mapper and reducer): http://viz.x-trace.net/cgi-bin/graph_report.pl?&id=BADC0DE0C4AFF253. Notice the four retries on the Reduce task. The Hadoop overview page for the job is shown in the attached image.

Finally, we set up a trip to Facebook next Tuesday.






October 22, 2007

5:00p - 9:00p - RAD Lab hacking session

Matei and I got Facebook's Thrift instrumented using X-trace tonight. Check out our traces of an java to java RPC call so far: check out an actual thrift trace










We also figured out the answer to a question we've had about some weird behavior of hadoop's map-reduce. We were noticing that map and reduce tasks were failing (actually, getting killed) way more than we expected, like multiple times per relatively small map-reduce job. The behavior can be seen in this graph (also see thumbnail and the the source of the thumbnail, i.e. the hadoop failed jobs).








Then we discovered, by looking at the Hadoop Tasktracker logs and also by inspecting the cpu utilization across the cluster (see the thumbnail) that the nodes which were failing were the nodes which were most heavily loaded (see the red and orange "ganglia" graphs in thumbnail or go here to see the RAD Lab cluster status). These buys nodes were running some other students workloads and running at ~350% (out of 400% since the nodes are dual dual core or quad core [i can't remember]) cpu utility. The Tasktracker logs revealed that the map or reduce tasks assigned to these busy nodes were reassigned to another node exactly one minute after the original assignment to the busy node. The 2ndary assignments were able to finish before the originally assigned jobs completed so the namenode (the master) would kill these "laggards". We found this behavior FASCINATING!

October 23, 2007

10:30a - noon :: Meeting with Facebook

The gang on our fieldtrip (George, Kuang, Matei, Gunho)
Enlarge
The gang on our fieldtrip (George, Kuang, Matei, Gunho)
They hired a professional graffiti artist to paint their walls  (Matei, Kuang, George)
Enlarge
They hired a professional graffiti artist to paint their walls
(Matei, Kuang, George)
A group of us went on a field trip to Facebook's headquarters in Palo Alto today. We had a great time and learned a lot. They are interested in the Hadoop tracing we are doing and verified that they see some of the same issues we do such as a potentially significant (performance-wise) non-scaling job setup overhead for any map reduce.


Matei and I also met to discuss the next steps for our project, reviewing our "plan" which was part of our class slides. Here are the key areas we identified during our meeting, displayed in prioritized fashion:

  • Workload generation
    • Get our own mirror of wikipedia to crawl
    • Crawl the entire UC Berkely site
    • Crawl the web
  • Improved Visualization
    • Add links to visualization to hadoop control panel and vice versa
    • Make a network message flow graph (time on vert axis, arrows moving diagonally down) by changing the perl script that generates the graph vis graph right now.
  • Trace Analysis
    • compute simple table statistics on trace data by TaskID (count, sort ASC & DESC by all fields, split up the blob we are working with now in the value key pair model into something useful to work with)
    • compute critical paths - for map-reduce branch that converges last
  • Swap out back end for something better suited than Posgres such as Scribe
  • After we get scribe, possibly switch to thrift for ipc (so our model is just like facebooks), make a nice package we can send to facebook for testing on a big workload that they have agreed to send us the [sanitized] results from.

October 25, 2007

9:30a - noon :: towards tracing statistics

As a first step in the direction of doing some more complicated and hopefully more insightful machine learning on the traces we are currently generating in our instrumented version of Hadoop, we have decided to begin with a sortable table of per-taskID trace information. Today we met to plan our attack on this project milestone. We have decided to use write an analysis library which we can call from an ERB web-frontend to display pretty statistics and related graphs on a web control panel. We are not using Rails for now, but decided to start of simple, Ruby off Rails style.

Other than improving our trace analysis mechanism, our next major goal is to improve the trace visualization tools we are currently using (i.e. graph viz generating svg images). The current solution will not scale with the size of our eventual traces of thousands of nodes in the datacenter. Thus, we set up a meeting with Jeff Heer, a local Berkeley graphing pro. See prefuse.org for his work on Java-based visualization tools and flare.prefuse.org for his flash base tools.

Enlarge

October 28, 2007

noon - 3:00p :: SVN set up

We are now using SVN for our work on generating some simple statistics which we are actively coding in Ruby. We are also working on generating more functional graphs in Adobe's Flex.

Check out our SVN repository


7p - midnight

We now have the first iteration of some actual statistics on hadoop traces. Here is an actual example (running on our hand coded web server using Ruby off the Rails by the way) http://r17.millennium.berkeley.edu:7000/?taskid=BADC0DE03F8C3A1C Simple, we know, but it's a start. Soon enough, we'll be doing linear regression and EM. That link might go down when we get back to work on this tomorrow, so don't be sad if it is broken. Remember, if you want to see the magic behind the trace summary, check it out of the SVN

November 7, 2007

Enlarge

X-Trace graphs, stats and text reports are now accessible directly from the Hadoop MapReduce control panel. Additionally, we are working towards support for multiple backends for X-Trace. We are also in the process of setting up a Wikipedia mirror for getting longer jobs traced (we want to index Wikipedia using Apache Nutch).

November 14, 2007

Enlarge

We have the first of several planned graphs in the X-Trace Statistics page. This graph shows machine utilization over time, and it can be especially interesting in tasks where either the reduce part or the map part takes the bulk of the time.

We have also crawled about 1.5 million pages of Wikipedia text which we will index as our "large workload". Further crawling is still going on as we work on statistics and other aspects of X-Trace.

Finally, George Porter and Lisa Fowler are continuing work on integrating X-Trace with different logging backends.

November 16, 2007

Enlarge

We've made a number of improvements to the trace analysis web UI in the past few days:

  • The script now lists the latest tasks if given no argument (example: http://r17.millennium.berkeley.edu:7000/)
  • New graphs: map duration distribution, reduce duration distribution, and number of active mappers and reducers versus time. These graphs really point out the effect of outliers. See the attached picture for an example showing both lagging maps and reduces that time out and set the entire computation back.
  • Critical path analysis for the longest map and longest reduce, showing which phase of the computation they spent their time in. This is especially interesting because it varies between job types.

We have also crawled about 2 million Wikipedia pages and are beginning to run indexing jobs on them.

November 22, 2007

There's been work on a number of fronts in the past few days:

  • We've tried the first few really large traces (~5000 tasks and ~400,000 X-Trace events) and discovered a few scaling issues with X-Trace, then fixed them of course. One interesting one was that the 32-bit random operation IDs we were using are fairly likely to collide after about 50,000 events by the birthday paradox, so we had to implement bigger operation IDs to get meaningful large traces.
  • As part of this, we've also written a second, standalone X-Trace backend that consists of a single Java process which uses BerkeleyDB for storage of reports and the Jetty embedded HTTP server to provide a web interface for listing latest tasks and retrieving reports for a specific task. This is easier to work with than the old PostgresSQL backend, fast, and also easy to set up. It will probably turn into the default backend provided with the next X-Trace release.
  • We are adding more instrumentation and starting to design experiments to run now that the tracing and trace analysis are set up.

November 27, 2007

Enlarge
We are still working hard with a class poster session tomorrow and a project paper due in a couple of weeks. Here is what we've been up to:
  • Crafting up a logo
  • Still adding more statistical analysis tools (like more graphs)
  • Refining our instrumentation of Hadoop (e.g. we improved the OP_WRITE_BLOCK tracing to collect more accurate timing for datanode writes and added graphs summarizing those timings)
  • Actively contributing to the X-Trace development community (which is steadily maturing, with a new wiki, developers mailing list, etc.)
  • Exploring closer collaboration with CMU researchers who are also working with Yahoo!, Google, and IBM.
  • Running larger and larger jobs on our clusters here at Berkeley.
  • Trying our patch on older Hadoop versions to detect bugs.

December 2, 2007

DFS can be slow even for small reads.
Enlarge
DFS can be slow even for small reads.
Write performance of different machines. The second machine turned out to have a faulty hard drive.
Enlarge
Write performance of different machines. The second machine turned out to have a faulty hard drive.

As the term wraps up and this project is the only thing left on our agendas, we are getting neat results in a number of areas. Some developments include:

  • Finding a slow machine in our cluster using tracing, which was confirmed to be having disk problems.
  • Rolling back our Hadoop version to 0.14.0 and showing that tracing can identify a known bug in that version.
  • Taking a closer look at DFS performance. One interesting result from traces is that DFS reads and writes can take an enormous amount of time (up to 5 minutes), almost regardless of data size. We've created a number of graphs plotting performance vs different axes to track this down. One cool thing about path-based tracing is that we can plot graphs from old traces just as easily as from new ones and go back in time to see how old jobs performed. While some of these slow IO operations may be due to gradual reads by the map tasks, there are definite cases in which small reads at task startup are slow or in which writes (which happen one chunk at a time, without streaming) are slow. We even found an example with 200-second small reads where the job takes about 40% longer than average to finish because the tasks take minutes to start. We are currently trying more systematic ways of finding the cause of this problem as well as techniques to improve the performance.
  • Building contingency tables of RPC's and DFS operations from traces, which are a powerful tool for automatically detecting buggy behavior using machine learning. The idea is to then compare contingency tables for different jobs using a chi-square test to detect if they are significantly different.

December 6, 2007

An example of the contingency table
Enlarge
An example of the contingency table
Towards the goal of doing machine learning, we have added a contingency map to our statistics page. Basically, it shows a count (and more detailed timing statistics which are collapse-able by clicking) of pairs of DFS and MapReduce functions that happen sequentially in a trace. The next step will be to do many runs of a particular medium sized job, then we will compute an average for each cell (or a set of averages for each cell) in the contingency map and use those averages as the expected value for a Chi-Squared test to detect anomolies in the contingency table.

Also, we met with the rest of the X-Trace team today to discuss the next release of X-Trace, many of the features of which we have already helped contribute, such as an improvement of changing the OpID from 32bit to 64bits, and the suggestion to use a cryptographically secure random number generator for the OpID as well.

Jan 31, 2008

After our winter-break vacations, we are back into our work on Hadoop/X-Trace. We finished our Internet Data Centers course with Randy Katz and our Machine Learning course with Michael Jordan and wrote a paper for each course:

Mar 21, 2008

We are going to be speaking at the first Hadoop Summit! It looks like attendance is full, but you can still sign up to watch online.

Mar 23, 2008

We're gonna be on the radio!

The Graduates is a weekly KALX show dedicated to graduate student research. On the show, Stephanie Gerson interviews graduate students across campus about their work on topics ranging from web-enabled paintball guns to the effect of environmental, health, and social information on consumer behavior. April 14th features an interview with Andy, a PhD student from Computer Science, about measuring and improving the performance of programs that run on super sized computer clusters. Visit "The Graduates KALX" on Facebook, and tune in!

Mondays 12:00-12:30
KALX (90.7) or streaming off the web: http://kalx.berkeley.edu
Facebook page: http://tinyurl.com/2fvkte

Mar 26, 2008

The Hadoop Summit yesterday was a huge success. Both for the future of Hadoop at large, and for the use of X-Trace as a tool for software debugging, hardware failure detection, and performance optimization. Here are the slides from the talk Andy gave: (PDF PPT)