Berkeley Hadoop SIG

From RAD Lab

Logistics

Mailing list: hadoop@lists.eecs.berkeley.edu

Members: Anthony, Randy, Joe, Mike, Andy, Matei, Rodrigo, George, Ganesh, Neil, Tyson, Peter, Rusty, Michael (Add yourself here!)

Meeting Time: Monday at 4pm

Description:

Over the last year or two, a considerable amount of research at Berkeley has been directly or indirectly related to Hadoop, the open source implementation of MapReduce and a distributed file system. We would like to organize a brief weekly meeting for those interested in keeping up to date on other Berkeley projects that use Hadoop, and to facilitate collaboration between those projects.

Past projects:

Current projects related to Hadoop:

  • P2 under MapReduce (Tyson)
  • P2 under a DFS under Hadoop MR (Neil and Peter)
  • Job scheduling for shared Hadoop clusters (Matei)
  • Chukwa: a large-scale framework for metrics, trace, and data collection, analysis, and visualization (Ari)
  • Using Chukwa (and in turn Hadoop) for ML log mining (Wei)
  • Towards Energy Efficient MapReduce (Yanpei and Laura)
  • Hypertable (SCADS)

Proposed projects

Other Ideas for projects related to Hadoop

  • Factoring resource utilization (CPU load, memory, etc.) into scheduling decisions
  • Speeding up small jobs in Hadoop (Ari?)
  • Archival storage in HDFS using high-disk nodes (Thumpers) -> is there a way to power-manage the disks on the Thumper to make this cost-efficient compared to a NAS?
  • Using Flash storage to speed up MapReduce or HDFS
  • A storage management abstraction for Hadoop enabling pipelining, selective caching, etc.
  • Reliable and scalable continuous/streaming computations using Hadoop
  • Are there extensions we can make to the MapReduce programming model that will enable more applications? (Eric Brewer)
  • Rolling lessons learned from Hadoop into an open source Hadoop successor
  • Using an LFS underneath Hadoop MR (similar to Zebra FS)
  • Implementing network or disk I/O isolation in Xen
  • Memory-only Hadoop (or flash-disk-only)
  • Exploring RAID5-like techniques for reliability in HDFS with less storage overhead than storing k copies (erasure coding, etc.) (Neil)
  • Exploring connections between Hadoop and parallel relational databases (Neil)
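
The storage-overhead comparison behind the RAID5-like idea above can be worked through with a little arithmetic. The sketch below is illustrative only: the function names and the example parameters (3 copies, 4+1 and 10+4 stripes) are assumptions made for this example and are not taken from HDFS or from any project listed here. It assumes an ideal (MDS) code such as Reed-Solomon, where a stripe with m parity blocks survives the loss of any m blocks, while k full copies survive the loss of k-1 copies.

```python
from fractions import Fraction


def replication_overhead(copies: int) -> Fraction:
    """Raw bytes stored per byte of user data when keeping `copies` full replicas."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return Fraction(copies)


def parity_overhead(data_blocks: int, parity_blocks: int) -> Fraction:
    """Raw bytes stored per byte of user data for a stripe made of
    `data_blocks` data blocks plus `parity_blocks` parity blocks."""
    if data_blocks < 1 or parity_blocks < 0:
        raise ValueError("need at least one data block and no negative parity count")
    return Fraction(data_blocks + parity_blocks, data_blocks)


if __name__ == "__main__":
    # Hypothetical parameters chosen for illustration, not real cluster settings.
    print("3 full copies:      ", float(replication_overhead(3)))
    print("4 data + 1 parity:  ", float(parity_overhead(4, 1)))
    print("10 data + 4 parity: ", float(parity_overhead(10, 4)))
```

Under these assumptions, three copies cost 3x raw storage and tolerate two lost copies, a RAID5-like 4+1 stripe costs 1.25x and tolerates one lost block, and a 10+4 stripe costs 1.4x and tolerates four. The sketch ignores the costs the project idea would actually have to study, such as reconstruction traffic and lost data locality for MapReduce tasks.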

Meeting Agendas/Minutes:

Meeting Mon Mar 30

  • Hadoop Summit plans
  • Update on Benchmarking project, Yahoo logs

Meeting Mon Mar 2

  • Tyson talked about their "replacing the brain with P2" project
  • Peter gave a quick overview of the HDFS using P2 project
  • They are looking to share data between the two
  • Partition the DFS master
  • Parallelize the tasktracker operations; the queries are the bottleneck.
  • Also working on a Hive optimizer, similar to shared scans
  • Charles R.: characterizing variability
  • Who to talk to about benchmarking background:
    • Mahoul @ HP Labs
    • Meet with Joe
    • Maybe we want to normalize by dollar; Mahoul is looking at energy, JouleSort (jobs/carbon-unit)
    • What about variability? 99.9% SLA. DB benchmarks usually require three consecutive runs, i.e., repeatability.
  • Ari is now a committer on Chukwa, which is now a full-fledged Apache subproject of Hadoop (before, it was a contrib project)
  • Update from Yahoo about their efforts on terasort
  • Use the mailing list! Keep each other up to date on Hadoop-related news.
  • Digging into lower OS levels, disk caching, network, etc. Tuning an EC2 AMI? Replacing the guts of Hadoop like Tyson did, but with low-level libs (e.g. C routines).
  • Randy's idea: a closer look at the tradeoff of how soon to identify something as failed vs. the overhead of false positives.
  • Matei: seeing all sorts of reactions to configurations: number of threads, amount of disk vs. memory (heterogeneous disks and file block distribution).
  • Randy: the goal is to understand what the sources of variability are. This is the idea behind Charles' project.
  • Tyson: they want the system to be more fully aware of failures and their causes (masking should be from the app developer, not the system); they keep /proc files in the relations.
  • Anthony: what about having Hadoop remember the history of previous tasks?
  • Ari: making Hadoop handle ML jobs better.

Meeting Mon Feb 23

  • Can we build the successor to Hadoop?
    • What would we want to change?
    • What did Dryad do right? Wrong? What can it do that MR can't (which apps?)
  • What can we contribute that Yahoo/Facebook are not contributing or cannot contribute?
    • A completely new system, since they are restricted to just improving Hadoop
    • Arun/Owen have been talking about Hadoop 2.0
  • Can we build just a new MR (and not HDFS, or re-use HDFS)?
  • Yahoo has expressed a need for a workflow manager (maybe there is a JIRA here)
  • Anthony: what are the shortcomings with current (Hadoop MR) compute models that big-data
  • Find out from NERSC people and ML people which workloads don't fit into MR (but maybe fit into Dryad or Hadoop 2.0?)
  • What are the metrics for a successful Hadoop? (Archana)
    • Power efficiency
    • Programmer productivity (overhead to get serial code into this paradigm)

Comments on project ideas

  • Anthony likes flash storage project idea

Brief overview from everybody present on their projects or ideas

  • Michael - better storage atomics, like BDB for distributed systems
  • Neil - summary of their Hadoop projects: job scheduling in Hadoop and a new P2 file system from scratch
  • Archana - expanding her previous work. Given a search space of configurations and a workload, figure out (guess with ML) how it will perform, and adjust the configuration automatically.
  • Charles - looking for a project still, probably something related to "how people really use hadoop"
  • Ganesh - interested in learning more about Hadoop
  • Yanpei - has two projects: power and incast
  • Ari - optimizing short jobs for ML students

Ideas for where to go with this group

  • Bring in undergrads?
  • Help ML students give us traces of non-traditional jobs
  • Bring in an ML student/person who can tell us what they would want
  • Bring in a scientific computing person to tell us what they would want to do with Hadoop

First draft for a Mission Statement for this SIG

  • To understand how Hadoop works, and what its limitations are
  • To understand the problem domain, especially what parallel apps exist (ML, HPC, Web)

Announcements:

  • The LATE patch is being reviewed by Yahoos (Devaraj, and Arun, who is working on terasort to compete better with Google's terasort)

Meeting Wed Feb 11

Discussed creating a SIG. Need to contact other people who might be interested, come up with a game plan for a Hadoop research agenda, and list existing Hadoop projects.