hadoop-common-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Hadoop Wiki] Update of "DeveloperOffsite20090612" by TomWhite
Date Wed, 17 Jun 2009 17:42:03 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by TomWhite:

New page:
On the Friday after the [http://developer.yahoo.com/events/hadoopsummit09/ Hadoop Summit 2009]
a group of Hadoop committers and developers met at Cloudera's office in Burlingame to talk
about Hadoop development challenges. Here are some notes and pictures from the discussions
that we had.


= Attendees =
Eric Baldeschwieler, Dhruba Borthakur, Doug Cutting, Nigel Daley, Alan Gates, Jeff Hammerbacher,
Russell Jurney, Jim Kellerman, Aaron Kimball, Mahadev Konar, Alex Loddengaard, Matt Massie,
Arun Murthy, Owen O'Malley, Johan Oskarsson, Dmitriy Ryaboy, Joydeep Sen Sarma, Dan Templeton,
Ashish Thusoo, Craig Weisenfluh, Tom White, Matei Zaharia, Philip Zeiliger

= Five things =

Things we don't like about Hadoop
 * config hell
 * unit tests need improvement
  * too few unit tests vs functional tests
 * Hudson, JIRA, patch process woes
  * too slow (hudson)
  * too much mail (JIRA)
 * Hard to debug / profile
 * Confusing / arcane APIs
  * JobInProgress
  * Dcache
 * Docs could use improvement
  * new functionality lacks specs, test plan

Things we like about Hadoop
 * community
 * open source
 * ecosystem
 * good components: HDFS, ZK

= Project challenges =

Nigel brings up:
Patch Development Process:
 * pretty convoluted
 * attach patch to JIRA, Hudson, committer takes it
 * Releaseaudit and other output unreadable
 * Running ant hudson-test-patch takes five hours, usually fails
 * Compiling is slow too

... for testing we need to know that the tests are sufficient -- this means that docs are
Test plan template: What are you testing? what are the risks? How do you test this?[[BR]]
Phil: What are examples of *good* JIRAs?[[BR]]
Doug: Should we add a test plan field to JIRA?[[BR]]
Nigel: People need to take ownership of features and consider scalability, etc.[[BR]]

The patch queue is too long...[[BR]]
Doug: committers need to be better here[[BR]]
Dhruba: doing reviews right is hard[[BR]]
Phil: What do people do here?
 * read patch by itself
 * apply and use a diff viewer
 * Can we use a review tool -- e.g., FishEye, Reviewboard? Phil will take point on this
Doug: We need a "wall of shame" to incentivise people to review other peoples' code rather
than just write new patches of their own

Pushing Warnings to Zero:
 * Running test-patch is slow
 * "rat" has verbose output, needs an overrides system
  * nobody really volunteering to fix this
 * checkstyle has 1200+ warnings =-- this is absurd
 * Need ECliupse settings file that reflects reality
 * Patch system disincentivizes "janatorial work" to lines near what you're really working

Typo Fixes, etc:
 * Can we commit things without a JIRA?
 * Can we commit things without a review?

Most things require both[[BR]]
Exceptions: www site, rolling a release[[BR]]
How do you continuously refactor?[[BR]]
Phil : Can we agree that it's ok to fix comments/typos/etc in the same file you're working
Doug Yes, but people need to say that this is what they're doing in the JIRA comments[[BR]]
Tom: This should be clarified on the HowToCommit wiki page[[BR]]

 * Eclipse style file (Phil)
 * NetBeans style file (Paul)
  * both of these should be in top level and included via svn-extenrals
 * Tune checkstyle config (Tom)
 * Wholesale codebase sweep
  * Need to clear patch queue first?
 * Per-project Core (Owen) / MR / HDFS -- who will take care of the last two here?
 * Enable in patch testing (Nigel)

Matt: Could we think about an auto-reformatting patch service?
(lukewarm reaction here)

 * We have 10 new machines which can be used soon
 * Need "ten minute test" so that individuals can run this
  * Some MR folks are removing MiniMR from tests where it's not needed
 * Phil: We need a build sheriff to nag on build/test breaks
 * Nigel: need continuous build on all active branches on commit
  * not just patch-isolated tests
  * need code coverage (and diffs of this)

 * Functional tests are not first-class citizens
  * need a harness for running shell scripts, etc.
  * maybe run on EC2? Alex will do some work here

 * Phil will work on improving docs related to test utils.

 * We should start using some mock framework
  * JMock, EasyMock, etc
 * LocalJobRunner needs improvement
 * TestNG allows for shared MiniMRCluster which will help
  * This should do things like *not* start Jetty, etc.

Build System:
 * Ivy vs. Mason
  * Nigel is working with someone to build a POM repo
  * Should we convert to mvn?
  * How bad is Ivy itself vs. just our usage of Ivy?

 * Pig and Hive both do something different for depending on Hadoop

 * Forrest needs to go.
  * Todd will investigat migrating soimewhere else
  * There are other XML-based documentation systems

= Wish Lists =

== MapReduce ==
 * JT API (e.g., REST)
  * pluggable job submission API
  * InputSplit generation as a task
 * Simplify JT/TT interaction
 * MR 2.0:
  * Separate persistent rsrc manager from the per-job manager
 * Avro input format
  * For Pipes, too
 * Preemption and better priorities
 * JobInProgress refactoring
 * JSON Job History
 * Pipeline MR (M*R*)
 * Streaming in mapred, not contrib

== MapReduce and HDFS ==
 * Separate client/server (interface/implementation) jars, packages

== HDFS ==
 * Append and 4379 (flush/sync)
 * NN metadata inspection tool (already exists in some form in 20)
 * Multi-NN for same DNs (federated NN)
 * Multi-DC Hadoop
 * HDFS mounts + symlinks
 * pluggable block placement
 * DFSClient requery NN if it can't find a block

== Build And Test ==
 * One framework for system testing (loca,l and in the cloud)
 * Standard build and dep managemtn framework
 * Failure injection
  * Based on Aspect/J
 * Back compatibility and verificatino framework
  * Need ability to run big sets of real jobs without apache necessarily owning those jobs
  * A messaging issue to the community is that non-committers can and should independently
test and then vote +1/-1 on release candidates
 * Symlink test suite

== Avro ==
 * language-agnostic RPC (C, C++, Ruby, Python...)
 * column store format in Avro
  * or put in pig/contrib?
 * Standard text format (currently only has a binary fmt)
  * JSON based
 * RPC proxy for testing

== Common ==
 * config variable names, uniform schema
 * better errors for bad config
 * range testing for config values
 * LDAP-based config
 * Config should support lists of elements, and not just a big string "value" wiuth commas
 * remove non-public things from public API

== Pig ==
 * decent error messages
 * column store format in Avro

= Subproject discussions =

== MapReduce ==

== HDFS ==
 * Append/Sync : It is being targeted for 0.21 release.  Project status "yellow".... meaning
that it could be "at-risk" for this release.
 * HADOOP-4379 : seems to satisfy the append-tests that the Hbase team is doing. The Hbase
team is requesting that this jira be included in some sort of a Hadoop release.
 * HDFS mounts and symlinks : HADOOP-4044. This is very much essential for federating a flock
of hadoop clusters. This is targeted for 0.21 release.
 * NameNode scaling to large number of files: One idea that was discussed was to pagein/pageout
metedata from the namenode memory on demand. This introduces aditional complexity in the NN
code, especially needs fine-grain locking of data structures. This is not going to be attempted
in the short term. Federating the hadoop namespace across multiple hadoop namenode (using
symlinks) would be a way to solve the problem of a "large number of files".
 * DFSClient read latency performance: It was discovered that the performance bottleneck was
because of a single thread doing sequential reads from one data block followed by the next.
An alternative approach would be to open multiple DFSClient instances, and make each instance
read different blocks in parallel. This can improve the latency of a single file-read tremendously.
Todd Lipcon will try this out.

== Pig/Hive ==
Approaches to column storage were discussed. The Hive and Pig implementations are fundamentally
different -- Hive uses a PAX-like block-columnar approach, wherein a block is organized in
such a way that all values of the same column are stored next to each other and compressed,
with a header that indicates where the different columns are stored inside the block.  This
file format, RCStore, can be used by Pig by virtue of writing a custom slicer/loader.   The
Pig representatives indicated that they were interested in experimenting with RCStore as an
option for their workloads.

The Pig model splits columns into separate files,  and inxeses into them; there is a storage
layer that needs to coordinate the separate files. The layer is (naturally) able to do  projection
pushdown, looking into also pushing down filters.  Pig integration is in progress.  No technical
impediments to implementing a Hive SerDe for Zebra.

Sharing code and ideas would be easier if there was a common place to look for this sort of
stuff. For example, RCStore can be pulled out of hive -- but where? Commons? Avro?

Neither project has plans for storage-layer vector operations (a-la vertica).

Metadata discussion pushed off to next week.

Join support is roughly equivalent for both systems (map-side aka FRJoin, and regular hash
join).  Neither supporting or planning on bloom joins (no real demand -- be it because the
users aren't familiar with it, or because the workloads don't need it, is unknown).  Either
way, "we accept patches ;-)"

Blue-sky stuff:
Oozie integration -- a way to define a hive node? Some other custom processing node?  Having
Hive and Pig clients automatically push queries off as Oozie jobs so that they are restartable,

Pig is looking to move off JavaCC.  Currently leaning in the direction of cup+jflex ; hive
uses Antlr, which is now a lot less memory-intensive as they've started using lookaheads.
This information might be putting Antlr back on the table for Pig.

== Avro/Common ==

=== Configuration ===

Registry class for configuration that allows you to "register" configuration 
 * Allows for easier documentation and more control 
 * Allows for warning for unknown keys to help with typos
 * Give support for description of units and range testings
 * Do before 1.0
 * Push the value validation into the registered class 
  (e.g. range tests, units, allows for <elem></elem>, umask in octal (HADOOP-3847)
 * Allows for explicit tagging/filtering of "stable", "experimental" and "deprecated" 

Hadoop configs should be read from a distributed filesystem (HADOOP-5670)
 * Read from LDAP, ZK, HTTP

=== Avro ===
 * C/C++ Support
 * Single RPC Mechanism across all Hadoop processes/daemons
 * Language Agnostic RPC w/ versioning and security
 * RPC Proxy Object (e.g. fault-injection, light-weight embedding)
 * To/From Import/Export JSON

== Build And Test ==

Top Level components Nigel proposed:

 * Backwards Compatibility Testing
 * System Testing Framework
 * Patch Testing Improvements
 * Test Plan Template (for JIRA)
 * Mock Objects

Backwards Compatibility Testing:

 * API (syntactic)
  * Static Analysis
 * API (semantic)
  * We need scrutiny during code review when compatibility breaks; ways to determine if compatibility
is broken during CR:
   * Be strict with the JIRA incompatible change flag
   * jdiff changes
   * Javadoc changes
   * Test changes (e.g., a JUnit test changes)
  * Other ideas:
   * Community-driven testing
    * Either allow users to submit a workflow and test their workflow for them, or
    * Just have users submit to a website if compatibility is good for them
     * Think of a long checklist, where each item is a volunteer running Hadoop that says
if compatibility is good or not
   * Write API tests around 1 or 2 key APIs
    * Get well-defined specs for these APIs
    * Test all assertions
   * Sub-project test suites can help a lot
    * Solr, Pig, Hive, Mahout, HBase, Chukwa
    * Sub-projects can be part of the community idea mentioned above
 * Protocol
  * Not 0.21, because of Avro
 * Config
  * Syntax-static analysis (diff), also semantic tests apply
 * Data

System Test Framework:

 * Shell framework (perhaps a derivative of Y!'s)
 * Run on cluster (possibly AWS) or in local debug mode
 * Can run system and performance tests

Patch Testing:

 * RAT - need more control
 * Code coverage diff (really interesting)
 * Better test detection
  * +0 if they have a test/ folder
  * +1 if they have test/**/Test*.java
  * +1 if they have a new @test or "test*(" in a test/ file
  * -1 otherwise
 * Other tools (JDepend, Classycle)
 * Sonar (would run alongside Hudson)

Test Plan Template:

 * New features have the template in the JIRA
 * Nigel has done some work on this

Mock Objects:

 * (nothing discussed)

View raw message