chukwa-dev mailing list archives

From "Eric Yang (JIRA)" <>
Subject [jira] Created: (CHUKWA-481) Improve demux reducer partitioning algorithm
Date Tue, 27 Apr 2010 22:03:35 GMT
Improve demux reducer partitioning algorithm

                 Key: CHUKWA-481
             Project: Hadoop Chukwa
          Issue Type: Improvement
          Components: MR Data Processors
         Environment: Redhat EL 5.1, Java 6
            Reporter: Eric Yang
            Assignee: Eric Yang

Reducer partitioning for demux could be redefined to optimize for two different use cases:

Case #1: demux crunches large volumes of data spread over a small number of data types (about a
dozen).  Here it probably makes more sense to partition reducer input by time grouping + data
type (extending TotalOrderPartitioner), so that each reducer gets an evenly distributed workload
based on time interval.  A distributed hash table such as HBase or Voldemort could be the
downstream system that stores/caches the data for serving.  This model is great for logs
collected at fixed time intervals, such as Hadoop metrics and ExecAdaptor output, which produce
repetitive time series summaries.
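
To make the time grouping + data type idea concrete, here is a minimal sketch in plain Java with no
Hadoop dependency.  It is not Chukwa code: the class name, the 5-minute bucket width, and the
key layout are all assumptions made for illustration.  It builds a composite key from the time
bucket and data type, then picks a reducer by binary search over sorted split points, which is the
general range-partitioning approach TotalOrderPartitioner takes.

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical sketch of range partitioning on (time bucket, data type). Not Chukwa code. */
class TimeTypePartitionSketch {
    // Assumed bucket width of 5 minutes; the issue does not specify an interval.
    static final long BUCKET_MS = 5 * 60 * 1000L;

    /** Zero-padded time bucket first, then data type, so string order follows time order. */
    static String compositeKey(long timestampMs, String dataType) {
        long bucket = timestampMs / BUCKET_MS;
        return String.format(Locale.ROOT, "%013d/%s", bucket, dataType);
    }

    /**
     * Range partition over split points sorted ascending.
     * With n split points there are n + 1 partitions, numbered 0..n.
     */
    static int partition(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Exact match: the key starts the partition to the right of that split point.
        // No match: binarySearch returns -(insertionPoint) - 1, so recover the insertion point.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }
}
```

In a real job the split points would come from sampling the input so that each range holds a
similar amount of data; here they are simply passed in by the caller.
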
Case #2: demux crunches hundreds of different data types, each with a small volume.  The
current demux implementation uses this model, where a single data type is reduced by one reducer
slot (ChukwaRecordPartitioner).  One drawback of this model is that every data type must have a
similar volume; otherwise, the data type with the largest volume becomes the long tail of the
mapreduce job.  Materialized reports are easy to generate with this model because the single
reducer per data type has a view of all the data of that type in the given demux run.  This model
works great for many different applications and for all logging through the Chukwa Log4j
appender, e.g. web crawls or log file indexing/viewing.
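
The long-tail drawback of case #2 can be illustrated with a small sketch, again in plain Java.
This is an assumption about the general shape of a per-data-type partitioner, not a copy of
ChukwaRecordPartitioner; the class and method names are hypothetical.  Every record of one data
type lands on the same reducer, so the busiest reducer carries at least the volume of the largest
data type.

```java
import java.util.Map;

/** Hypothetical sketch of one-reducer-per-data-type partitioning. Not Chukwa code. */
class PerTypePartitionSketch {
    /** All records of a data type map to the same reducer (hash of the type name). */
    static int partition(String dataType, int numReducers) {
        // Mask the sign bit so the modulo result is never negative.
        return (dataType.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sums the record counts that end up on each reducer, to expose skew. */
    static long[] loadPerReducer(Map<String, Long> recordsPerType, int numReducers) {
        long[] load = new long[numReducers];
        for (Map.Entry<String, Long> e : recordsPerType.entrySet()) {
            load[partition(e.getKey(), numReducers)] += e.getValue();
        }
        return load;
    }
}
```

Feeding this one very large data type and several small ones shows a single reducer holding
nearly all the records, which is the straggler the issue describes.
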
I am thinking of changing the default Chukwa demux implementation to case #1 and restructuring
the current demux as Archive Organizer.  Any suggestions or objections?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
