hadoop-common-user mailing list archives

From: Adrian CAPDEFIER <chivas314...@gmail.com>
Subject: Re: chaining (the output of) jobs/ reducers
Date: Tue, 17 Sep 2013 13:19:38 GMT
Thanks Bryan. This is great stuff!


On Thu, Sep 12, 2013 at 8:49 PM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:

> Hey Adrian,
>
> To clarify, the replication happens on *write*.  So as you write output
> from the reducer of Job A, you are writing into hdfs.  Part of that write
> path is replicating the data to 2 additional hosts in the cluster (local +
> 2; this is configured by the dfs.replication configuration value).  So by
> the time Job B starts, hadoop has 3 options for where each mapper can run
> and be data-local.  Hadoop will do all the work to try to make everything
> as local as possible.
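
A quick sketch of that setting: the cluster-wide default normally lives in
hdfs-site.xml, but it can also be overridden per job on the Configuration.
The value 3 below is just the stock default, not something specific to this
thread.

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// The cluster default usually comes from hdfs-site.xml (dfs.replication = 3
// out of the box). Setting it on the job Configuration overrides it for
// files written with this conf, e.g. Job A's reducer output.
conf.setInt("dfs.replication", 3);
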
>
> You'll be able to see from the counters on the job how successful hadoop
> was at placing your mappers.  See the counters "Data-local map tasks" and
> "Rack-local map tasks".  Rack-local maps are those where hadoop was not
> able to place the mapper on the same host as the data, but was at least
> able to keep it within the same rack.
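
Here is a sketch of how those counters can be read from the driver once the
job finishes, assuming Hadoop 2.x, where they are exposed through the
org.apache.hadoop.mapreduce.JobCounter enum (on 1.x the same values show up
in the JobTracker UI under the display names above):

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

// Call after job.waitForCompletion(true) has returned.
static void printMapLocality(Job job) throws Exception {
    Counters counters = job.getCounters();
    long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
    long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
    long launched  = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
    System.out.printf("data-local: %d, rack-local: %d, launched maps: %d%n",
            dataLocal, rackLocal, launched);
}
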
>
> All of this is dependent on a proper topology configuration, both in your
> NameNode and JobTracker.
>
>
> On Thu, Sep 12, 2013 at 3:02 PM, Adrian CAPDEFIER <chivas314159@gmail.com> wrote:
>
>> Thanks Bryan.
>>
>> Yes, I am using hadoop + hdfs.
>>
>> If I understand your point, hadoop tries to start the mapping processes
>> on nodes where the data is local and if that's not possible, then it is
>> hdfs that replicates the data to the mapper nodes?
>>
>> I expected to have to set this up in the code and I completely ignored
>> HDFS; I guess it's a case of not seeing the forest for the trees!
>>
>>
>>
>>  On Thu, Sep 12, 2013 at 6:38 PM, Bryan Beaudreault <bbeaudreault@hubspot.com> wrote:
>>
>>> It really comes down to the following:
>>>
>>> In Job A set mapred.output.dir to some directory X.
>>> In Job B set mapred.input.dir to the same directory X.
>>>
>>> For Job A, do context.write() as normal, and each reducer will create
>>> an output file in mapred.output.dir.  Then in Job B each of those will
>>> correspond to a mapper.
>>>
>>> Of course you need to make sure your input and output formats, as well
>>> as input and output keys/values, match up between the two jobs as well.
>>>
>>> If you are using HDFS, which it seems you are, the directories specified
>>> can be HDFS directories.  In that case, with a replication factor of 3,
>>> each of these output files will exist on 3 nodes.  Hadoop and HDFS will do
>>> the work to ensure that the mappers in the second job do as good a job as
>>> possible to be data or rack-local.
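
For reference, a bare-bones driver showing the two jobs chained through one
HDFS directory, as described above. This is only a sketch using the
org.apache.hadoop.mapreduce API (FileOutputFormat.setOutputPath /
FileInputFormat.addInputPath set the equivalent of mapred.output.dir /
mapred.input.dir under the hood); the mapper/reducer class names and the
paths are placeholders, not anything from this thread.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class ChainDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input        = new Path(args[0]);
    Path intermediate = new Path(args[1]);  // "directory X" shared by both jobs
    Path output       = new Path(args[2]);

    // Job A: does the heavy lifting and writes (key2, value2) pairs into the
    // intermediate directory.
    Job jobA = new Job(conf, "step-1");
    jobA.setJarByClass(ChainDriver.class);
    jobA.setMapperClass(StepOneMapper.class);      // placeholder class
    jobA.setReducerClass(StepOneReducer.class);    // placeholder class
    jobA.setOutputKeyClass(Text.class);
    jobA.setOutputValueClass(Text.class);
    jobA.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(jobA, input);
    FileOutputFormat.setOutputPath(jobA, intermediate);
    if (!jobA.waitForCompletion(true)) {
      System.exit(1);                              // don't start Job B if A failed
    }

    // Job B: reads Job A's output files as its mapper input and combines by key.
    Job jobB = new Job(conf, "step-2");
    jobB.setJarByClass(ChainDriver.class);
    jobB.setMapperClass(StepTwoMapper.class);      // placeholder class
    jobB.setReducerClass(StepTwoReducer.class);    // placeholder class
    jobB.setOutputKeyClass(Text.class);
    jobB.setOutputValueClass(Text.class);
    jobB.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(jobB, intermediate);
    FileOutputFormat.setOutputPath(jobB, output);
    System.exit(jobB.waitForCompletion(true) ? 0 : 1);
  }
}
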
>>>
>>>
>>> On Thu, Sep 12, 2013 at 12:35 PM, Adrian CAPDEFIER <chivas314159@gmail.com> wrote:
>>>
>>>> Thank you, Chris. I will look at Cascading and Pig, but for starters
>>>> I'd prefer to keep everything as close to the hadoop libraries as
>>>> possible.
>>>>
>>>> I am sure I am overlooking something basic as repartitioning is a
>>>> fairly common operation in MPP environments.
>>>>
>>>>
>>>> On Thu, Sep 12, 2013 at 2:39 PM, Chris Curtin <curtin.chris@gmail.com> wrote:
>>>>
>>>>> If you want to stay in Java look at Cascading. Pig is also helpful. I
>>>>> think there are others (Spring integration maybe?) but I'm not familiar
>>>>> with them enough to make a recommendation.
>>>>>
>>>>> Note that with Cascading and Pig you don't write 'map reduce'; you
>>>>> write logic and they map it to the various mapper/reducer steps
>>>>> automatically.
>>>>>
>>>>> Hope this helps,
>>>>>
>>>>> Chris
>>>>>
>>>>>
>>>>> On Thu, Sep 12, 2013 at 9:36 AM, Adrian CAPDEFIER <chivas314159@gmail.com> wrote:
>>>>>
>>>>>> Howdy,
>>>>>>
>>>>>> My application requires 2 distinct processing steps (reducers) to be
>>>>>> performed on the input data. The first operation changes the key
>>>>>> values and, as a result, records that had different keys in step 1
>>>>>> can end up having the same key in step 2.
>>>>>>
>>>>>> The heavy lifting of the operation is in step 1, and step 2 only
>>>>>> combines records whose keys were changed.
>>>>>>
>>>>>> In short the overview is:
>>>>>> Sequential file -> Step 1 -> Step 2 -> Output.
>>>>>>
>>>>>>
>>>>>> To implement this in hadoop, it seems that I need to create a
>>>>>> separate job for each step.
>>>>>>
>>>>>> Now I assumed there would be some sort of job management under hadoop
>>>>>> to link Job 1 and 2, but the only thing I could find was related to
>>>>>> job scheduling, and nothing on how to synchronize the input/output of
>>>>>> the linked jobs.
>>>>>>
>>>>>>
>>>>>>
>>>>>> The only crude solution that I can think of is to use a temporary
>>>>>> file under HDFS, but even so I'm not sure if this will work.
>>>>>>
>>>>>> The overview of the process would be:
>>>>>> Sequential Input (lines) => Job A[Mapper (key1, value1) =>
>>>>>> ChainReducer (key2, value2)] => Temporary file => Job B[Mapper (key2,
>>>>>> value2) => Reducer (key2, value3)] => output.
>>>>>>
>>>>>> Is there a better way to pass the output from Job A as input to Job B
>>>>>> (e.g. using network streams or some built-in Java classes that don't
>>>>>> do disk i/o)?
>>>>>>
>>>>>>
>>>>>>
>>>>>> The temporary file solution will work in a single node configuration,
>>>>>> but I'm not sure about an MPP config.
>>>>>>
>>>>>> Let's say Job A runs on nodes 0 and 1 and Job B runs on nodes 2 and 3,
>>>>>> or both jobs run on all 4 nodes - will HDFS be able to automagically
>>>>>> redistribute the records between nodes, or does this need to be coded
>>>>>> somehow?
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
