hadoop-common-user mailing list archives

From "Runping Qi" <runp...@yahoo-inc.com>
Subject RE: Stackoverflow
Date Tue, 03 Jun 2008 19:00:49 GMT

Chris,

Your version will use LongWritable as the map output key type, which
changes the nature of the job completely. You should use

${hadoop} jar hadoop-0.17-examples.jar sort -m <num maps> \
    -r 88 \
    -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
    -outKey org.apache.hadoop.io.Text \
    -outValue org.apache.hadoop.io.Text \
    <input dir> <output dir (ignored)>

instead.
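[Editor's note: the distinction above can be sketched in plain Java. This is a minimal illustration of the tab-splitting rule that KeyValueTextInputFormat-style parsing applies, not the Hadoop class itself; the class name and helper below are hypothetical. Everything before the first tab becomes the Text key and the rest becomes the value, so the map output key stays Text rather than a LongWritable byte offset.]

```java
// Sketch of KeyValueTextInputFormat-style line parsing (illustration only,
// not the Hadoop implementation): split each line at the first tab.
public class KeyValueSplit {
    // Returns {key, value}; a line with no tab becomes the key with an
    // empty value.
    static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = split("user42\tsome payload");
        System.out.println(kv[0] + " | " + kv[1]); // user42 | some payload
    }
}
```

By contrast, TextInputFormat emits the line's byte offset (a LongWritable) as the key and the whole line as the value, which is why the sort then compares LongWritables instead of Text.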

Runping

> -----Original Message-----
> From: Chris Douglas [mailto:chrisdo@yahoo-inc.com]
> Sent: Tuesday, June 03, 2008 11:35 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Stackoverflow
> 
> >> By "not exactly small", do you mean each line is long or that there
> >> are many records?
> >
> > Well, not small in the sense that even if I could get my boss to
> > allow me to give you the data, transferring it might be painful.
> > (E.g. the job that aborted had about 12M lines with ~2.6GB of data
> > => the lines are not really long, but longer than 80 chars)
> 
> Ah, I see. Would it be possible to run the Java sort example over
> your data? It would be helpful to verify that this is not specific to
> streaming.
> 
> ${hadoop} jar hadoop-0.17-examples.jar sort -m <num maps> \
>    -r 88 \
>    -inFormat org.apache.hadoop.mapred.TextInputFormat \
>    -outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
>    -outKey org.apache.hadoop.io.LongWritable \
>    -outValue org.apache.hadoop.io.Text \
>    <input dir> <output dir (ignored)>
> 
> This should be close to streaming with cat as the mapper.
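[Editor's note: for reference, a rough sketch of the streaming job being approximated, with cat as the identity mapper. The streaming jar path and exact option spellings are assumptions for a 0.17-era layout; adjust them to your install.]

```shell
# Rough streaming equivalent of the sort example above:
# identity mapper (cat), 88 reduce tasks.
# The jar path is an assumed 0.17 layout; adjust as needed.
${hadoop} jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input <input dir> \
    -output <output dir> \
    -mapper cat \
    -numReduceTasks 88
```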
> 
> >> util.QuickSort is only used on the map side, so this shouldn't have
> >> anything to do with the reduce. Is it always and only the *last* map
> >
> > Nope, although sometimes it happens earlier.
> 
> Are the failing splits always the same when you re-run your job?
> Though distributing the full dataset may not be feasible, if there
> are splits that fail consistently then we might be able to work from
> that.
> 
> >> task that fails? If I sent you a patch that would print a trace with
> >> the partitions, would you mind running it? Do you have any other
> >> settings that differ from the defaults? -C
> >
> > If you tell me how to apply it, I'm happy to. (I'm not the biggest
> > Java hotshot on this planet; I'm just using the provided 0.17.0
> > jars. I guess I would have to patch the source and run ant. On all
> > nodes or just the control node?)
> 
> Unfortunately, it would need to be deployed to all the TaskTrackers,
> and it would be pretty invasive (i.e. I was planning on logging all
> the offsets from the sort as the stack unwinds from the exception).
> I'll test something and send it to you, and if it's not too much
> trouble you can try it.
> 
> > My hadoop-site.xml:
> > [snip]
> 
> Nothing suspect there. -C
