hadoop-common-user mailing list archives

From Andreas Kostyrka <andr...@kostyrka.org>
Subject Re: Stackoverflow
Date Tue, 03 Jun 2008 19:01:09 GMT
On Tuesday 03 June 2008 20:35:03 Chris Douglas wrote:
> >> By "not exactly small", do you mean each line is long or that there
> >> are many records?
> >
> > Well, not small in the sense that, even if I could get my boss to
> > allow me to give you the data, transferring it might be painful.
> > (E.g. the job that aborted had about 12M lines with ~2.6GB of data,
> > so the lines are not really long, but longer than 80 chars.)
> Ah, I see. Would it be possible to run the Java sort example over
> your data? It would be helpful to verify that this is not specific to
> streaming.
> ${hadoop} jar hadoop-0.17-examples.jar sort -m <num maps> \
>    -r 88 \
>    -inFormat org.apache.hadoop.mapred.TextInputFormat \
>    -outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
>    -outKey org.apache.hadoop.io.LongWritable \
>    -outValue org.apache.hadoop.io.Text \
>    <input dir> <output dir (ignored)>
> This should be close to streaming with cat as the mapper.
> >> util.QuickSort is only used on the map side, so this shouldn't have
> >> anything to do with the reduce. Is it always and only the *last* map
> >
> > Nope, although sometimes it happens earlier.
> Is it always the same splits when you re-run your job? Though
> distributing the full dataset may not be feasible, if there are
> splits that fail consistently then we might be able to work from that.

Who decides the splits? (If it's deterministic, it should be the same)
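
For reference, a minimal Python sketch of how file-based splits could be derived. It assumes the input format sizes splits roughly as max(minSize, min(goalSize, blockSize)) and lets the last split run up to 10% over; the function names, the slop factor, and the default values here are illustrative assumptions, not Hadoop's actual code. The point is only that such a computation depends on nothing but the file length, the block size, and the requested number of maps, so the same inputs give the same splits on every run.

```python
def compute_split_size(goal_size, min_size, block_size):
    # Assumed formula: never below min_size, never above one block.
    return max(min_size, min(goal_size, block_size))


def file_splits(file_length, num_maps_hint, block_size, min_size=1, slop=1.1):
    """Return a list of (offset, length) pairs covering one file."""
    goal_size = file_length // max(num_maps_hint, 1)
    split_size = compute_split_size(goal_size, min_size, block_size)
    splits = []
    remaining = file_length
    # Cut full-size splits while clearly more than one split is left.
    while remaining / split_size > slop:
        splits.append((file_length - remaining, split_size))
        remaining -= split_size
    # Whatever is left (possibly slightly oversized) is the last split.
    if remaining > 0:
        splits.append((file_length - remaining, remaining))
    return splits
```

Calling file_splits twice with the same arguments returns identical lists, which is what would make a consistently failing split identifiable by its offset.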

Well, I'm currently checking whether rerunning the job with the same command
on the same data reproduces the bug.

After that, I'll try the command you proposed above.

And after that, I'll try and see if I can manage to produce a simpler data set 
to reproduce it.
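
One way to shrink the data set would be to bisect it: keep whichever half of the input still triggers the failure and repeat. The sketch below assumes a caller-supplied predicate, here called fails, that runs the job on a list of lines and reports whether it crashed; that predicate is hypothetical and would have to wrap the real job submission. The sketch also stops early if the failure only shows up when both halves are present.

```python
def shrink(lines, fails):
    """Reduce `lines` by repeatedly keeping a half that still fails.

    `fails` is a hypothetical callable: it takes a list of lines and
    returns True if the job crashes on that input.
    """
    while len(lines) > 1:
        mid = len(lines) // 2
        first, second = lines[:mid], lines[mid:]
        if fails(first):
            lines = first
        elif fails(second):
            lines = second
        else:
            # Neither half fails alone; cannot shrink further this way.
            break
    return lines
```

With 12M lines this needs on the order of two dozen job runs in the best case, so it is only practical if each run on a subset is reasonably quick.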

> >> task that fails? If I sent you a patch that would print a trace with
> >> the partitions, would you mind running it? Do you have any other
> >> settings that differ from the defaults? -C
> >
> > If you tell me how to apply it, I'm happy to. (I'm not the biggest
> > Java hotshot on this planet; I'm just using the provided 0.17.0
> > jars. I guess I would have to patch the source and run ant. On all
> > nodes or just the control?)
> Unfortunately, it would need to be deployed to all the TaskTrackers,
Well, that's not the biggest problem: I need to deploy my Python stuff to all
nodes too, so I guess it's one more item for the big rsync run.

> and it would be pretty invasive (i.e. I was planning on logging all
> the offsets from the sort as the stack unwinds from the exception).
> I'll test something and send it to you, and if it's not too much
> trouble you can try it.

Happy to. Might be that I'll do it tomorrow, so I have longer to observe and 
revert if anything unhappy happens. (I only have this one production cluster, 
and it needs to continue munching production data).
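
To illustrate the tracing idea Chris describes (logging the sort offsets as the stack unwinds from the exception), here is a rough Python sketch. It is not Hadoop's util.QuickSort and not the patch under discussion: it is a deliberately naive recursive quicksort, written for this example, that records each frame's (lo, hi) range when a RecursionError passes through it. The first-element pivot is chosen on purpose so that already-sorted input produces degenerate partitions and deep recursion.

```python
def quicksort(a, lo, hi, trace):
    """Sort a[lo:hi] in place; on stack overflow, record offsets while unwinding."""
    try:
        if hi - lo < 2:
            return
        pivot = a[lo]  # naive pivot: worst case on already-sorted input
        i = lo + 1
        for j in range(lo + 1, hi):
            if a[j] < pivot:
                a[i], a[j] = a[j], a[i]
                i += 1
        # Move the pivot between the smaller and the larger elements.
        a[lo], a[i - 1] = a[i - 1], a[lo]
        quicksort(a, lo, i - 1, trace)
        quicksort(a, i, hi, trace)
    except RecursionError:
        # Log this frame's range, then let the error keep propagating.
        trace.append((lo, hi))
        raise


def offsets_on_overflow(data):
    """Return the (lo, hi) ranges seen while unwinding, or [] if the sort succeeded."""
    trace = []
    try:
        quicksort(data, 0, len(data), trace)
    except RecursionError:
        pass
    return trace
```

The recorded ranges run from the innermost frames outward, so the last entry is the full range and the earlier ones show how lopsided the partitions were. A run of ranges that each shrink by a single element would point at a degenerate pivot choice for that particular input.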

