hadoop-common-user mailing list archives

From Chris Douglas <chri...@yahoo-inc.com>
Subject Re: Stackoverflow
Date Tue, 03 Jun 2008 20:04:44 GMT
Ah; you're right, of course. Sorry about that. -C

On Jun 3, 2008, at 12:00 PM, Runping Qi wrote:

>
> Chris,
>
> Your version will use LongWritable as the map output key type, which
> changes the job nature completely. You should use
> ${hadoop} jar hadoop-0.17-examples.jar sort -m <num maps> \
>    -r 88 \
>    -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
>    -outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
>    -outKey org.apache.hadoop.io.Text \
>    -outValue org.apache.hadoop.io.Text \
>    <input dir> <output dir (ignored)>
> instead.
>
> Runping
>
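[A rough sketch of the streaming job this sort invocation is meant to approximate — i.e. cat as both mapper and reducer, with the same 88 reduces. The streaming jar path and the -jobconf spelling follow the 0.17-era contrib layout and are assumptions, not taken from this thread:]

```shell
# Hypothetical equivalent streaming run (jar path assumed from the
# 0.17 contrib layout; adjust to your install).
# KeyValueTextInputFormat splits each line at the first tab, so the
# map output key is Text -- matching what streaming does, which is
# why Runping's variant preserves the job's nature.
${hadoop} jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input <input dir> \
    -output <output dir> \
    -mapper cat \
    -reducer cat \
    -jobconf mapred.reduce.tasks=88
```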
>> -----Original Message-----
>> From: Chris Douglas [mailto:chrisdo@yahoo-inc.com]
>> Sent: Tuesday, June 03, 2008 11:35 AM
>> To: core-user@hadoop.apache.org
>> Subject: Re: Stackoverflow
>>
>>>> By "not exactly small", do you mean each line is long or that there
>>>> are many records?
>>>
>>> Well, not small in the sense that, even if I could get my boss to
>>> allow me to give you the data, transferring it might be painful.
>>> (E.g. the job that aborted had about 12M lines with ~2.6GB of data
>>> => the lines are not really long, but longer than 80 chars.)
>>
>> Ah, I see. Would it be possible to run the Java sort example over
>> your data? It would be helpful to verify that this is not specific to
>> streaming.
>>
>> ${hadoop} jar hadoop-0.17-examples.jar sort -m <num maps> \
>>    -r 88 \
>>    -inFormat org.apache.hadoop.mapred.TextInputFormat \
>>    -outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
>>    -outKey org.apache.hadoop.io.LongWritable \
>>    -outValue org.apache.hadoop.io.Text \
>>    <input dir> <output dir (ignored)>
>>
>> This should be close to streaming with cat as the mapper.
>>
>>>> util.QuickSort is only used on the map side, so this shouldn't have
>>>> anything to do with the reduce. Is it always and only the *last* map
>>>
>>>
>>> Nope, although sometimes it happens earlier.
>>
>> Is it always the same splits when you re-run your job? Though
>> distributing the full dataset may not be feasible, if there are
>> splits that fail consistently then we might be able to work from  
>> that.
>>
>>>> task that fails? If I sent you a patch that would print a trace with
>>>> the partitions, would you mind running it? Do you have any other
>>>> settings that differ from the defaults? -C
>>>
>>> If you tell me how to apply it, I'm happy to. (I'm not the biggest
>>> Java hotshot on this planet; I'm just using the provided 0.17.0
>>> jars. I guess I would have to patch the source and run ant. On all
>>> nodes or just the control node?)
>>
>> Unfortunately, it would need to be deployed to all the TaskTrackers,
>> and it would be pretty invasive (i.e. I was planning on logging all
>> the offsets from the sort as the stack unwinds from the exception).
>> I'll test something and send it to you, and if it's not too much
>> trouble you can try it.
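[The apply-and-deploy workflow being discussed would look roughly like the sketch below. The patch file name is hypothetical, and the exact jar name produced by ant varies by build; this is an illustration of the steps, not a command sequence from the thread:]

```shell
# From the top of an unpacked Hadoop 0.17.0 source tree.
# "sort-trace.patch" is a hypothetical name for the patch Chris
# proposes to send; -p0 matches patches generated at the tree root.
patch -p0 < sort-trace.patch

# Rebuild the core jar with ant (output lands under build/;
# the exact jar file name depends on the build version string).
ant jar

# Because the change is in the map-side sort, the rebuilt jar must
# replace the core jar on every node running a TaskTracker, and the
# TaskTrackers restarted so they pick it up.
```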
>>
>>> My hadoop-site.xml:
>>> [snip]
>>
>> Nothing suspect, there. -C
