hadoop-common-user mailing list archives

From Ross Boucher <bouc...@apple.com>
Subject Re: Running Custom Job
Date Thu, 20 Sep 2007 17:04:02 GMT
Since I was doing terms instead of just words, it was slightly more
complicated than just calling sort.
A simple solution might have been to reverse the order in which the  
hadoop job outputs results, i.e.

value key    instead of      key 	value

But instead, I used the following set of commands to process the file  
and sort it appropriately:

cat * | tr '\t' ' ' | sed 's/^\(.*\) \([0-9]*\)$/\2 \1/' | sort -n >  

Which basically takes the number at the end of each line, and puts it  
at the beginning, then calls sort.
Works like a charm.  Thanks for the help.
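
For illustration, here is a minimal sketch of that pipeline run on a few made-up lines rather than the real job output. The sample terms and counts, and the function names sample, count_first and sort_in_place, are assumptions for this example only. The second function is an alternative, not what was used above: it asks sort to key on the tab-separated count field directly, which leaves each line in its original term-then-count form.

```shell
# Hypothetical sample of job output: term<TAB>count, where a term may
# contain spaces (these lines are invented for the example).
sample() {
  printf 'hello world\t3\n'
  printf 'foo\t10\n'
  printf 'the quick brown\t1\n'
}

# The pipeline from the message, reading stdin instead of `cat *`:
# turn tabs into spaces, move the trailing number to the front of the
# line, then sort numerically on that leading number.
count_first() {
  tr '\t' ' ' | sed 's/^\(.*\) \([0-9]*\)$/\2 \1/' | sort -n
}

# Alternative (an assumption, not the method used above): keep the
# lines as they are and sort numerically on the second tab-separated field.
sort_in_place() {
  sort -t "$(printf '\t')" -k2,2n
}

sample | count_first
# prints:
# 1 the quick brown
# 3 hello world
# 10 foo
```

The greedy `\(.*\)` in the sed expression is what lets multi-word terms through: it consumes everything up to the last space, so only the final run of digits is treated as the count.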

On Sep 19, 2007, at 4:03 PM, Ted Dunning wrote:

> I use something like this:
>   bin/hadoop -getmerge <output-directory> | sort +1n
> This works very well because the final counts are relatively small
> compared to the original input.  There is nothing that says you can't mix MR
> programming with conventional code.
> On 9/19/07 3:42 PM, "Ross Boucher" <boucher@apple.com> wrote:
>> This problem seems to have gone away by itself.
>> Now I have my job running, but I'm not entirely sure how to get the
>> output into something useful to me.
>> I'm counting word frequencies, and I would like the output sorted by
>> frequency, rather than alphabetically.  I would also like the final
>> output to be in one file, though I'm not sure if this is possible
>> given that it's computed separately.  I suppose it wouldn't be too
>> difficult to post process the files to get them sorted the way I
>> would like and in one file, but if anyone has some tips on how to do
>> this in my job itself, that would be great.
>> Thanks.
>> Ross Boucher
>> boucher@apple.com
>> On Sep 19, 2007, at 2:59 PM, Owen O'Malley wrote:
>>> On Sep 19, 2007, at 2:30 PM, Ross Boucher wrote:
>>>> Specifically, the job starts, and then each task that is scheduled
>>>> fails, with the following error:
>>>> Error initializing task_0007_m_000063_0:
>>>> java.io.IOException: /DFS_ROOT/tmp/mapred/system/submit_i849v1/job.xml: No such file or directory
>>> Look at the configuration of your mapred.system.dir. It MUST be the
>>> same on both the cluster and submitting node. Note that
>>> mapred.system.dir must be in the default file system, which must
>>> also be the same on the cluster and submitting node. Note that
>>> there is a jira (HADOOP-1100) that would have the cluster pass the
>>> system directory to the client, which would get rid of this issue.
>>> -- Owen
