hadoop-user mailing list archives

From Yanbo Liang <yanboha...@gmail.com>
Subject Re: Finding mean and median python streaming
Date Tue, 02 Apr 2013 09:14:22 GMT
How many Reducers did you start for this job?
If you start several Reducers, the job will produce multiple output files
named part-*****, and each part contains only the local mean and median
values for that Reducer's partition.

There are two kinds of solutions:
1. Call setNumReduceTasks(1) to set the number of Reducers to 1. The job will
then produce a single output file, and each distinct key will get exactly one
mean and median value.
2. See org.apache.hadoop.examples.WordMedian in the Hadoop source code. It
processes all the output files produced by the multiple Reducers with a local
function and produces the final result, along the lines of the sketch below.
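
For a streaming job there is no driver class on which to call
setNumReduceTasks, so option 1 usually comes down to passing
-D mapred.reduce.tasks=1 to the streaming jar. For option 2, here is a rough
local sketch in the spirit of WordMedian (not the actual class). It assumes
the Reducers simply re-emit "id,value" lines and that the job output has
already been copied from HDFS to a local directory named job_output/ (a
made-up name); it then computes one final mean and median per id across all
part-***** files.

import glob
import os
from collections import defaultdict

def mean(values):
    return sum(values) / float(len(values)) if values else 0

def median(values):
    sorts = sorted(values)
    length = len(sorts)
    if not length % 2:
        return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
    return sorts[length / 2]

# gather every value for every id across all Reducer outputs,
# so the statistics are global rather than per-partition
data_dict = defaultdict(list)
for path in glob.glob(os.path.join("job_output", "part-*")):
    with open(path) as part_file:
        for line in part_file:
            line = line.strip()
            if not line:
                continue
            serial_id, duration = line.split(",", 1)
            data_dict[serial_id].append(float(duration))

# one mean and one median per id, no matter how many Reducers ran
for k, v in sorted(data_dict.items()):
    print "%s,%s,%s" % (k, mean(v), median(v))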

BR
Yanbo


2013/4/2 jamal sasha <jamalshasha@gmail.com>

> pinging again.
> Let me rephrase the question.
> If my data is like:
> id, value
>
> And I want to find the average "value" for each id, how can I do that using
> hadoop streaming?
> I am sure it should be very straightforward, but apparently my
> understanding of how code works in hadoop streaming is not right.
> I would really appreciate it if someone could help me with this query.
> Thanks
>
>
>
> On Mon, Apr 1, 2013 at 2:27 PM, jamal sasha <jamalshasha@gmail.com> wrote:
>
>> data_dict is declared globally as
>> data_dict = defaultdict(list)
>>
>>
>> On Mon, Apr 1, 2013 at 2:25 PM, jamal sasha <jamalshasha@gmail.com> wrote:
>>
>>> Very dumb question..
>>> I have data as following
>>> id1, value
>>> 1, 20.2
>>> 1,20.4
>>> ....
>>>
>>> I want to find the mean and median of the values for id1.
>>> I am using python hadoop streaming..
>>> mapper.py
>>>
>>> import os
>>> import sys
>>>
>>> for line in sys.stdin:
>>>     try:
>>>         # strip the trailing line separator
>>>         line = line.rstrip(os.linesep)
>>>         tokens = line.split(",")
>>>         print '%s,%s' % (tokens[0], tokens[1])
>>>     except Exception:
>>>         continue
>>>
>>>
>>> reducer.py
>>>
>>> import os
>>> import sys
>>> from collections import defaultdict
>>>
>>> # declared globally, as noted in the follow-up message
>>> data_dict = defaultdict(list)
>>>
>>> def mean(data_list):
>>>     return sum(data_list) / float(len(data_list)) if len(data_list) else 0
>>>
>>> def median(mylist):
>>>     sorts = sorted(mylist)
>>>     length = len(sorts)
>>>     if not length % 2:
>>>         return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
>>>     return sorts[length / 2]
>>>
>>> for line in sys.stdin:
>>>     try:
>>>         line = line.rstrip(os.linesep)
>>>         serial_id, duration = line.split(",")
>>>         data_dict[serial_id].append(float(duration))
>>>     except Exception:
>>>         pass
>>>
>>> for k, v in data_dict.items():
>>>     print "%s,%s,%s" % (k, mean(v), median(v))
>>>
>>>
>>>
>>> I am expecting a single mean and median for each key,
>>> but I see id1 duplicated with different mean and median values.
>>> Any suggestions?
>>>
>>
>>
>
