Hi,
I have a question.
Why do I need to set the number of reducers to 1?
Since all the keys are sorted, shouldn't most of the keys go to the same reducer?
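For context on why this matters: in MapReduce, sorting happens inside each reducer after the shuffle, and which reducer a key lands on is decided by the partitioner (by default, roughly hash(key) mod numReducers). Below is a minimal Python sketch that mimics that routing; it is my own illustration of a HashPartitioner-style scheme, not actual Hadoop code:

```python
def partition(key, num_reducers):
    # Toy stand-in for Hadoop's HashPartitioner: a Java-style
    # string hash, masked non-negative, modulo the reducer count.
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return (h & 0x7FFFFFFF) % num_reducers

# Every occurrence of the same key routes to the same reducer,
# but *different* keys are spread across reducers, so each reducer
# only ever sees (and sorts) its own subset of the keys.
keys = ["id1", "id2", "id3", "id1"]
print([partition(k, 3) for k in keys])
```

So with more than one reducer the keys are indeed sorted, but only within each reducer's partition, and each part-* output file reports its own "local" results. One more detail worth checking: Hadoop streaming by default treats everything up to the first tab as the key, so a mapper that emits comma-separated lines makes the whole line the key, which could also explain the same id showing up in several partitions.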
On Tue, Apr 2, 2013 at 2:14 AM, Yanbo Liang <yanbohappy@gmail.com> wrote:
> How many Reducers did you start for this job?
> If you start many Reducers for this job, it will produce multiple output
> files named part-*****,
> and each part contains only the local mean and median values of that
> specific Reducer partition.
>
> There are two kinds of solutions:
> 1. Call setNumReduceTasks(1) to set the Reducer number to 1; the job will
> then produce only one output file, and each distinct key will produce
> only one mean and median value.
> 2. See org.apache.hadoop.examples.WordMedian in the Hadoop source code.
> It processes all the output files produced by the multiple Reducers with
> a local function, which produces the final result.
>
> BR
> Yanbo
>
>
> 2013/4/2 jamal sasha <jamalshasha@gmail.com>
>
>> pinging again.
>> Let me rephrase the question.
>> If my data is like:
>> id, value
>>
>> And I want to find the average "value" for each id; how can I do that
>> using Hadoop streaming?
>> I am sure it should be very straightforward, but apparently my
>> understanding of how code works in Hadoop streaming is not right.
>> I would really appreciate it if someone could help me with this query.
>> Thanks
>>
>>
>>
>> On Mon, Apr 1, 2013 at 2:27 PM, jamal sasha <jamalshasha@gmail.com> wrote:
>>
>>> data_dict is declared globally as
>>> data_dict = defaultdict(list)
>>>
>>>
>>> On Mon, Apr 1, 2013 at 2:25 PM, jamal sasha <jamalshasha@gmail.com> wrote:
>>>
>>>> Very dumb question...
>>>> I have data like the following:
>>>> id1, value
>>>> 1, 20.2
>>>> 1,20.4
>>>> ....
>>>>
>>>> I want to find the mean and median of the values for each id.
>>>> I am using Python Hadoop streaming.
>>>> mapper.py
>>>>
>>>> import os
>>>> import sys
>>>>
>>>> for line in sys.stdin:
>>>>     try:
>>>>         # remove leading and trailing whitespace
>>>>         line = line.rstrip(os.linesep)
>>>>         tokens = line.split(",")
>>>>         print '%s,%s' % (tokens[0], tokens[1])
>>>>     except Exception:
>>>>         continue
>>>>
>>>>
>>>> reducer.py
>>>>
>>>> import os
>>>> import sys
>>>> from collections import defaultdict
>>>>
>>>> data_dict = defaultdict(list)
>>>>
>>>> def mean(data_list):
>>>>     return sum(data_list) / float(len(data_list)) if len(data_list) else 0
>>>>
>>>> def median(mylist):
>>>>     sorts = sorted(mylist)
>>>>     length = len(sorts)
>>>>     if not length % 2:
>>>>         return (sorts[length / 2] + sorts[length / 2 - 1]) / 2.0
>>>>     return sorts[length / 2]
>>>>
>>>> for line in sys.stdin:
>>>>     try:
>>>>         line = line.rstrip(os.linesep)
>>>>         serial_id, duration = line.split(",")
>>>>         data_dict[serial_id].append(float(duration))
>>>>     except Exception:
>>>>         pass
>>>>
>>>> for k, v in data_dict.items():
>>>>     print "%s,%s,%s" % (k, mean(v), median(v))
>>>>
>>>>
>>>>
>>>> I am expecting a single mean and median for each key,
>>>> but I see id1 duplicated with different mean and median values.
>>>> Any suggestions?
>>>>
>>>
>>>
>>
>
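One more option for the streaming case, since the reducer above buffers everything in a global dict: Hadoop streaming hands each reducer its input already sorted by key, so consecutive lines with the same id can be grouped as they arrive. Here is a minimal sketch of such a reducer (my own illustration, not the WordMedian approach; it assumes comma-separated "id,value" lines sorted by id, and it still needs a single reducer for globally correct results):

```python
import sys
from itertools import groupby

def mean(values):
    return sum(values) / float(len(values)) if values else 0

def median(values):
    s = sorted(values)
    n = len(s)
    if n % 2 == 0:
        return (s[n // 2] + s[n // 2 - 1]) / 2.0
    return s[n // 2]

def reduce_stream(lines):
    # Input arrives sorted by key, so records for one id are
    # consecutive and groupby() can emit one result line per id
    # without holding the whole input in memory.
    parsed = (line.strip().split(",") for line in lines if line.strip())
    for key, group in groupby(parsed, key=lambda fields: fields[0]):
        values = [float(v) for _, v in group]
        yield "%s,%s,%s" % (key, mean(values), median(values))

# In the actual job the reducer would consume sys.stdin:
#     for out in reduce_stream(sys.stdin):
#         print(out)

# Quick local check with a small sorted sample:
print(list(reduce_stream(["1,20.2", "1,20.4", "2,5.0"])))
```

This emits each id exactly once per reducer and avoids keeping more than one key's values in memory at a time; with multiple reducers the results would still be per-partition, which is why the single-reducer setting (or a local merge step) remains necessary.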
