pig-user mailing list archives

From "Winkler, Robert (Civ, ARL/CISD)" <robert.wink...@us.army.mil>
Subject RE: Reducers slowing down? (UNCLASSIFIED)
Date Mon, 08 Mar 2010 19:58:47 GMT
Classification: UNCLASSIFIED
Caveats: NONE

The Hadoop administration page indicated that it was running 30 reducers for
the CROSS, but as of this morning only 17 had completed, with 13 still pending
(which is where it stood on Friday). I've upped it to 300 reducers (one less
than the number of map tasks). I'll see what that does.
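
For reference, that change amounts to the following on the CROSS from the
script quoted below (a sketch - everything else in the script stays the same):

-- Raise reduce parallelism on the CROSS from 30 to 300
ToCompare = CROSS Actors, People PARALLEL 300;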


Thanks!
-----Original Message-----
From: Mridul Muralidharan [mailto:mridulm@yahoo-inc.com] 
Sent: Friday, March 05, 2010 9:39 PM
To: pig-user@hadoop.apache.org
Cc: Thejas Nair; Winkler, Robert (Civ, ARL/CISD)
Subject: Re: Reducers slowing down? (UNCLASSIFIED)

On Saturday 06 March 2010 04:47 AM, Thejas Nair wrote:
> I am not sure why the rate at which output is generated is slowing down.
> But cross in pig is not optimized - it uses only one reducer. (a major
> limitation if you are trying to process lots of data with a large cluster!)


CROSS is not supposed to use a single reducer - GRCross is parallel in 
pig, last time we checked (a while back though).
That it is parallel does not mean it is cheap - it is still pretty darn
expensive.

Given this, the suggestion below might not help? (A sketch of it appears at
the end of this message.)


Robert, what about using a higher value of PARALLEL for CROSS? (Much higher
than the number of nodes, if required.)

Regards,
Mridul

>
> You can try using skewed join instead - project a constant in both streams
> and then join on that.
>
>
> ToCompare = join Actors by 1, People by 1 using 'skewed' PARALLEL 30;
>
> I haven't tried this on a very large dataset; I am interested in knowing how
> this compares if you try it out.
>
> -Thejas
>
>
>
>
> On 3/5/10 9:48 AM, "Winkler, Robert  (Civ, ARL/CISD)"
> <robert.winkler@us.army.mil>  wrote:
>
>> Classification: UNCLASSIFIED
>>
>> Caveats: NONE
>>
>> Hello, I'm using Pig 0.6.0, running the following script on a 27-datanode
>> cluster running Red Hat Enterprise Linux 5.4:
>>
>>   -- Holds the Pig UDF wrapper around the SecondString SoftTFIDF function
>>
>> REGISTER /home/CandidateIdentification.jar;
>>
>> -- SecondString itself
>>
>> REGISTER /home/secondstring-20060615.jar;
>>
>> -- |People| ~ 62,500,000 from the English GigaWord 4th Edition
>>
>> People = LOAD '/data/UniquePeoplePerStory' USING PigStorage(',') AS
>> (file:chararray, name:chararray);
>>
>> -- |Actors| ~ 8,000 from the Stanford Movie Database
>>
>> Actors = LOAD '/data/Actors' USING PigStorage(',') AS (actor:chararray);
>>
>> -- |ToCompare| ~ 500,000,000,000
>>
>> ToCompare = CROSS Actors, People PARALLEL 30;
>>
>>
>>
>> -- Score 'em and store 'em
>>
>> Results = FOREACH ToCompare GENERATE $0, $1, $2,
>> ARL.CandidateIdentificationUDF.Similarity($2, $0);
>>
>> STORE Results INTO '/data/ScoredPeople' USING PigStorage(',');
>>
>> The first 100,000,000,000 reduce output records were produced in some 25
>> hours. But after 75 hours it has produced a total of 140,000,000,000
>> (instead of the 300,000,000,000 I was extrapolating) and seems to be
>> producing them at a slower and slower rate. What is going on? Did I screw
>> something up?
>>
>> Thanks,
>>
>> Robert
>>
>> Classification: UNCLASSIFIED
>>
>> Caveats: NONE
>>
>>
>
>
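
For reference, the constant-key skewed join suggested above would look roughly
like this (a sketch only - the key name 'k' and the intermediate relation names
are illustrative, and I have not run it):

-- Project a constant key into both relations, then do a skewed join on it.
-- Every row shares the same key, so the skewed join can spread the work for
-- that single hot key across multiple reducers instead of one.
ActorsK = FOREACH Actors GENERATE 1 AS k, actor;
PeopleK = FOREACH People GENERATE 1 AS k, file, name;
ToCompare = JOIN ActorsK BY k, PeopleK BY k USING 'skewed' PARALLEL 30;

Note that the extra key columns shift the field positions, so the $0/$1/$2
references in the downstream FOREACH would need adjusting.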

Classification: UNCLASSIFIED
Caveats: NONE


