hadoop-mapreduce-user mailing list archives

From Ashutosh Chauhan <ashutosh.chau...@gmail.com>
Subject Re: cross product of two files using MapReduce - pls suggest
Date Wed, 19 Jan 2011 18:53:11 GMT
Pig has a built-in CROSS operator.

 grunt> a = load 'file1';
 grunt> b = load 'file2';
 grunt> c = cross a,b;
 grunt> store c into 'file3';

 Ashutosh
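
Pig's CROSS pairs every tuple of the first relation with every tuple of the second, so the output size is |a| x |b|. A minimal stand-in for that semantics in plain Python (toy data; this is only an illustration of the pairing, not the Pig runtime):

```python
# Illustrative sketch of CROSS semantics: each output record is one
# tuple from `a` concatenated with one tuple from `b`.
from itertools import product

a = [(1,), (2,)]
b = [("x",), ("y",)]

c = [ta + tb for ta, tb in product(a, b)]
print(c)  # [(1, 'x'), (1, 'y'), (2, 'x'), (2, 'y')]
```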

> On Wed, Jan 19, 2011 at 03:35, Rohit Kelkar <rohitkelkar@gmail.com> wrote:
>> I have two files, A and D, containing (vectorId, vector) on each line.
>> |D| = 100,000 and |A| = 1000. Dimensionality of the vectors = 100
>>
>> Now I want to execute the following
>>
>> for eachItem in A:
>>    for eachElem in D:
>>        dot_product = eachItem * eachElem
>>        save(dot_product)
>>
>>
>> What I tried was to convert file D into a MapFile in (key = vectorId,
>> value = vector) format and set up a Hadoop job such that
>> inputFile = A
>> inputFileFormat = NLineInputFormat
>>
>> pseudo code for the map function:
>>
>> map(key=vectorid, value=myVector):
>>    open(MapFile containing all vectors of D)
>>    for eachElem in MapFile:
>>        dot_product = myVector * eachElem
>>        context.write(dot_product)
>>    close(MapFile containing all vectors of D)
>>
>>
>> I was expecting that sequentially accessing the MapFile would be much
>> faster. When I took some stats on a single node with a smaller dataset
>> where |A| = 100 and |D| = 100,000, what I observed was:
>> total time taken to iterate over the MapFile = 738 secs
>> total time taken to compute the dot products = 11 secs
>>
>> My original intention of speeding up the process using MapReduce is
>> defeated by the I/O time involved in accessing each entry in the
>> MapFile. Are there any other avenues that I could explore?
>>
>
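
For reference, the nested-loop computation described in the quoted message can be sketched in self-contained Python (names like `dot`, `A`, and `D` are illustrative stand-ins for the files of (vectorId, vector) pairs, not Hadoop or Pig APIs):

```python
# Sketch of the cross-product dot-product computation: every vector
# in A is paired with every vector in D.

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

# Toy stand-ins for files A and D: (vectorId, vector) pairs.
A = [("a1", [1.0, 2.0]), ("a2", [0.0, 1.0])]
D = [("d1", [3.0, 4.0]), ("d2", [1.0, 1.0])]

# |A| x |D| results, keyed by the pair of vector ids.
results = {
    (id_a, id_d): dot(va, vd)
    for id_a, va in A
    for id_d, vd in D
}

print(results[("a1", "d1")])  # 1*3 + 2*4 = 11.0
```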
