Pig has a built-in CROSS operator.
grunt> a = load 'file1';
grunt> b = load 'file2';
grunt> c = cross a,b;
grunt> store c into 'file3';
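For reference, here is a minimal local sketch (plain Python, toy data invented for illustration) of what the cross-then-dot-product computation amounts to: CROSS produces every (A record, D record) pair, and the dot product would then be computed per pair, e.g. in a FOREACH with a UDF.

```python
# Sketch of the Cartesian pairing that Pig's CROSS performs, followed by
# a per-pair dot product. Toy data; real input would be the (vectorId,
# vector) records loaded from file1 and file2.
from itertools import product

def dot(u, v):
    # Dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(u, v))

# Hypothetical stand-ins for the (vectorId, vector) records in A and D.
A = {"a1": [1.0, 2.0], "a2": [0.0, 1.0]}
D = {"d1": [3.0, 4.0], "d2": [1.0, 1.0]}

# CROSS pairs every record of A with every record of D.
results = {(ka, kd): dot(va, vd)
           for (ka, va), (kd, vd) in product(A.items(), D.items())}

print(results[("a1", "d1")])  # 1*3 + 2*4 = 11.0
```

Note that CROSS is expensive by nature: with 1,000 x 100,000 records it materializes 100 million pairs, but MapReduce parallelizes that work across the cluster instead of re-reading D once per record of A.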
Ashutosh
> On Wed, Jan 19, 2011 at 03:35, Rohit Kelkar <rohitkelkar@gmail.com> wrote:
>> I have two files, A and D, containing (vectorId, vector) on each line.
>> D = 100,000 and A = 1000. Dimensionality of the vectors = 100
>>
>> Now I want to execute the following
>>
>> for eachItem in A:
>>     for eachElem in D:
>>         dot_product = eachItem * eachElem
>>         save(dot_product)
>>
>>
>> What I tried was to convert file D into a MapFile in (key = vectorId,
>> value = vector) format and set up a Hadoop job such that
>> inputFile = A
>> inputFileFormat = NLineInputFormat
>>
>> pseudo code for the map function:
>>
>> map(key=vectorId, value=myVector):
>>     open(MapFile containing all vectors of D)
>>     for eachElem in MapFile:
>>         dot_product = myVector * eachElem
>>         context.write(dot_product)
>>     close(MapFile containing all vectors of D)
>>
>>
>> I was expecting that sequentially accessing the MapFile would be much
>> faster. When I took some stats on a single node with a smaller dataset
>> (A = 100 and D = 100,000), what I observed was:
>> total time taken to iterate over the MapFile = 738 secs
>> total time taken to compute the dot products = 11 secs
>>
>> My original intention of speeding up the process using MapReduce is
>> defeated because of the I/O time involved in accessing each entry in
>> the MapFile. Are there any other avenues that I could explore?
>>
>
