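Pig has a built-in CROSS operator, which computes the cross product of two relations (every tuple of one paired with every tuple of the other).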
grunt> a = load 'file1';
grunt> b = load 'file2';
grunt> c = cross a,b;
grunt> store c into 'file3';
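
To make concrete what the crossed relation would contain, here is a minimal Python sketch (not Pig) of the same idea on toy data. It assumes each record is a (vectorId, vector) pair as described in the quoted question below; the helper names `dot` and `cross` and the sample vectors are made up for illustration.

```python
from itertools import product

def dot(u, v):
    """Dot product of two equal-length numeric sequences."""
    return sum(x * y for x, y in zip(u, v))

def cross(a, b):
    """Pair every record of a with every record of b (a cross product)."""
    return product(a, b)

# Toy stand-ins for file1 and file2: (vectorId, vector) records.
# These values are illustrative only, not taken from the real data.
a = [("a1", [1.0, 0.0]), ("a2", [0.5, 2.0])]
b = [("d1", [2.0, 3.0]), ("d2", [4.0, 1.0]), ("d3", [0.0, 1.0])]

# One output row per (a, b) pair: len(a) * len(b) rows in total.
results = [(ida, idb, dot(va, vb)) for (ida, va), (idb, vb) in cross(a, b)]
```

With the sizes given in the question (1,000 vectors in A and 100,000 in D), the crossed relation would hold 1,000 x 100,000 = 100 million pairs, so the size of the output is probably worth keeping in mind.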
>> I have two files, A and D, containing (vectorId, vector) on each line.
>> D has 100,000 vectors and A has 1,000. Each vector has 100 dimensions.
>> Now I want to execute the following
>> for eachItem in A:
>>     for eachElem in D:
>>         dot_product = eachItem * eachElem
>>         save(dot_product)
>> What I tried was to convert file D into a MapFile in (key = vectorId,
>> value = vector) format and set up a Hadoop job such that,
>> inputFile = A
>> inputFileFormat = NLineInputFormat
>> pseudo code for the map function:
>>
>> map(key=vectorid, value=myVector):
>>     open(MapFile containing all vectors of D)
>>     for eachElem in MapFile:
>>         dot_product = myVector * eachElem
>>         context.write(dot_product)
>>     close(MapFile containing all vectors of D)
>> I was expecting that sequentially accessing the MapFile would be much
>> faster. When I took some stats on a single node with a smaller dataset
>> where A = 100 and D = 100,000 what I observed was that
>> total time taken to iterate over the MapFile = 738 sec
>> total time taken to compute the dot_product = 11 sec
>> My original intention of speeding up the process with MapReduce is
>> defeated by the I/O time involved in accessing each entry in
>> the MapFile. Are there any other avenues that I could explore?
