hadoop-mapreduce-user mailing list archives

From Jason <urg...@gmail.com>
Subject Re: cross product of two files using MapReduce - pls suggest
Date Wed, 19 Jan 2011 16:09:56 GMT
I am afraid that by reading an HDFS file manually in your mapper, you are losing data locality.
You could try putting the smaller vector set into the distributed cache and preloading it all
into memory in the mapper setup. This assumes that it fits in memory and also that you can
change your M/R job to run over the larger vector set as its input.
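Roughly, the mapper could look like the sketch below. This is only a sketch: the class name,
the tab/comma text format of the vector files, and the cache index are my assumptions, and it
uses the 0.20+ org.apache.hadoop.mapreduce API. The driver would add the file of A vectors with
DistributedCache.addCacheFile(...), and the job input would be the file of D vectors.

// Sketch only: class name and the line format "vectorId<TAB>v1,v2,...,v100" are assumptions.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrossProductMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  // All of A (the small set, ~1000 vectors of dimension 100) held in memory.
  private final Map<String, double[]> aVectors = new HashMap<String, double[]>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Assumes the driver put the file of A vectors into the distributed cache.
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t");
      aVectors.put(parts[0], parseVector(parts[1]));
    }
    in.close();
  }

  @Override
  protected void map(LongWritable offset, Text value, Context context)
      throws IOException, InterruptedException {
    // Each input record is one vector of the large set D, in the same assumed format.
    String[] parts = value.toString().split("\t");
    String dId = parts[0];
    double[] dVec = parseVector(parts[1]);

    // Emit one dot product per (A vector, D vector) pair.
    for (Map.Entry<String, double[]> a : aVectors.entrySet()) {
      double dot = 0.0;
      for (int i = 0; i < dVec.length; i++) {
        dot += a.getValue()[i] * dVec[i];
      }
      context.write(new Text(a.getKey() + "," + dId), new DoubleWritable(dot));
    }
  }

  private static double[] parseVector(String csv) {
    String[] tokens = csv.split(",");
    double[] v = new double[tokens.length];
    for (int i = 0; i < tokens.length; i++) {
      v[i] = Double.parseDouble(tokens[i]);
    }
    return v;
  }
}

With a plain text input over D, the framework handles the splits and data locality for the
large set, and each mapper reads A only once in setup instead of re-scanning a MapFile per record.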

Sent from my iPhone

On Jan 19, 2011, at 3:35 AM, Rohit Kelkar <rohitkelkar@gmail.com> wrote:

> I have two files, A and D, containing (vectorId, vector) on each line.
> |D| = 100,000 and |A| = 1000. Dimensionality of the vectors = 100
> 
> Now I want to execute the following
> 
> for eachItem in A:
>    for eachElem in D:
>        dot_product = eachItem * eachElem
>        save(dot_product)
> 
> 
> What I tried was to convert file D into a MapFile in (key = vectorId,
> value = vector) format and set up a Hadoop job such that:
> inputFile = A
> inputFileFormat = NLineInputFormat
> 
> pseudocode for the map function:
> 
> map(key=vectorid, value=myVector):
>    open(MapFile containing all vectors of D)
>    for eachElem in MapFile:
>        dot_product = myVector * eachElem
>        context.write(dot_product)
>    close(MapFile containing all vectors of D)
> 
> 
> I was expecting that sequentially accessing the MapFile would be much
> faster. When I took some stats on a single node with a smaller dataset,
> where |A| = 100 and |D| = 100,000, what I observed was:
> total time taken to iterate over the MapFile = 738 sec
> total time taken to compute the dot_product = 11 sec
> 
> My original intention of speeding up the process using MapReduce is
> defeated by the I/O time involved in accessing each entry in
> the MapFile. Are there any other avenues that I could explore?
