hadoop-hdfs-user mailing list archives

From Madhav Sharan <msha...@usc.edu>
Subject Pairwise similarity using map reduce
Date Wed, 10 Aug 2016 19:25:46 GMT
Hi hadoop users,

I have a set of vectors stored in .txt files on HDFS. The goal is to take every
pair of vectors and compute the similarity between them.

   1. We generate pairs of vectors with a Python script and give them as
   input to the MR job. The input file has comma-separated paths to the
   vector files: "/path/to/vec1, path/to/vec2".
   2. Each mapper task then gets (Path1, Path2) and computes the similarity.
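
For reference, the pair-generation step in item 1 can be sketched roughly as
below. This is a minimal illustration, not our exact script; the function name
`pair_lines` and the example paths are hypothetical.

```python
from itertools import combinations


def pair_lines(paths):
    """Yield one "path1,path2" line per unordered pair of vector files."""
    for left, right in combinations(paths, 2):
        yield f"{left},{right}"


if __name__ == "__main__":
    # Hypothetical HDFS paths, for illustration only.
    vectors = ["/data/vec1.txt", "/data/vec2.txt", "/data/vec3.txt"]
    for line in pair_lines(vectors):
        print(line)
```

Each emitted line becomes one input record, so a mapper sees exactly one pair
of paths per record.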

To do this, the mapper reads the file at Path1 and then the file at Path2
using the HDFS API. So each file is read many times because of the pairwise
calculation.
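
To put a number on the repeated reads: with n vector files there are
n(n-1)/2 pairs and each file appears in n-1 of them, so the mappers perform
n(n-1) file reads in total. A small sketch of that arithmetic follows; the
helper name `read_counts` and the file names are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def read_counts(paths):
    """Count how often each file is opened when every pair is read independently."""
    counts = Counter()
    for left, right in combinations(paths, 2):
        counts[left] += 1
        counts[right] += 1
    return counts


if __name__ == "__main__":
    files = [f"vec{i}" for i in range(1, 5)]  # 4 hypothetical files
    counts = read_counts(files)
    print(counts)                # every file is read 3 times (n - 1)
    print(sum(counts.values()))  # 12 reads in total, i.e. n * (n - 1)
```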

I am trying to find a way to read each file only once, so that my mappers
receive the contents of the files rather than the file paths.

Can someone please share any technique they have used in the past that might
help?

Thanks
--
Madhav Sharan
