hadoop-common-user mailing list archives

From Ravikant Dindokar <ravikant.i...@gmail.com>
Subject Joins in Hadoop
Date Wed, 24 Jun 2015 06:35:32 GMT
Hi Hadoop users,

I want to use Hadoop to perform operations on graph data.
I have two files:

1. Edge list file:
         This file contains one line for each edge in the graph.
 sample : <source vertex id>  <sink vertex id>
1    2    (here 1 is the source node and 2 is the sink node of the edge)
1    5
2    3
4    2
4    3
5    6
5    4
5    7
7    8
8    9
8    10

2. Partition file:
         This file contains one line for each vertex. Each line has two
values: the first number is the <vertex id> and the second is the <partition id>.
 sample : <vertex id>  <partition id>
2    1
3    1
4    1
5    2
6    2
7    2
8    1
9    1
10    1


The edge list file is 32 GB in size, while the partition file is 10 GB.
(The files are so large that a map/reduce task can read only the partition
file. I have a 20-node cluster with 24 GB of memory per node.)

My aim is to get all vertices (along with their adjacency lists) that
have the same partition id into one reducer, so that I can perform further
analytics on a given partition in that reducer.
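
To make the desired output concrete, here is a minimal in-memory sketch in plain Python (it is not Hadoop code). It joins the two sample files above on the vertex id and groups the result by partition id. The function names `build_adjacency` and `group_by_partition` are hypothetical, and the sketch assumes whitespace-separated integer ids. At the real file sizes this would not fit in memory, which is why I am looking for a MapReduce approach.

```python
from collections import defaultdict

# Sample rows copied from the two files described above.
EDGES = """\
1 2
1 5
2 3
4 2
4 3
5 6
5 4
5 7
7 8
8 9
8 10
"""

PARTITIONS = """\
2 1
3 1
4 1
5 2
6 2
7 2
8 1
9 1
10 1
"""


def build_adjacency(edge_text):
    """Map each source vertex to the list of its sink vertices."""
    adjacency = defaultdict(list)
    for line in edge_text.splitlines():
        source, sink = line.split()
        adjacency[int(source)].append(int(sink))
    return adjacency


def group_by_partition(edge_text, partition_text):
    """Return {partition id: {vertex id: adjacency list}}."""
    adjacency = build_adjacency(edge_text)
    groups = defaultdict(dict)
    for line in partition_text.splitlines():
        vertex, partition = line.split()
        vertex = int(vertex)
        # A vertex with no outgoing edges gets an empty adjacency list.
        # A vertex missing from the partition file (vertex 1 in the sample)
        # is dropped, because there is no partition id to key it by.
        groups[int(partition)][vertex] = adjacency.get(vertex, [])
    return dict(groups)


if __name__ == "__main__":
    result = group_by_partition(EDGES, PARTITIONS)
    for partition in sorted(result):
        print(partition, result[partition])
```

On the sample data, partition 2 would hold vertex 5 with adjacency list [6, 4, 7], vertex 6 with an empty list, and vertex 7 with [8]. Each such group is what I would like a single reducer to receive.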

Is there any way in Hadoop to join these two files in the mapper, so
that I can map based on the partition id?

Thanks
Ravikant
