hadoop-mapreduce-user mailing list archives

From Ravikant Dindokar <ravikant.i...@gmail.com>
Subject Re: Joins in Hadoop
Date Wed, 24 Jun 2015 14:25:52 GMT
Thanks Harshit

On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <mathursharp@gmail.com>
wrote:

> Hi,
>
>
> This may be the solution (I hope I understood the problem correctly).
>
> Job 1:
>
> You need to have two Mappers, one reading from the Edge file and the other
> reading from the Partition file.
> Say, EdgeFileMapper and PartitionFileMapper, and a common Reducer.
> Now you can have a custom writable (say GraphCustomObject) holding the
> following (sketched just below):
> 1) type: indicates which mapper the record came from
> 2) adjacency vertex list: the list of adjacent vertices
> 3) partition id: holds the partition id
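>
> A rough sketch of such a writable (field and method names here are just
> illustrative, not a finished implementation):
>
> import java.io.DataInput;
> import java.io.DataOutput;
> import java.io.IOException;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.io.Writable;
>
> public class GraphCustomObject implements Writable {
>     private final Text type = new Text();              // which mapper the record came from
>     private final Text adjacentVertices = new Text();  // comma-separated adjacent vertex ids
>     private final IntWritable partitionId = new IntWritable(0);
>
>     public void set(String typeStr, String adjacency, int partition) {
>         type.set(typeStr);
>         adjacentVertices.set(adjacency);
>         partitionId.set(partition);
>     }
>
>     public String getType()             { return type.toString(); }
>     public String getAdjacentVertices() { return adjacentVertices.toString(); }
>     public int getPartitionId()         { return partitionId.get(); }
>
>     @Override
>     public void write(DataOutput out) throws IOException {
>         type.write(out);
>         adjacentVertices.write(out);
>         partitionId.write(out);
>     }
>
>     @Override
>     public void readFields(DataInput in) throws IOException {
>         type.readFields(in);
>         adjacentVertices.readFields(in);
>         partitionId.readFields(in);
>     }
> }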
>
> Now the output key and value of the EdgeFileMapper will be,
> key => vertexId
> value => {type=edgefile; adjacencyVertexList=<sink vertex>; partitionId=0
> (the partition id is not present in this file)}
>
> The output of the PartitionFileMapper will be,
> key => vertexId
> value => {type=partitionfile; adjacencyVertexList=empty; partitionId=<partition id>}
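>
> The two mappers could then look roughly like this (each class would go in
> its own file; the parsing assumes whitespace-separated lines as in your
> samples):
>
> import java.io.IOException;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> public class EdgeFileMapper
>         extends Mapper<LongWritable, Text, Text, GraphCustomObject> {
>     private final Text vertexId = new Text();
>     private final GraphCustomObject out = new GraphCustomObject();
>
>     @Override
>     protected void map(LongWritable offset, Text line, Context context)
>             throws IOException, InterruptedException {
>         String[] parts = line.toString().trim().split("\\s+");  // "<source> <sink>"
>         vertexId.set(parts[0]);
>         out.set("edgefile", parts[1], 0);  // partition id is unknown in this file
>         context.write(vertexId, out);
>     }
> }
>
> public class PartitionFileMapper
>         extends Mapper<LongWritable, Text, Text, GraphCustomObject> {
>     private final Text vertexId = new Text();
>     private final GraphCustomObject out = new GraphCustomObject();
>
>     @Override
>     protected void map(LongWritable offset, Text line, Context context)
>             throws IOException, InterruptedException {
>         String[] parts = line.toString().trim().split("\\s+");  // "<vertex id> <partition id>"
>         vertexId.set(parts[0]);
>         out.set("partitionfile", "", Integer.parseInt(parts[1]));  // no adjacency in this file
>         context.write(vertexId, out);
>     }
> }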
>
>
> So in the Reducer, for each vertexId we can have the complete
> GraphCustomObject populated:
> vertexId => {complete adjacency vertex list, partitionId}
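>
> A reducer along these lines could do the join and re-key everything by
> partition id (the class name and the "adjacencyList;vertexId" text encoding
> are just one possible choice; the output it emits is described below):
>
> import java.io.IOException;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Reducer;
>
> public class VertexJoinReducer
>         extends Reducer<Text, GraphCustomObject, IntWritable, Text> {
>     private final IntWritable partitionId = new IntWritable();
>     private final Text outValue = new Text();
>
>     @Override
>     protected void reduce(Text vertexId, Iterable<GraphCustomObject> values, Context context)
>             throws IOException, InterruptedException {
>         StringBuilder adjacency = new StringBuilder();
>         int partition = 0;
>         for (GraphCustomObject value : values) {
>             if ("edgefile".equals(value.getType())) {
>                 if (adjacency.length() > 0) adjacency.append(',');
>                 adjacency.append(value.getAdjacentVertices());  // collect full adjacency list
>             } else {
>                 partition = value.getPartitionId();             // comes from the partition file
>             }
>         }
>         partitionId.set(partition);
>         outValue.set(adjacency + ";" + vertexId);               // {adjacencyVertexList, vertexId}
>         context.write(partitionId, outValue);
>     }
> }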
>
> The output of this reducer will be,
> key => partitionId
> value => {adjacencyVertexList, vertexId}
> This will be stored as the output of Job 1.
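>
> Wiring Job 1 together is where MultipleInputs comes in. A driver could look
> roughly like this (class names, argument order and the SequenceFile output
> are my own choices; the SequenceFile just lets Job 2 read the
> <partitionId, value> pairs back directly):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
> import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
>
> public class Job1Driver {
>     public static void main(String[] args) throws Exception {
>         Job job = Job.getInstance(new Configuration(), "vertex-partition-join");
>         job.setJarByClass(Job1Driver.class);
>
>         // one mapper per input file, feeding a common reducer
>         MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, EdgeFileMapper.class);
>         MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, PartitionFileMapper.class);
>         job.setReducerClass(VertexJoinReducer.class);
>
>         job.setMapOutputKeyClass(Text.class);
>         job.setMapOutputValueClass(GraphCustomObject.class);
>         job.setOutputKeyClass(IntWritable.class);
>         job.setOutputValueClass(Text.class);
>
>         // sequence file output so Job 2 can read <IntWritable, Text> directly
>         job.setOutputFormatClass(SequenceFileOutputFormat.class);
>         FileOutputFormat.setOutputPath(job, new Path(args[2]));
>
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
> }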
>
> Job 2
> This job will read the output generated by the previous job and use an
> identity Mapper, so in the reducer we will have
> key => partitionId
> value => the list of all adjacency vertex lists along with their vertexIds
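>
> A sketch of Job 2 (the default Mapper is already the identity mapper, so
> only a reducer and a bit of driver code are needed; the class name and the
> assumption that Job 1 wrote a SequenceFile are mine):
>
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.IntWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.Reducer;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
> public class PartitionAnalyticsJob {
>
>     // All vertices of one partition arrive here, each as "adjacencyList;vertexId"
>     public static class PartitionReducer
>             extends Reducer<IntWritable, Text, IntWritable, Text> {
>         @Override
>         protected void reduce(IntWritable partitionId, Iterable<Text> vertices, Context context)
>                 throws IOException, InterruptedException {
>             for (Text vertexRecord : vertices) {
>                 // run the per-partition analytics here; this just writes the records out
>                 context.write(partitionId, vertexRecord);
>             }
>         }
>     }
>
>     public static void main(String[] args) throws Exception {
>         Job job = Job.getInstance(new Configuration(), "group-by-partition");
>         job.setJarByClass(PartitionAnalyticsJob.class);
>         job.setInputFormatClass(SequenceFileInputFormat.class);  // output of Job 1
>         // no setMapperClass call: the default Mapper passes <partitionId, value> through unchanged
>         job.setReducerClass(PartitionReducer.class);
>         job.setOutputKeyClass(IntWritable.class);
>         job.setOutputValueClass(Text.class);
>         FileInputFormat.addInputPath(job, new Path(args[0]));
>         FileOutputFormat.setOutputPath(job, new Path(args[1]));
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
> }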
>
>
>
> I know my explanation seems a bit messy, sorry for that.
>
> BR,
> Harshit
>
> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
> ravikant.iisc@gmail.com> wrote:
>
>> Hi Hadoop user,
>>
>> I want to use Hadoop for performing operations on graph data.
>> I have two files:
>>
>> 1. Edge list file
>>         This file contains one line for each edge in the graph.
>> sample:
>> 1    2 (here 1 is the source and 2 is the sink node of the edge)
>> 1    5
>> 2    3
>> 4    2
>> 4    3
>> 5    6
>> 5    4
>> 5    7
>> 7    8
>> 8    9
>> 8    10
>>
>> 2. Partition file:
>>          This file contains one line for each vertex. Each line has two
>> values: the first is <vertex id> and the second is <partition id>.
>> sample: <vertex id>  <partition id>
>> 2    1
>> 3    1
>> 4    1
>> 5    2
>> 6    2
>> 7    2
>> 8    1
>> 9    1
>> 10    1
>>
>>
>> The edge list file is 32 GB, while the partition file is 10 GB.
>> (The files are so large that a map/reduce task could load only the
>> partition file into memory. I have a 20-node cluster with 24 GB of memory
>> per node.)
>>
>> My aim is to get all vertices (along with their adjacency lists) that
>> have the same partition id into one reducer, so that I can perform further
>> analytics on a given partition in the reducer.
>>
>> Is there any way in Hadoop to join these two files in the mapper, so
>> that I can map based on the partition id?
>>
>> Thanks
>> Ravikant
>>
>
>
>
> --
> Harshit Mathur
>
