hadoop-user mailing list archives

From Ravikant Dindokar <ravikant.i...@gmail.com>
Subject Re: Joins in Hadoop
Date Wed, 24 Jun 2015 16:40:08 GMT
Hi Harshit,

Is there any way to retain the partition id for each vertex in the
adjacency list?


Thanks
Ravikant

On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <ravikant.iisc@gmail.com>
wrote:

> Thanks Harshit
>
> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <mathursharp@gmail.com>
> wrote:
>
>> Hi,
>>
>>
>> This may be the solution (I hope I understood the problem correctly):
>>
>> Job 1:
>>
>> You need to have two Mappers, one reading from the Edge file and the other
>> reading from the Partition file.
>> Say, EdgeFileMapper and PartitionFileMapper, and a common Reducer.
>> Now you can have a custom Writable (say GraphCustomObject) holding the
>> following (a rough sketch follows the list):
>> 1) type: indicates which mapper the record came from
>> 2) adjacency vertex list: the list of adjacent vertices
>> 3) partition id: to hold the partition id
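>>
>> A minimal sketch of such a Writable (class and field names are only
>> illustrative, not tested code; plain getters/setters are assumed):
>>
>> import java.io.DataInput;
>> import java.io.DataOutput;
>> import java.io.IOException;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.io.Writable;
>>
>> public class GraphCustomObject implements Writable {
>>   private Text type = new Text();                // "edgefile" or "partitionfile"
>>   private Text adjacencyVertexList = new Text(); // e.g. comma-separated neighbours
>>   private long partitionId = 0;                  // 0 when not yet known
>>
>>   public void write(DataOutput out) throws IOException {
>>     type.write(out);
>>     adjacencyVertexList.write(out);
>>     out.writeLong(partitionId);
>>   }
>>
>>   public void readFields(DataInput in) throws IOException {
>>     type.readFields(in);
>>     adjacencyVertexList.readFields(in);
>>     partitionId = in.readLong();
>>   }
>>   // plain getters/setters omitted for brevity
>> }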
>>
>> Now the output key and value of the EdgeFileMapper will be:
>> key => vertexId
>> value => {type=edgefile; adjacencyVertexList; partitionId=0 (not present in
>> this file)}
>>
>> The output of the PartitionFileMapper will be:
>> key => vertexId
>> value => {type=partitionfile; adjacencyVertexList=empty (not present in this
>> file); partitionId}
>> (A sketch of both mappers follows below.)
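>>
>> A rough sketch of the two mappers (new mapreduce API; I am assuming the
>> input is whitespace separated as in your samples, and that GraphCustomObject
>> has simple String/long setters):
>>
>> import java.io.IOException;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>>
>> // in its own file
>> public class EdgeFileMapper
>>     extends Mapper<LongWritable, Text, LongWritable, GraphCustomObject> {
>>   @Override
>>   protected void map(LongWritable offset, Text line, Context context)
>>       throws IOException, InterruptedException {
>>     String[] parts = line.toString().trim().split("\\s+"); // "src sink"
>>     GraphCustomObject value = new GraphCustomObject();
>>     value.setType("edgefile");
>>     value.setAdjacencyVertexList(parts[1]); // one neighbour per input line
>>     value.setPartitionId(0);                // unknown in this file
>>     context.write(new LongWritable(Long.parseLong(parts[0])), value);
>>   }
>> }
>>
>> // in its own file
>> public class PartitionFileMapper
>>     extends Mapper<LongWritable, Text, LongWritable, GraphCustomObject> {
>>   @Override
>>   protected void map(LongWritable offset, Text line, Context context)
>>       throws IOException, InterruptedException {
>>     String[] parts = line.toString().trim().split("\\s+"); // "vertex partition"
>>     GraphCustomObject value = new GraphCustomObject();
>>     value.setType("partitionfile");
>>     value.setAdjacencyVertexList("");       // not present in this file
>>     value.setPartitionId(Long.parseLong(parts[1]));
>>     context.write(new LongWritable(Long.parseLong(parts[0])), value);
>>   }
>> }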
>>
>>
>> So in the Reducer, for each vertexId we can have the complete
>> GraphCustomObject populated:
>> vertexId => {complete adjacency vertex list, partitionId}
>>
>> The output of this reducer will be:
>> key => partitionId
>> value => {adjacencyVertexList, vertexId}
>> This will be stored as the output of Job 1. (A reducer sketch follows.)
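>>
>> A sketch of the Job 1 reducer that merges the two sides for each vertexId
>> (getter names are assumed to match the Writable sketch above):
>>
>> import java.io.IOException;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Reducer;
>>
>> public class GraphJoinReducer
>>     extends Reducer<LongWritable, GraphCustomObject, LongWritable, Text> {
>>   @Override
>>   protected void reduce(LongWritable vertexId,
>>       Iterable<GraphCustomObject> values, Context context)
>>       throws IOException, InterruptedException {
>>     long partitionId = 0;
>>     StringBuilder adjacency = new StringBuilder();
>>     for (GraphCustomObject value : values) {
>>       if ("partitionfile".equals(value.getType())) {
>>         partitionId = value.getPartitionId();   // the joined-in partition id
>>       } else if (value.getAdjacencyVertexList().length() > 0) {
>>         if (adjacency.length() > 0) adjacency.append(",");
>>         adjacency.append(value.getAdjacencyVertexList());
>>       }
>>     }
>>     // key = partitionId, value = "vertexId:neighbour1,neighbour2,..."
>>     context.write(new LongWritable(partitionId),
>>         new Text(vertexId.get() + ":" + adjacency));
>>   }
>> }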
>>
>> Job 2:
>> This job will read the output generated by the previous job and use an
>> identity Mapper, so in the reducer we will have:
>> key => partitionId
>> value => list of all the adjacency vertex lists along with their vertexIds
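>>
>> A sketch of the Job 2 reducer; assuming an identity mapper (plain
>> Mapper.class) over KeyValueTextInputFormat reading Job 1's text output, all
>> vertex records of one partition arrive in a single reduce call:
>>
>> import java.io.IOException;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Reducer;
>>
>> public class PartitionReducer extends Reducer<Text, Text, Text, Text> {
>>   @Override
>>   protected void reduce(Text partitionId, Iterable<Text> vertexRecords,
>>       Context context) throws IOException, InterruptedException {
>>     for (Text record : vertexRecords) {   // each is "vertexId:adjacencyList"
>>       // per-partition analytics on the vertices would go here
>>       context.write(partitionId, record);
>>     }
>>   }
>> }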
>>
>>
>>
>> I know my explanation seems a bit messy, sorry for that.
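>>
>> For completeness, the Job 1 driver could wire the two mappers together with
>> MultipleInputs roughly like this (a sketch only; the paths are placeholders):
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Job;
>> import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
>> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>>
>> public class GraphJoinDriver {
>>   public static void main(String[] args) throws Exception {
>>     Job job1 = Job.getInstance(new Configuration(), "edge-partition-join");
>>     job1.setJarByClass(GraphJoinDriver.class);
>>     // each input file gets its own mapper; both feed the same reducer
>>     MultipleInputs.addInputPath(job1, new Path("/input/edges"),
>>         TextInputFormat.class, EdgeFileMapper.class);
>>     MultipleInputs.addInputPath(job1, new Path("/input/partitions"),
>>         TextInputFormat.class, PartitionFileMapper.class);
>>     job1.setReducerClass(GraphJoinReducer.class);
>>     job1.setMapOutputKeyClass(LongWritable.class);
>>     job1.setMapOutputValueClass(GraphCustomObject.class);
>>     job1.setOutputKeyClass(LongWritable.class);
>>     job1.setOutputValueClass(Text.class);
>>     FileOutputFormat.setOutputPath(job1, new Path("/output/job1"));
>>     System.exit(job1.waitForCompletion(true) ? 0 : 1);
>>   }
>> }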
>>
>> BR,
>> Harshit
>>
>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
>> ravikant.iisc@gmail.com> wrote:
>>
>>> Hi Hadoop user,
>>>
>>> I want to use Hadoop for performing operations on graph data.
>>> I have two files:
>>>
>>> 1. Edge list file
>>>         This file contains one line for each edge in the graph.
>>> sample:
>>> 1    2 (here 1 is source and 2 is sink node for the edge)
>>> 1    5
>>> 2    3
>>> 4    2
>>> 4    3
>>> 5    6
>>> 5    4
>>> 5    7
>>> 7    8
>>> 8    9
>>> 8    10
>>>
>>> 2. Partition file:
>>>          This file contains one line for each vertex. Each line has two
>>> values: the first number is <vertex id> and the second number is
>>> <partition id>.
>>>  sample : <vertex id>  <partition id>
>>> 2    1
>>> 3    1
>>> 4    1
>>> 5    2
>>> 6    2
>>> 7    2
>>> 8    1
>>> 9    1
>>> 10    1
>>>
>>>
>>> The edge list file is 32 GB in size, while the partition file is 10 GB.
>>> (The files are so large that map/reduce can load only the partition file
>>> in memory. I have a 20-node cluster with 24 GB of memory per node.)
>>>
>>> My aim is to get all vertices (along with their adjacency lists) that
>>> have the same partition id in one reducer, so that I can perform further
>>> analytics on a given partition in the reducer.
>>>
>>> Is there any way in Hadoop to join these two files in the mapper so that
>>> I can map based on the partition id?
>>>
>>> Thanks
>>> Ravikant
>>>
>>
>>
>>
>> --
>> Harshit Mathur
>>
>
>
