hadoop-mapreduce-user mailing list archives

From Harshit Mathur <mathursh...@gmail.com>
Subject Re: Joins in Hadoop
Date Thu, 25 Jun 2015 03:55:13 GMT
Yes, you can store it in your custom object as well, the same way you are
storing the adjacency list.
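
A minimal sketch of that idea, simulated in memory with plain Java
collections rather than Hadoop classes. The names GraphRecord,
mapAndShuffle and joinByPartition are made up for illustration; in a real
job the record would be a custom Writable and the grouping would be done
by the shuffle, so treat this only as an outline of the data flow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory outline of the two-job reduce-side join; no Hadoop classes are used. */
final class PartitionJoinSketch {

    /** Hypothetical stand-in for the custom Writable (GraphCustomObject in the thread). */
    static final class GraphRecord {
        final String type;             // "edgefile" or "partitionfile": which mapper emitted it
        final List<Integer> adjacency; // neighbours; empty for partition-file records
        final int partitionId;         // 0 means "not known", as in the thread

        GraphRecord(String type, List<Integer> adjacency, int partitionId) {
            this.type = type;
            this.adjacency = adjacency;
            this.partitionId = partitionId;
        }
    }

    /** Job 1, map and shuffle: tag every input line and group the records by vertex id. */
    static Map<Integer, List<GraphRecord>> mapAndShuffle(int[][] edges, int[][] partitions) {
        Map<Integer, List<GraphRecord>> byVertex = new TreeMap<>();
        for (int[] e : edges) {        // e[0] = source vertex, e[1] = sink vertex
            List<Integer> adj = new ArrayList<>();
            adj.add(e[1]);
            byVertex.computeIfAbsent(e[0], k -> new ArrayList<>())
                    .add(new GraphRecord("edgefile", adj, 0));
        }
        for (int[] p : partitions) {   // p[0] = vertex id, p[1] = partition id
            byVertex.computeIfAbsent(p[0], k -> new ArrayList<>())
                    .add(new GraphRecord("partitionfile", new ArrayList<>(), p[1]));
        }
        return byVertex;
    }

    /**
     * Job 1, reduce: merge all records of one vertex, keeping the partition id
     * next to the adjacency list, and re-key the result by partition id.
     * Job 2 (identity map plus shuffle) is folded in as the grouping on that key.
     */
    static Map<Integer, List<String>> joinByPartition(int[][] edges, int[][] partitions) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (Map.Entry<Integer, List<GraphRecord>> entry
                : mapAndShuffle(edges, partitions).entrySet()) {
            List<Integer> adjacency = new ArrayList<>();
            int partitionId = 0;       // stays 0 if the vertex is missing from the partition file
            for (GraphRecord r : entry.getValue()) {
                if (r.type.equals("edgefile")) {
                    adjacency.addAll(r.adjacency);
                } else {
                    partitionId = r.partitionId;
                }
            }
            byPartition.computeIfAbsent(partitionId, k -> new ArrayList<>())
                       .add(entry.getKey() + "->" + adjacency);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        int[][] edges = {{1, 2}, {1, 5}, {2, 3}, {5, 6}};
        int[][] partitions = {{1, 1}, {2, 1}, {3, 1}, {5, 2}, {6, 2}};
        System.out.println(joinByPartition(edges, partitions));
    }
}
```

Each entry in the result corresponds to what one Job 2 reducer would see:
a partition id and every vertex of that partition with its adjacency list.
A vertex that never appears in the partition file ends up under partition
id 0 in this sketch, so that case would need its own handling.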

On Wed, Jun 24, 2015 at 10:10 PM, Ravikant Dindokar <ravikant.iisc@gmail.com
> wrote:

> Hi Harshit,
>
> Is there any way to retain the partition id for each vertex in the
> adjacency list?
>
>
> Thanks
> Ravikant
>
> On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <
> ravikant.iisc@gmail.com> wrote:
>
>> Thanks Harshit
>>
>> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <mathursharp@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>>
>>> This may be a solution (I hope I understood the problem correctly).
>>>
>>> Job 1:
>>>
>>> You need two Mappers, one reading from the Edge file and the other
>>> reading from the Partition file,
>>> say EdgeFileMapper and PartitionFileMapper, and a common Reducer.
>>> Now you can have a custom Writable (say GraphCustomObject) holding the
>>> following:
>>> 1) type: indicates which mapper the object came from
>>> 2) adjacency vertex list: the list of adjacent vertices
>>> 3) partition id: holds the partition id
>>>
>>> Now the output key and value of the EdgeFileMapper will be,
>>> key=> vertexId
>>> value=> {type=edgefile; adjacencyVertex, partitionId=0 (the partition id
>>> is not present in this file)}
>>>
>>> The output of PartitionFileMapper will be,
>>> key=> vertexId
>>> value=> {type=partitionfile; adjacencyVertex=0, partitionId}
>>>
>>>
>>> So in the Reducer, for each vertexId we can have the complete
>>> GraphCustomObject populated.
>>> vertexId => {complete adjacencyVertex list, partitionId}
>>>
>>> The output of this reducer will be,
>>> key=> partitionId
>>> value=> {adjacencyVertexList, vertexId}
>>> This will be stored as the output of Job 1.
>>>
>>> Job 2:
>>> This job reads the output generated by the previous job and uses an
>>> identity Mapper, so in the reducer we will have
>>> key=> partitionId
>>> value=> list of all the adjacency vertex lists, each with its vertexId
>>>
>>>
>>>
>>> I know my explanation seems a bit messy, sorry for that.
>>>
>>> BR,
>>> Harshit
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
>>> ravikant.iisc@gmail.com> wrote:
>>>
>>>> Hi Hadoop user,
>>>>
>>>> I want to use Hadoop to perform operations on graph data.
>>>> I have two files:
>>>>
>>>> 1. Edge list file
>>>>         This file contains one line for each edge in the graph.
>>>> sample:
>>>> 1    2 (here 1 is source and 2 is sink node for the edge)
>>>> 1    5
>>>> 2    3
>>>> 4    2
>>>> 4    3
>>>> 5    6
>>>> 5    4
>>>> 5    7
>>>> 7    8
>>>> 8    9
>>>> 8    10
>>>>
>>>> 2. Partition file :
>>>>          This file contains one line for each vertex. Each line has two
>>>> values: the first number is <vertex id> and the second is <partition id>.
>>>>  sample : <vertex id>  <partition id>
>>>> 2    1
>>>> 3    1
>>>> 4    1
>>>> 5    2
>>>> 6    2
>>>> 7    2
>>>> 8    1
>>>> 9    1
>>>> 10    1
>>>>
>>>>
>>>> The edge list file is 32 GB, while the partition file is 10 GB.
>>>> (The size is so large that a map/reduce task can read only the partition
>>>> file. I have a 20-node cluster with 24 GB of memory per node.)
>>>>
>>>> My aim is to get all vertices (along with their adjacency lists) that
>>>> have the same partition id into one reducer, so that I can perform
>>>> further analytics on a given partition in the reducer.
>>>>
>>>> Is there any way in Hadoop to join these two files in the mapper, so
>>>> that I can map based on the partition id?
>>>>
>>>> Thanks
>>>> Ravikant
>>>>
>>>
>>>
>>>
>>> --
>>> Harshit Mathur
>>>
>>
>>
>


-- 
Harshit Mathur
