hadoop-common-user mailing list archives

From Ravikant Dindokar <ravikant.i...@gmail.com>
Subject Re: Joins in Hadoop
Date Thu, 25 Jun 2015 04:36:07 GMT
But in the reducer for Job 1, you have:
vertexId => {adjacencyVertex complete list, partitionId}

so the partition ids for the vertices in the adjacency list are not
available. Essentially, the output I am trying to get is

<vertex_id, partitionId>, <list>
where each element of the list is of type <vertex_id, partitionId>

Can this be achieved in a single MapReduce job?

Thanks
Ravikant
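For reference, the value shape Ravikant describes — each adjacency-list
entry carrying its own partition id, as Harshit's follow-up below suggests
storing in the custom object — could be modeled like this plain-Java sketch
(class and field names are hypothetical, not from the thread):

```java
import java.util.*;

// Sketch of a value type where every neighbor is stored together with its
// partition id, i.e. <vertex_id, partitionId>, <list of <vertex_id, partitionId>>.
// Names are illustrative; a real Hadoop job would wrap this in a Writable.
class PartitionedAdjacency {
    final int vertexId;
    final int partitionId;
    // Each entry is {neighborId, neighborPartitionId}.
    final List<int[]> neighbors = new ArrayList<>();

    PartitionedAdjacency(int vertexId, int partitionId) {
        this.vertexId = vertexId;
        this.partitionId = partitionId;
    }

    void addNeighbor(int neighborId, int neighborPartitionId) {
        neighbors.add(new int[] {neighborId, neighborPartitionId});
    }
}
```

With the sample data below, vertex 5 (partition 2) with neighbors 6
(partition 2) and 4 (partition 1) would be one such object.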




On Thu, Jun 25, 2015 at 9:25 AM, Harshit Mathur <mathursharp@gmail.com>
wrote:

> Yeah, you can store it as well in your custom object, the same way you
> are storing the adjacency list.
>
> On Wed, Jun 24, 2015 at 10:10 PM, Ravikant Dindokar <
> ravikant.iisc@gmail.com> wrote:
>
>> Hi Harshit,
>>
>> Is there any way to retain the partition id for each vertex in the
>> adjacency list?
>>
>>
>> Thanks
>> Ravikant
>>
>> On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <
>> ravikant.iisc@gmail.com> wrote:
>>
>>> Thanks Harshit
>>>
>>> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <mathursharp@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>
>>>> This may be the solution (I hope I understood the problem correctly):
>>>>
>>>> Job 1:
>>>>
>>>> You need to have two Mappers, one reading from the edge file and the
>>>> other reading from the partition file.
>>>> Say, EdgeFileMapper and PartitionFileMapper, and a common Reducer.
>>>> Now you can have a custom Writable (say GraphCustomObject) holding the
>>>> following:
>>>> 1) type: a tag for which mapper the object came from
>>>> 2) adjacencyVertex list: the list of adjacent vertices
>>>> 3) partitionId: holds the partition id
>>>>
>>>> Now the output key and value of the EdgeFileMapper will be:
>>>> key => vertexId
>>>> value => {type=edgefile; adjacencyVertex; partitionId=0 (not present
>>>> in this file)}
>>>>
>>>> The output of the PartitionFileMapper will be:
>>>> key => vertexId
>>>> value => {type=partitionfile; adjacencyVertex=empty; partitionId}
>>>>
>>>>
>>>> So in the Reducer, for each vertexId, we can have the complete
>>>> GraphCustomObject populated:
>>>> vertexId => {adjacencyVertex complete list, partitionId}
>>>>
>>>> The output of this reducer will be:
>>>> key => partitionId
>>>> value => {adjacencyVertexList, vertexId}
>>>> This will be stored as the output of Job 1.
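Job 1 as described above is a classic tagged reduce-side join. A minimal
plain-Java sketch of its logic, with an in-memory map standing in for the
shuffle and no Hadoop dependencies (class and method names are illustrative
only, not actual Hadoop API):

```java
import java.util.*;

// Tagged value, as in the GraphCustomObject described in the thread.
class GraphCustomObject {
    String type;                              // "edgefile" or "partitionfile"
    List<Integer> adjacency = new ArrayList<>();
    int partitionId;
}

class Job1Sketch {
    // Stand-in for the shuffle: vertexId -> all tagged values emitted for it.
    static Map<Integer, List<GraphCustomObject>> shuffle = new TreeMap<>();

    static void emit(int vertexId, GraphCustomObject value) {
        shuffle.computeIfAbsent(vertexId, k -> new ArrayList<>()).add(value);
    }

    // EdgeFileMapper: one record per edge "src dst".
    static void edgeFileMap(int src, int dst) {
        GraphCustomObject v = new GraphCustomObject();
        v.type = "edgefile";
        v.adjacency.add(dst);
        emit(src, v);
    }

    // PartitionFileMapper: one record per "vertexId partitionId".
    static void partitionFileMap(int vertexId, int partitionId) {
        GraphCustomObject v = new GraphCustomObject();
        v.type = "partitionfile";
        v.partitionId = partitionId;
        emit(vertexId, v);
    }

    // Reducer: merge the tagged values so each vertex ends up with its
    // complete adjacency list and its partition id in one object.
    static Map<Integer, GraphCustomObject> reduce() {
        Map<Integer, GraphCustomObject> out = new TreeMap<>();
        for (Map.Entry<Integer, List<GraphCustomObject>> e : shuffle.entrySet()) {
            GraphCustomObject merged = new GraphCustomObject();
            for (GraphCustomObject v : e.getValue()) {
                if (v.type.equals("edgefile")) merged.adjacency.addAll(v.adjacency);
                else merged.partitionId = v.partitionId;
            }
            out.put(e.getKey(), merged);
        }
        return out;
    }
}
```

In real Hadoop the two mappers would be attached to their input paths via
MultipleInputs, and GraphCustomObject would implement Writable.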
>>>>
>>>> Job 2:
>>>> This job will read the output generated by the previous job and use an
>>>> identity Mapper, so in the reducer we will have:
>>>> key => partitionId
>>>> value => list of all the adjacency vertex lists along with their vertexIds
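Job 2 does no real map-side work; the shuffle's grouping by key is the whole
job. A plain-Java sketch of that grouping step (names are hypothetical, not
from the thread):

```java
import java.util.*;

// Sketch of Job 2: group Job 1's (partitionId -> {vertexId, adjacency})
// records so one reduce call sees every vertex of one partition.
class Job2Sketch {
    // One Job 1 output record.
    static class VertexRecord {
        final int partitionId;
        final int vertexId;
        final List<Integer> adjacency;

        VertexRecord(int partitionId, int vertexId, List<Integer> adjacency) {
            this.partitionId = partitionId;
            this.vertexId = vertexId;
            this.adjacency = adjacency;
        }
    }

    // Identity map + shuffle: records end up grouped by partitionId, which
    // is exactly what each reducer receives in Job 2.
    static Map<Integer, List<VertexRecord>> groupByPartition(List<VertexRecord> records) {
        Map<Integer, List<VertexRecord>> grouped = new TreeMap<>();
        for (VertexRecord r : records) {
            grouped.computeIfAbsent(r.partitionId, k -> new ArrayList<>()).add(r);
        }
        return grouped;
    }
}
```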
>>>>
>>>>
>>>>
>>>> I know my explanation seems a bit messy, sorry for that.
>>>>
>>>> BR,
>>>> Harshit
>>>>
>>>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
>>>> ravikant.iisc@gmail.com> wrote:
>>>>
>>>>> Hi Hadoop user,
>>>>>
>>>>> I want to use Hadoop for performing operations on graph data. I have
>>>>> two files:
>>>>>
>>>>> 1. Edge list file
>>>>>         This file contains one line for each edge in the graph.
>>>>> sample:
>>>>> 1    2 (here 1 is source and 2 is sink node for the edge)
>>>>> 1    5
>>>>> 2    3
>>>>> 4    2
>>>>> 4    3
>>>>> 5    6
>>>>> 5    4
>>>>> 5    7
>>>>> 7    8
>>>>> 8    9
>>>>> 8    10
>>>>>
>>>>> 2. Partition file :
>>>>>          This file contains one line for each vertex. Each line has
>>>>> two values: the first is <vertex id> and the second is <partition id>.
>>>>>  sample : <vertex id>  <partition id>
>>>>> 2    1
>>>>> 3    1
>>>>> 4    1
>>>>> 5    2
>>>>> 6    2
>>>>> 7    2
>>>>> 8    1
>>>>> 9    1
>>>>> 10    1
>>>>>
>>>>>
>>>>> The edge list file is 32 GB, while the partition file is 10 GB. (The
>>>>> sizes are so large that a map/reduce task can read only the partition
>>>>> file. I have a 20-node cluster with 24 GB of memory per node.)
>>>>>
>>>>> My aim is to get all vertices (along with their adjacency lists) that
>>>>> have the same partition id into one reducer, so that I can perform
>>>>> further analytics on a given partition in the reducer.
>>>>>
>>>>> Is there any way in Hadoop to join these two files in the mapper so
>>>>> that I can map based on the partition id?
>>>>>
>>>>> Thanks
>>>>> Ravikant
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Harshit Mathur
>>>>
>>>
>>>
>>
>
>
> --
> Harshit Mathur
>
