hadoop-user mailing list archives

From Harshit Mathur <mathursh...@gmail.com>
Subject Re: Joins in Hadoop
Date Thu, 25 Jun 2015 05:10:19 GMT
So basically you want <vertex_id, partitionId> as your key?
If that is the case, you can define a custom key type by implementing
WritableComparable.

But I am not sure the logic permits doing this in a single MapReduce
job. As per my understanding of your problem, what you want to achieve
will need two jobs.
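A minimal sketch of such a composite key, assuming a hypothetical class name VertexPartitionKey: in a real job it would implement org.apache.hadoop.io.WritableComparable, but the version below keeps the same three methods (write, readFields, compareTo) dependency-free so it can be read on its own.

```java
import java.io.*;

// Composite key pairing a vertex id with its partition id. In a real Hadoop
// job this class would implement
// org.apache.hadoop.io.WritableComparable<VertexPartitionKey>; here it is
// kept dependency-free with the same three methods.
class VertexPartitionKey implements Comparable<VertexPartitionKey> {
    private long vertexId;
    private int partitionId;

    public VertexPartitionKey() {}  // Hadoop requires a no-arg constructor
    public VertexPartitionKey(long vertexId, int partitionId) {
        this.vertexId = vertexId;
        this.partitionId = partitionId;
    }

    public void write(DataOutput out) throws IOException {
        out.writeLong(vertexId);
        out.writeInt(partitionId);
    }

    public void readFields(DataInput in) throws IOException {
        vertexId = in.readLong();
        partitionId = in.readInt();
    }

    @Override
    public int compareTo(VertexPartitionKey other) {
        // Sort primarily on partition id so one partition's vertices group
        // together; break ties on vertex id for a stable order.
        int cmp = Integer.compare(partitionId, other.partitionId);
        return cmp != 0 ? cmp : Long.compare(vertexId, other.vertexId);
    }

    public long getVertexId() { return vertexId; }
    public int getPartitionId() { return partitionId; }

    public static void main(String[] args) throws IOException {
        VertexPartitionKey key = new VertexPartitionKey(5, 2);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        key.write(new DataOutputStream(buf));

        VertexPartitionKey copy = new VertexPartitionKey();
        copy.readFields(new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(copy.getVertexId() + " " + copy.getPartitionId()); // 5 2
    }
}
```

The serialize/deserialize round trip is what the shuffle relies on, and the compareTo order is what would bring one partition's keys together at a reducer.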

On Thu, Jun 25, 2015 at 10:06 AM, Ravikant Dindokar <ravikant.iisc@gmail.com
> wrote:

> but in the reducer for Job 1, you have:
> vertexId => {adjacencyVertex complete list, partitionId}
>
> so the partition ids for the vertices in the adjacency list are not
> available. Essentially, what I am trying to get as output is
>
> <vertex_id, partitionId>, <list>
> where each element of the list is of type <vertex_id, partitionId>
>
> Can this be achieved in a single map-reduce job?
>
> Thanks
> Ravikant
>
> On Thu, Jun 25, 2015 at 9:25 AM, Harshit Mathur <mathursharp@gmail.com>
> wrote:
>
>> Yeah, you can store that in your custom object as well, just like you
>> are storing the adjacency list.
>>
>> On Wed, Jun 24, 2015 at 10:10 PM, Ravikant Dindokar <
>> ravikant.iisc@gmail.com> wrote:
>>
>>> Hi Harshit,
>>>
>>> Is there any way to retain the partition id for each vertex in the
>>> adjacency list?
>>>
>>>
>>> Thanks
>>> Ravikant
>>>
>>> On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <
>>> ravikant.iisc@gmail.com> wrote:
>>>
>>>> Thanks Harshit
>>>>
>>>> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <mathursharp@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>> This may be the solution (I hope I understood the problem correctly):
>>>>>
>>>>> Job 1:
>>>>>
>>>>> You need to have two Mappers, one reading from the Edge file and the
>>>>> other reading from the Partition file.
>>>>> Say EdgeFileMapper and PartitionFileMapper, with a common Reducer.
>>>>> Now you can have a custom Writable (say GraphCustomObject) holding the
>>>>> following:
>>>>> 1) type: which mapper the record came from
>>>>> 2) adjacency vertex list: the list of adjacent vertices
>>>>> 3) partition id: to hold the partition id
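For wiring the two mappers to their input files, Hadoop's MultipleInputs API is one way to set up the Job 1 driver. This is only a hedged configuration sketch: EdgeFileMapper, PartitionFileMapper, VertexJoinReducer, and GraphCustomObject are the class names suggested in this thread, not pre-existing classes, and the sketch assumes the mapreduce (not mapred) API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JoinDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "edge-partition-join");
        job.setJarByClass(JoinDriver.class);

        // One mapper per input file; both emit (vertexId, GraphCustomObject).
        MultipleInputs.addInputPath(job, new Path(args[0]),
                TextInputFormat.class, EdgeFileMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]),
                TextInputFormat.class, PartitionFileMapper.class);

        // The common reducer merges both record types per vertexId.
        job.setReducerClass(VertexJoinReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(GraphCustomObject.class);

        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This fragment needs a Hadoop classpath and the mapper/reducer classes to compile, so it is wiring to adapt rather than code to run as-is.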
>>>>>
>>>>> Now the output key and value of the EdgeFileMapper will be,
>>>>> key => vertexId
>>>>> value => {type=edgefile; adjacencyVertex; partitionId=0 (not present
>>>>> in this file)}
>>>>>
>>>>> The output of the PartitionFileMapper will be,
>>>>> key => vertexId
>>>>> value => {type=partitionfile; adjacencyVertex=0; partitionId}
>>>>>
>>>>>
>>>>> So in the Reducer, for each vertexId, we can have the complete
>>>>> GraphCustomObject populated:
>>>>> vertexId => {adjacencyVertex complete list, partitionId}
>>>>>
>>>>> The output of this reducer will be,
>>>>> key => partitionId
>>>>> value => {adjacencyVertexList, vertexId}
>>>>> This will be stored as the output of Job 1.
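The reducer-side merge described above might look roughly like this. GraphCustomObject and the helper names are the ones suggested in this thread, sketched as a plain, Hadoop-free simulation of one reduce() call rather than a real Reducer implementation.

```java
import java.util.*;

// Sketch of the tagged value type from this thread (names assumed). The
// "type" tag records which mapper emitted the value, so the reducer can
// merge both sides of the join for one vertexId.
class GraphCustomObject {
    enum Type { EDGE_FILE, PARTITION_FILE }
    Type type;
    List<Long> adjacency = new ArrayList<>(); // set by EdgeFileMapper values
    int partitionId;                          // set by PartitionFileMapper values

    static GraphCustomObject edge(long... neighbors) {
        GraphCustomObject o = new GraphCustomObject();
        o.type = Type.EDGE_FILE;
        for (long n : neighbors) o.adjacency.add(n);
        return o;
    }

    static GraphCustomObject partition(int id) {
        GraphCustomObject o = new GraphCustomObject();
        o.type = Type.PARTITION_FILE;
        o.partitionId = id;
        return o;
    }
}

class VertexJoinReducer {
    // Simulates one reduce() call: all values for a single vertexId, from
    // both mappers, are folded into one complete record.
    static GraphCustomObject merge(Iterable<GraphCustomObject> values) {
        GraphCustomObject out = new GraphCustomObject();
        for (GraphCustomObject v : values) {
            if (v.type == GraphCustomObject.Type.EDGE_FILE) {
                out.adjacency.addAll(v.adjacency); // collect adjacency pieces
            } else {
                out.partitionId = v.partitionId;   // pick up the partition id
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Vertex 5 from the sample data: edges 5->6, 5->4, 5->7, partition 2.
        GraphCustomObject merged = merge(List.of(
                GraphCustomObject.edge(6, 4, 7),
                GraphCustomObject.partition(2)));
        System.out.println(merged.partitionId + " " + merged.adjacency); // 2 [6, 4, 7]
    }
}
```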
>>>>>
>>>>> Job 2:
>>>>> This job will read the output generated by the previous job and use the
>>>>> identity Mapper, so in the reducer we will have
>>>>> key => partitionId
>>>>> value => the list of all adjacency vertex lists along with their vertexIds
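The Job 2 grouping step can be simulated in a few lines, assuming job-1 output records of the form (partitionId, vertexId, adjacency list). This is only an illustration of what the shuffle delivers to each reduce call, not Hadoop code; all names are assumptions.

```java
import java.util.*;

// Minimal simulation of Job 2's shuffle: job-1 output records keyed by
// partitionId are grouped, so each reduce() call sees every vertex, with
// its adjacency list, belonging to one partition.
class PartitionGrouper {
    static class Record {
        final int partitionId;
        final long vertexId;
        final List<Long> adjacency;
        Record(int partitionId, long vertexId, List<Long> adjacency) {
            this.partitionId = partitionId;
            this.vertexId = vertexId;
            this.adjacency = adjacency;
        }
    }

    static Map<Integer, List<Record>> shuffle(List<Record> job1Output) {
        Map<Integer, List<Record>> byPartition = new TreeMap<>();
        for (Record r : job1Output) {
            byPartition.computeIfAbsent(r.partitionId,
                    k -> new ArrayList<>()).add(r);
        }
        return byPartition; // each entry is the input to one reduce() call
    }

    public static void main(String[] args) {
        // A few records from the sample data (partitionId, vertexId, adjacency).
        List<Record> job1Output = List.of(
                new Record(1, 2, List.of(3L)),
                new Record(2, 5, List.of(6L, 4L, 7L)),
                new Record(1, 4, List.of(2L, 3L)));
        System.out.println(shuffle(job1Output).keySet()); // [1, 2]
    }
}
```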
>>>>>
>>>>>
>>>>>
>>>>> I know my explanation is a bit messy; sorry for that.
>>>>>
>>>>> BR,
>>>>> Harshit
>>>>>
>>>>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <
>>>>> ravikant.iisc@gmail.com> wrote:
>>>>>
>>>>>> Hi Hadoop users,
>>>>>>
>>>>>> I want to use Hadoop to perform operations on graph data.
>>>>>> I have two files:
>>>>>>
>>>>>> 1. Edge list file:
>>>>>>         This file contains one line for each edge in the graph.
>>>>>> sample:
>>>>>> 1    2 (here 1 is the source and 2 is the sink node of the edge)
>>>>>> 1    5
>>>>>> 2    3
>>>>>> 4    2
>>>>>> 4    3
>>>>>> 5    6
>>>>>> 5    4
>>>>>> 5    7
>>>>>> 7    8
>>>>>> 8    9
>>>>>> 8    10
>>>>>>
>>>>>> 2. Partition file:
>>>>>>          This file contains one line for each vertex. Each line has
>>>>>> two values: the first number is <vertex id> and the second is
>>>>>> <partition id>.
>>>>>>  sample : <vertex id>  <partition id>
>>>>>> 2    1
>>>>>> 3    1
>>>>>> 4    1
>>>>>> 5    2
>>>>>> 6    2
>>>>>> 7    2
>>>>>> 8    1
>>>>>> 9    1
>>>>>> 10    1
>>>>>>
>>>>>>
>>>>>> The Edge list file is 32 GB in size, while the partition file is
>>>>>> 10 GB.
>>>>>> (The sizes are so large that map/reduce can read only the partition
>>>>>> file. I have a 20-node cluster with 24 GB of memory per node.)
>>>>>>
>>>>>> My aim is to get all vertices (along with their adjacency lists)
>>>>>> that have the same partition id into one reducer, so that I can
>>>>>> perform further analytics on a given partition in the reducer.
>>>>>>
>>>>>> Is there any way in Hadoop to join these two files in the mapper so
>>>>>> that I can map based on the partition id?
>>>>>>
>>>>>> Thanks
>>>>>> Ravikant
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Harshit Mathur
>>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> Harshit Mathur
>>
>
>


-- 
Harshit Mathur
