hadoop-user mailing list archives

From Russell Jurney <russell.jur...@gmail.com>
Subject Re: Joins in Hadoop
Date Thu, 25 Jun 2015 07:01:54 GMT
You are insane to do this with MapReduce. Use Pig, Hive, or Spark and
perform a join. This will take you less than ten minutes, including the
time to download and install Pig or Hive and run them on your data. For
example, see http://pig.apache.org/docs/r0.15.0/basic.html#join-inner
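For a sense of what that join does conceptually, here is a plain-Python
sketch over the two sample files from this thread (in-memory lists here;
Pig, Hive, and Spark do the same thing out of core, which is the point of
using them):

```python
# Hypothetical in-memory sketch of an inner join on vertex id,
# joining (source, sink) edges against (vertex, partition) pairs.
def inner_join(edges, partitions):
    part = dict(partitions)  # vertex id -> partition id
    # Keep each edge whose source vertex has a known partition.
    return [(src, dst, part[src]) for src, dst in edges if src in part]

edges = [(1, 2), (2, 3), (4, 2), (5, 6)]
partitions = [(2, 1), (4, 1), (5, 2)]
print(inner_join(edges, partitions))  # [(2, 3, 1), (4, 2, 1), (5, 6, 2)]
```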

For curiosity's sake, check out this join implementation in Python:
https://github.com/bd4c/big_data_for_chimps-code/blob/master/examples/ch_07/join.py

And this book, which explains MapReduce joins:
https://books.google.com/books?id=GxFYuVZHG60C&lpg=PP1&dq=Mapreduce%20algorithms&pg=PA59#v=snippet&q=3.5%20Relational%20joins&f=false

Using Java and the raw MapReduce APIs to solve this problem is an
exercise in pure futility, unless you're doing this to learn, in which
case these links should help.

My book (with Flip Kromer), Big Data for Chimps, covers joins in Pig and
Python:
https://github.com/infochimps-labs/big_data_for_chimps/blob/master/Ch07-joining_patterns.asciidoc

It is due out in a few weeks.

On Wednesday, June 24, 2015, Harshit Mathur <mathursharp@gmail.com> wrote:

> So basically you want <vertex_id, partitionId> as your key?
> If this is the case, then you can have your custom key object by
> implementing WritableComparable.
>
> But I am not sure the logic permits doing this in a single MapReduce
> job. As per my understanding of your problem, what you want to
> achieve will need two jobs.
>
> On Thu, Jun 25, 2015 at 10:06 AM, Ravikant Dindokar <ravikant.iisc@gmail.com> wrote:
>
>> But in the reducer for Job 1, you have:
>> vertexId => {adjacencyVertex complete list, partitionId=0}
>>
>> So the partition IDs for the vertices in the adjacency list are not
>> available. Essentially, the output I am trying to get is
>>
>> <vertex_id, partitionId>, <list>
>> where each element of the list is of type <vertex_id, partitionId>.
>>
>> Can this be achieved in a single map-reduce job?
>>
>> Thanks
>> Ravikant
>>
>> On Thu, Jun 25, 2015 at 9:25 AM, Harshit Mathur <mathursharp@gmail.com> wrote:
>>
>>> Yeah, you can store it in your custom object as well, the same way
>>> you are storing the adjacency list.
>>>
>>> On Wed, Jun 24, 2015 at 10:10 PM, Ravikant Dindokar <ravikant.iisc@gmail.com> wrote:
>>>
>>>> Hi Harshit,
>>>>
>>>> Is there any way to retain the partition id for each vertex in the
>>>> adjacency list?
>>>>
>>>>
>>>> Thanks
>>>> Ravikant
>>>>
>>>> On Wed, Jun 24, 2015 at 7:55 PM, Ravikant Dindokar <ravikant.iisc@gmail.com> wrote:
>>>>
>>>>> Thanks Harshit
>>>>>
>>>>> On Wed, Jun 24, 2015 at 5:35 PM, Harshit Mathur <mathursharp@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> This may be the solution (I hope I understood the problem correctly):
>>>>>>
>>>>>> Job 1:
>>>>>>
>>>>>> You need to have two Mappers, one reading from the edge file and
>>>>>> the other reading from the partition file.
>>>>>> Say, EdgeFileMapper and PartitionFileMapper, and a common Reducer.
>>>>>> Now you can have a custom Writable (say GraphCustomObject) holding
>>>>>> the following:
>>>>>> 1) type: a marker for which mapper the record came from
>>>>>> 2) adjacency vertex list: the list of adjacent vertices
>>>>>> 3) partition id: to hold the partition id
>>>>>>
>>>>>> Now the output key and value of the EdgeFileMapper will be:
>>>>>> key => vertexId
>>>>>> value => {type=edgeFile; adjacencyVertex; partitionId=0 (not
>>>>>> present in this file)}
>>>>>>
>>>>>> The output of the PartitionFileMapper will be:
>>>>>> key => vertexId
>>>>>> value => {type=partitionFile; adjacencyVertex=0; partitionId}
>>>>>>
>>>>>>
>>>>>> So in the Reducer, for each vertexId, we can have the complete
>>>>>> GraphCustomObject populated:
>>>>>> vertexId => {complete adjacencyVertex list, partitionId}
>>>>>>
>>>>>> The output of this reducer will be:
>>>>>> key => partitionId
>>>>>> value => {adjacencyVertexList, vertexId}
>>>>>> This will be stored as the output of Job 1.
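The Job 1 flow quoted above can be sketched in plain Python (a
hypothetical single-process simulation, not Hadoop API code; `job1` and
the record layout are illustrative names only):

```python
from collections import defaultdict

# The two "mapper" loops play EdgeFileMapper and PartitionFileMapper,
# tagging records by source; the final loop plays the common Reducer,
# which sees the merged record per vertexId and re-keys by partition id.
def job1(edges, partitions):
    merged = defaultdict(lambda: {"adj": [], "pid": None})
    # EdgeFileMapper: key = vertexId, value = an adjacent vertex
    for src, dst in edges:
        merged[src]["adj"].append(dst)
    # PartitionFileMapper: key = vertexId, value = partition id
    for vid, pid in partitions:
        merged[vid]["pid"] = pid
    # Reducer: emit key = partitionId, value = (vertexId, adjacency list)
    return [(rec["pid"], (vid, rec["adj"]))
            for vid, rec in merged.items()
            if rec["pid"] is not None]

edges = [(2, 3), (4, 2), (4, 3)]
partitions = [(2, 1), (3, 1), (4, 1)]
print(job1(edges, partitions))
```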
>>>>>>
>>>>>> Job 2:
>>>>>> This job will read the output generated by the previous job and
>>>>>> use an identity Mapper, so in the reducer we will have:
>>>>>> key => partitionId
>>>>>> value => a list of all the adjacency lists along with their vertexIds
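Job 2 is essentially a group-by on the partition id; a hypothetical
single-process sketch (again illustrative, not Hadoop API code):

```python
from collections import defaultdict

# Identity map, then the shuffle groups Job 1 records by partition id,
# so one "reducer" call sees every vertex (with its adjacency list)
# belonging to that partition.
def job2(job1_output):
    by_partition = defaultdict(list)
    for pid, vertex_with_adj in job1_output:
        by_partition[pid].append(vertex_with_adj)
    return dict(by_partition)

print(job2([(1, (2, [3])), (2, (5, [6, 4, 7])), (1, (4, [2, 3]))]))
```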
>>>>>>
>>>>>>
>>>>>>
>>>>>> I know my explanation seems a bit messy, sorry for that.
>>>>>>
>>>>>> BR,
>>>>>> Harshit
>>>>>>
>>>>>> On Wed, Jun 24, 2015 at 12:05 PM, Ravikant Dindokar <ravikant.iisc@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Hadoop user,
>>>>>>>
>>>>>>> I want to use Hadoop to perform operations on graph data.
>>>>>>> I have two files:
>>>>>>>
>>>>>>> 1. Edge list file:
>>>>>>>         This file contains one line for each edge in the graph.
>>>>>>> sample:
>>>>>>> 1    2 (here 1 is the source and 2 is the sink vertex of the edge)
>>>>>>> 1    5
>>>>>>> 2    3
>>>>>>> 4    2
>>>>>>> 4    3
>>>>>>> 5    6
>>>>>>> 5    4
>>>>>>> 5    7
>>>>>>> 7    8
>>>>>>> 8    9
>>>>>>> 8    10
>>>>>>>
>>>>>>> 2. Partition file:
>>>>>>>          This file contains one line for each vertex. Each line
>>>>>>> has two values: the first is the <vertex id> and the second is
>>>>>>> the <partition id>.
>>>>>>>  sample : <vertex id>  <partition id>
>>>>>>> 2    1
>>>>>>> 3    1
>>>>>>> 4    1
>>>>>>> 5    2
>>>>>>> 6    2
>>>>>>> 7    2
>>>>>>> 8    1
>>>>>>> 9    1
>>>>>>> 10    1
>>>>>>>
>>>>>>>
>>>>>>> The edge list file is 32 GB, while the partition file is 10 GB.
>>>>>>> (The sizes are so large that a map/reduce task can read only the
>>>>>>> partition file. I have a 20-node cluster with 24 GB of memory
>>>>>>> per node.)
>>>>>>>
>>>>>>> My aim is to get all vertices (along with their adjacency lists)
>>>>>>> that have the same partition id into one reducer, so that I can
>>>>>>> perform further analytics on a given partition in the reducer.
>>>>>>>
>>>>>>> Is there any way in Hadoop to join these two files in the mapper,
>>>>>>> so that I can map based on the partition id?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ravikant
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Harshit Mathur
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Harshit Mathur
>>>
>>
>>
>
>
> --
> Harshit Mathur
>


-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com
