hadoop-mapreduce-user mailing list archives

From Ashwanth Kumar <ashwanthku...@googlemail.com>
Subject Re: Calling one MR job within another MR job
Date Thu, 05 Apr 2012 06:58:58 GMT
What I understand is that you are looking up a value by key, so I guess you
should look at a key-value datastore (like Voldemort). But then, accessing
the datastore for each key in the 2nd MR job would be a costly operation,
which might require additional tuning of the datastore.

PS - I am not sure if this is a good practice.
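If an external datastore is used this way, one mitigation for the per-key cost is to memoize lookups inside each mapper, so repeated keys in an input split do not trigger repeated round trips. A minimal plain-Python sketch of that idea, using a dict as a stand-in for the datastore client (the names below are invented for illustration, not the Voldemort API):

```python
from functools import lru_cache

# Stand-in for an external key-value datastore; a real client
# would issue one network round trip per get().
DATASTORE = {"user1": "profileA", "user2": "profileB"}
CALLS = {"count": 0}

def datastore_get(key):
    CALLS["count"] += 1  # each call models one network round trip
    return DATASTORE.get(key)

@lru_cache(maxsize=10000)
def cached_get(key):
    # Per-mapper cache: repeated keys in the split hit memory,
    # not the datastore.
    return datastore_get(key)

# Simulate the 2nd MR job's mapper seeing repeated keys.
input_keys = ["user1", "user2", "user1", "user1", "user2"]
values = [cached_get(k) for k in input_keys]
print(values)          # a value for each of the 5 input lines
print(CALLS["count"])  # only 2 actual datastore round trips
```

This only helps when keys repeat within a split; for mostly unique keys the per-key cost remains.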

On Thu, Apr 5, 2012 at 11:37 AM, Stuti Awasthi <stutiawasthi@hcl.com> wrote:

> Thanks everyone,
>
> So from this discussion, there are 2 main opinions I got:
>
> 1. Do not call one MR job from inside another MR job.
> 2. The distributed cache can be used (but it is not good for very large files).
>
> I want to design the system so that I can do the processing efficiently.
> So I run an MR job to process File2 first and store its output in
> key-value format in HDFS.
>
> Once this job is complete, I start another MR job to process File1.
> Each input line of File1 will then need to get some data from the
> output of the first MR job.
>
> 1. The straightforward way to do this is: for each input line of the 2nd MR
> job, loop through the contents of the output of MR job 1 and get the
> relevant data for processing.
> 2. Since I have stored the output of File2 in key-value format, can
> I directly get the value for a specific key?
>
> So I want to know: if I have output1 in key-value format in HDFS, and I run
> a separate job with a different input file that needs to access data from
> output1 by key, can we achieve that without looping over output1?
>
> Thanks
>
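One common answer to the question above is a reduce-side join: instead of any mapper looping over output1, a second job reads both output1 and File1, keys every record by the join key, and lets the shuffle bring matching records together in one reducer call. The sketch below is a plain-Python simulation of that pattern (the dataset contents are invented for illustration; this is not Hadoop API code):

```python
from collections import defaultdict

# Output of the first job over File2: (key, value) records.
output1 = [("k1", "v1"), ("k2", "v2")]
# Lines of File1 for the second job, each needing output1[key].
file1 = [("k1", "lineA"), ("k2", "lineB"), ("k1", "lineC")]

# "Map" phase: tag each record with its source so the reducer
# can tell lookup values apart from File1 lines.
tagged = [(k, ("OUT1", v)) for k, v in output1] + \
         [(k, ("FILE1", line)) for k, line in file1]

# "Shuffle": group everything by key, as MapReduce would.
groups = defaultdict(list)
for k, rec in tagged:
    groups[k].append(rec)

# "Reduce": join each File1 line with the matching output1 value;
# no record ever scans all of output1.
joined = []
for k, recs in sorted(groups.items()):
    values = [v for tag, v in recs if tag == "OUT1"]
    lines = [l for tag, l in recs if tag == "FILE1"]
    for v in values:
        for line in lines:
            joined.append((k, line, v))
print(joined)
```

Each input line is joined in time proportional to its own group, not to the size of output1, which is exactly what the question asks for.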
>
> *From:* Praveen Kumar K J V S [mailto:praveenkjvs.developer@gmail.com]
> *Sent:* Wednesday, April 04, 2012 6:43 PM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* Re: Calling one MR job within another MR job
>
>
> Dear Stuti,
>
> From the mail chain I understand you want to do a set join on two sets,
> File1 and File2, with some join function F(F1, F2). On this assumption,
> please find my reply below:
>
> A set join is not simple, especially if the input is very large. It
> essentially does a cartesian product between the two sets F1 and F2 and
> filters out the required data based on some function F(F1, F2).
>
> What I mean is: say you have two files, each with 10 lakh (1 million)
> lines. To perform a naive set join you essentially do 10 lakh x 10 lakh
> (10^12) pairwise operations, and the filter phase works on all of these
> results to filter out the required ones.
>
> Hence, since such a problem grows quadratically with input size, it is
> helpful to know how the set-join function works; having such insight helps.
>
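To make the cartesian-product cost described above concrete, here is a small plain-Python sketch (not Hadoop code; the data is invented) comparing a naive nested-loop set join with a hash join over the same inputs:

```python
# Naive set join: nested loops touch every pair (n*m comparisons),
# which is the cartesian-product behaviour described above.
def naive_join(a, b):
    comparisons = 0
    out = []
    for ka, va in a:
        for kb, vb in b:
            comparisons += 1
            if ka == kb:
                out.append((ka, va, vb))
    return out, comparisons

# Hash join: build an index over one side, probe with the other
# (roughly n + m work when the build side's keys are unique).
def hash_join(a, b):
    index = {}
    for kb, vb in b:
        index.setdefault(kb, []).append(vb)
    out = []
    for ka, va in a:
        for vb in index.get(ka, []):
            out.append((ka, va, vb))
    return out

A = [(i, "a%d" % i) for i in range(1000)]
B = [(i, "b%d" % i) for i in range(1000)]
res1, comps = naive_join(A, B)
res2 = hash_join(A, B)
print(comps)                  # 1000000 pairwise comparisons
print(len(res1), len(res2))   # both find the same 1000 matches
```

At 10 lakh lines per side, the nested-loop count grows to 10^12, which is why equality joins in MapReduce rely on the shuffle or a hash table rather than pairwise comparison.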
>
> Though I have to admit that these kinds of problems are still under active
> research; please refer to the links below for more detail:
>
> 1. http://www.youtube.com/watch?v=kiuUGXWRzPA - Google Tech Talks
> 2. http://www.slideshare.net/rvernica/efficient-parallel-setsimilarity-joins-using-mapreduce-sigmod-2010-slides
> 3. http://research.microsoft.com/apps/pubs/default.aspx?id=76165
> 4. http://www.slideshare.net/ydn/4-similarity-joinshadoopsummit2010
>
> @Distributed cache: It is not great if you have huge files. By default you
> have a size limit of 10GB as the max size for a distributed file.
>
> @Launching jobs inside a mapper: Not a great idea, because for every
> key-value pair you would launch a job, so you would end up launching a
> very large number of jobs. Absolutely no. A bug in production could bring
> down the cluster. Also, it is difficult to track all these jobs.
>
> Thanks,
>
> Praveen
>
> On Wed, Apr 4, 2012 at 6:17 PM, <jagatsingh@gmail.com> wrote:
>
> Hello Stuti
>
> The way you have explained it, it seems we can think about caching file2
> on the nodes in advance.
>
> -- Just for context: replicated joins are handled the same way in Pig,
> where one of the files to be joined (file2) is cached in memory while
> file1 is streamed past it.
>
> Regards
>
> Jagat
>
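The replicated (map-side) join mentioned above can be simulated in a few lines: the small file is loaded into memory on every node (as the distributed cache would arrange), and each mapper joins its split of the big file locally, with no shuffle. A plain-Python sketch under the assumption that file2 fits in memory (names invented for illustration):

```python
# The cached side (file2) as an in-memory table; with the
# distributed cache, every node would hold a copy of this.
small_file2 = {"k1": "v1", "k2": "v2"}

def mapper(file1_split, cached):
    # Each File1 line in this mapper's split is joined against
    # the in-memory table; unmatched keys are dropped.
    for key, line in file1_split:
        if key in cached:
            yield (key, line, cached[key])

split = [("k1", "lineA"), ("k3", "lineX"), ("k2", "lineB")]
result = list(mapper(split, small_file2))
print(result)
# -> [('k1', 'lineA', 'v1'), ('k2', 'lineB', 'v2')]
```

This is the fastest join pattern when one side is small, but it is exactly the approach the thread warns against for very large file2, since every node must hold the whole table.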
>
> ----- Original Message -----
>
> From: Stuti Awasthi
> Sent: 04/04/12 07:55 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: RE: Calling one MR job within another MR job
>
> Hi Ravi,
>
> There is no job dependency, so I cannot use MR chaining or JobControl as
> you suggested.
>
> I have 2 relatively big files. I start processing with File1 as input to
> the MR1 job; this processing requires finding data from File2. One way
> to do this is to loop through File2 and get the data; the other way is to
> pass File2 to an MR2 job for parallel processing.
>
> The second option is what hints me towards calling an MR2 job from inside
> the MR1 job. I am sure this is a common problem that people usually face.
> What is the best way to resolve this kind of issue?
>
> Thanks
>
> *From:* Ravi teja ch n v [mailto:raviteja.chnv@huawei.com]
> *Sent:* Wednesday, April 04, 2012 4:35 PM
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* RE: Calling one MR job within another MR job
>
>
> Hi Stuti,
>
> If you are looking for MRjob2 to run after MRjob1, i.e. a job dependency,
> you can use the JobControl API, where you can manage the dependencies.
>
> Calling another Job from a Mapper is not a good idea.
>
> Thanks,
>
> Ravi Teja
>
> ------------------------------
>
> *From:* Stuti Awasthi [stutiawasthi@hcl.com]
> *Sent:* 04 April 2012 16:04:19
> *To:* mapreduce-user@hadoop.apache.org
> *Subject:* Calling one MR job within another MR job
>
>
> Hi all,
>
> We have a use case in which I start with a first MR1 job with input file
> File1.txt and, from inside this job, call another MR2 job with input
> File2.txt.
>
> So:
>
> MRjob1 {
>     Map() {
>         MRJob2(File2.txt)
>     }
> }
>
> MRJob2 {
>     Processing...
> }
>
> My queries are: is this kind of approach possible, and how large are the
> performance implications?
>
> Regards,
>
> Stuti Awasthi
> HCL Comnet Systems and Services Ltd
> F-8/9 Basement, Sec-3, Noida.
>



-- 

Ashwanth Kumar / ashwanthkumar.in
