From: Michael Segel
Reply-To: user@hbase.apache.org
Subject: RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
Date: Fri, 16 Sep 2011 15:24:36 -0500

Doug and company...

Look, I'm not saying that there aren't m/r jobs where you might need reducers when working with HBase. What I am saying is that if we look at what you're attempting to do, you may end up getting better performance if you created a temp table in HBase and let HBase do some of the heavy lifting where you are currently using a reducer. From the jobs that we run, when we looked at what we were doing, there wasn't any need for a reducer. I suspect that it's true of other jobs.

Remember that HBase is much more than just an HFile format to persist stuff.

Even looking at Sonal's example... you have other ways of doing the record counts, like dynamic counters or using a temp table in HBase, which I believe will give you better performance numbers, although I haven't benchmarked either against a reducer.

Does that make sense?

-Mike

> From: doug.meil@explorysmedical.com
> To: user@hbase.apache.org
> Date: Fri, 16 Sep 2011 15:41:44 -0400
> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>
> Chris, agreed... There are times when reducers aren't required, and
> then situations where they are useful. We have both kinds of jobs.
>
> For others following the thread, I updated the book recently with more MR
> examples (read-only, read-write, read-summary)
>
> http://hbase.apache.org/book.html#mapreduce.example
>
> As to the question that started this thread...
>
> re: "Store aggregated data in Oracle."
>
> To me, that sounds like the "read-summary" example with JDBC-Oracle in
> the reduce step.
>
> On 9/16/11 2:58 PM, "Chris Tarnas" wrote:
>
> >If only I could make NY in Nov :)
> >
> >We extract out large numbers of DNA sequence reads from HBase, run them
> >through M/R pipelines to analyze and aggregate and then we load the
> >results back in. Definitely specialized usage, but I could see other
> >perfectly valid uses for reducers with HBase.
> >
> >-chris
> >
> >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
> >
> >> Sonal,
> >>
> >> You do realize that HBase is a "database", right? ;-)
> >>
> >> So again, why do you need a reducer? ;-)
> >>
> >> Using your example...
> >> "Again, there will be many cases where one may want a reducer, say
> >> trying to count the occurrence of words in a particular column."
> >>
> >> You can do this one of two ways...
> >> 1) Dynamic Counters in Hadoop.
> >> 2) Use a temp table and auto increment the value in a column which
> >> contains the word count. (Fat row where rowkey is doc_id and column is
> >> word or rowkey is doc_id|word)
> >>
> >> I'm sorry but if you go through all of your examples of why you would
> >> want to use a reducer, you end up finding out that writing to an HBase
> >> table would be faster than a reduce job.
> >> (Again we haven't done an exhaustive search, but in all of the HBase
> >> jobs we've run... no reducers were necessary.)
> >>
> >> The point I'm trying to make is that you want to avoid using a reducer
> >> whenever possible and if you think about your problem... you can
> >> probably come up with a solution that avoids the reducer...
> >>
> >> HTH
> >>
> >> -Mike
> >> PS. I haven't looked at *all* of the potential use cases of HBase, which
> >> is why I don't want to say you'll never need a reducer.
> >> I will say that based on what we've done at my client's site, we try
> >> very hard to avoid reducers.
> >> [Note, I'm sure I'm going to get hammered on this when I head to NY in
> >> Nov. :-) ]
> >>
> >>> Date: Fri, 16 Sep 2011 23:00:49 +0530
> >>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> >>> From: sonalgoyal4@gmail.com
> >>> To: user@hbase.apache.org
> >>>
> >>> Hi Michael,
> >>>
> >>> Yes, thanks, I understand the fact that reducers can be expensive with
> >>> all the shuffling and the sorting, and you may not need them always. At
> >>> the same time, there are many cases where reducers are useful, like
> >>> secondary sorting. In many cases, one can have multiple map phases and
> >>> not have a reduce phase at all. Again, there will be many cases where
> >>> one may want a reducer, say trying to count the occurrence of words in a
> >>> particular column.
> >>>
> >>> With this thought chain, I do not feel ready to say that when dealing
> >>> with HBase, I really don't want to use a reducer. Please correct me if I
> >>> am wrong.
> >>>
> >>> Thanks again.
> >>>
> >>> Best Regards,
> >>> Sonal
> >>> Crux: Reporting for HBase
> >>> Nube Technologies
> >>>
> >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel wrote:
> >>>
> >>>> Sonal,
> >>>>
> >>>> Just because you have a m/r job doesn't mean that you need to reduce
> >>>> anything. You can have a job that contains only a mapper.
> >>>> Or your job runner can have a series of map jobs in serial.
> >>>>
> >>>> Most if not all of the map/reduce jobs where we pull data from HBase
> >>>> don't require a reducer.
> >>>>
> >>>> To give you a simple example... if I want to determine the table
> >>>> schema where I am storing some sort of structured data...
> >>>> I just write a m/r job which opens a table, scans the table counting
> >>>> the occurrence of each column name via dynamic counters.
> >>>>
> >>>> There is no need for a reducer.
> >>>>
> >>>> Does that help?
> >>>>
> >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
> >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> >>>>> From: sonalgoyal4@gmail.com
> >>>>> To: user@hbase.apache.org
> >>>>>
> >>>>> Michel,
> >>>>>
> >>>>> Sorry, can you please help me understand what you mean when you say
> >>>>> that when dealing with HBase, you really don't want to use a reducer?
> >>>>> Here, HBase is being used as the input to the MR job.
> >>>>>
> >>>>> Thanks
> >>>>> Sonal
> >>>>>
> >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel wrote:
> >>>>>
> >>>>>> I think you need to get a little bit more information.
> >>>>>> Reducers are expensive.
> >>>>>> When Thomas says that he is aggregating data, what exactly does he
> >>>>>> mean?
> >>>>>> When dealing with HBase, you really don't want to use a reducer.
> >>>>>>
> >>>>>> You may want to run two map jobs and it could be that just dumping
> >>>>>> the output via JDBC makes the most sense.
> >>>>>>
> >>>>>> We are starting to see a lot of questions where the OP isn't
> >>>>>> providing enough information so that the recommendation could be
> >>>>>> wrong...
> >>>>>>
> >>>>>> Sent from a remote device. Please excuse any typos...
> >>>>>>
> >>>>>> Mike Segel
> >>>>>>
> >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal wrote:
> >>>>>>
> >>>>>>> There is a DBOutputFormat class in the
> >>>>>>> org.apache.hadoop.mapreduce.lib.db package, you could use that. Or
> >>>>>>> you could write to HDFS and then use something like HIHO[1] to
> >>>>>>> export to the db.
> >>>>>>> I have been working extensively in this area; you can write to me
> >>>>>>> directly if you need any help.
> >>>>>>>
> >>>>>>> 1. https://github.com/sonalgoyal/hiho
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> Sonal
> >>>>>>> Crux: Reporting for HBase
> >>>>>>> Nube Technologies
> >>>>>>>
> >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas
> >>>>>>> <Thomas.Steinmaurer@scch.at> wrote:
> >>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> writing a MR-Job to process HBase data and store aggregated data in
> >>>>>>>> Oracle. How would you do that in a MR-job?
> >>>>>>>>
> >>>>>>>> Currently, for test purposes we write the result into a HBase table
> >>>>>>>> again by using a TableReducer. Is there something like a
> >>>>>>>> OracleReducer, RelationalReducer, JDBCReducer or whatever? Or should
> >>>>>>>> one simply use plain JDBC code in the reduce step?
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> Thomas
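
For anyone following the thread, the temp-table option mentioned above (row key doc_id|word, with the count column incremented from the mapper so that no reduce step is needed) can be sketched roughly as follows. This is only a plain-Java illustration under stated assumptions: an in-memory TreeMap stands in for the HBase temp table, and the names MapSideWordCount, increment, map and snapshot are hypothetical and not part of HBase or Hadoop. In a real job the increment would be issued against HBase from the mapper rather than against a local map.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java stand-in for the "temp table + increment" idea.
// ASSUMPTION: a TreeMap plays the role of the HBase temp table;
// no HBase or Hadoop API is used, and all names here are hypothetical.
final class MapSideWordCount {

    // Simulated temp table: row key "doc_id|word" -> running count.
    private final Map<String, Long> tempTable = new TreeMap<>();

    // Stand-in for an increment on a counter column; returns the new value.
    long increment(String rowKey, long amount) {
        return tempTable.merge(rowKey, amount, Long::sum);
    }

    // What a map() call might do for one input row:
    // one increment per word, nothing is emitted to a reducer.
    void map(String docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                increment(docId + "|" + word, 1L);
            }
        }
    }

    // Copy of the simulated table, sorted by row key.
    Map<String, Long> snapshot() {
        return new TreeMap<>(tempTable);
    }

    public static void main(String[] args) {
        MapSideWordCount job = new MapSideWordCount();
        job.map("doc1", "hbase is a database");
        job.map("doc1", "hbase stores rows");
        job.map("doc2", "a reducer is optional");
        job.snapshot().forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Because every increment is keyed by doc_id|word, the counts accumulate while the mappers run and nothing has to be shuffled or sorted for a reducer. Whether this is actually faster than a reduce step was not benchmarked in the thread.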