Content-Type: multipart/alternative;
	boundary="_5bfff7ae-3469-4fb1-bb2e-d47a25bc4c46_"
From: Michael Segel <michael_segel@hotmail.com>
To: <user@hbase.apache.org>
Subject: RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
Date: Fri, 16 Sep 2011 15:24:36 -0500
Importance: Normal
In-Reply-To: <CA991D95.EE19%doug.meil@explorysmedical.com>
References: 
 <F5E0156E-3D7C-4C04-9AEB-C1C2F9B505E3@email.com>,<CA991D95.EE19%doug.meil@explorysmedical.com>
MIME-Version: 1.0

--_5bfff7ae-3469-4fb1-bb2e-d47a25bc4c46_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


Doug and company...

Look, I'm not saying that there aren't m/r jobs where you might need
reducers when working with HBase. What I am saying is that if we look at
what you're attempting to do, you may end up getting better performance
if you create a temp table in HBase and let HBase do some of the heavy
lifting where you are currently using a reducer. From the jobs that we
run, when we looked at what we were doing, there wasn't any need for a
reducer. I suspect that it's true of other jobs.

Remember that HBase is much more than just an HFile format to persist
stuff.

Even looking at Sonal's example... you have other ways of doing the
record counts, like dynamic counters or using a temp table in HBase,
which I believe will give you better performance numbers, although I
haven't benchmarked either against a reducer.
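
To make the temp-table idea concrete, here is a minimal sketch in plain
Python. It does not use the HBase API: a Counter stands in for the temp
table (an assumption made purely for illustration; real code would issue
atomic increments against an HBase table), and the doc_id|word rowkey is
the layout suggested earlier in this thread. The function names are
hypothetical.

```python
from collections import Counter


def map_only_word_count(docs, table=None):
    """Count words per document with map-side increments only.

    docs:  dict mapping doc_id -> text of the column being counted.
    table: Counter standing in for an HBase temp table (assumption).
    """
    table = Counter() if table is None else table
    for doc_id, text in docs.items():      # one iteration ~ one map() call
        for word in text.lower().split():
            rowkey = f"{doc_id}|{word}"    # rowkey layout: doc_id|word
            table[rowkey] += 1             # stand-in for an atomic increment
    return table


def total_for_word(table, word):
    """Sum one word's count across all documents (stand-in for a scan)."""
    return sum(n for key, n in table.items()
               if key.split("|", 1)[1] == word)


docs = {"doc1": "to be or not to be", "doc2": "to scan or to get"}
counts = map_only_word_count(docs)
```

There is no shuffle or sort step here: each map call writes its
increments straight to the table, which does the aggregation. Whether
that actually beats a reducer on a real cluster is exactly the part that
has not been benchmarked.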

Does that make sense?

-Mike


> From: doug.meil@explorysmedical.com
> To: user@hbase.apache.org
> Date: Fri, 16 Sep 2011 15:41:44 -0400
> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>
> Chris, agreed... There are times when reducers aren't required, and
> then situations where they are useful.  We have both kinds of jobs.
>
> For others following the thread, I updated the book recently with more
> MR examples (read-only, read-write, read-summary)
>
> http://hbase.apache.org/book.html#mapreduce.example
>
> As to the question that started this thread...
>
> re:  "Store aggregated data in Oracle."
>
> To me, that sounds like the "read-summary" example with JDBC-Oracle in
> the reduce step.
>
> On 9/16/11 2:58 PM, "Chris Tarnas" <cft@email.com> wrote:
>
> >If only I could make NY in Nov :)
> >
> >We extract out large numbers of DNA sequence reads from HBase, run them
> >through M/R pipelines to analyze and aggregate and then we load the
> >results back in. Definitely specialized usage, but I could see other
> >perfectly valid uses for reducers with HBase.
> >
> >-chris
> >
> >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
> >
> >>
> >> Sonal,
> >>
> >> You do realize that HBase is a "database", right? ;-)
> >>
> >> So again, why do you need a reducer?  ;-)
> >>
> >> Using your example...
> >> "Again, there will be many cases where one may want a reducer, say
> >> trying to count the occurrence of words in a particular column."
> >>
> >> You can do this one of two ways...
> >> 1) Dynamic Counters in Hadoop.
> >> 2) Use a temp table and auto increment the value in a column which
> >> contains the word count.  (Fat row where rowkey is doc_id and column
> >> is word, or rowkey is doc_id|word)
> >>
> >> I'm sorry but if you go through all of your examples of why you would
> >> want to use a reducer, you end up finding out that writing to an HBase
> >> table would be faster than a reduce job.
> >> (Again we haven't done an exhaustive search, but in all of the HBase
> >> jobs we've run... no reducers were necessary.)
> >>
> >> The point I'm trying to make is that you want to avoid using a reducer
> >> whenever possible and if you think about your problem... you can
> >> probably come up with a solution that avoids the reducer...
> >>
> >> HTH
> >>
> >> -Mike
> >> PS. I haven't looked at *all* of the potential use cases of HBase, which
> >> is why I don't want to say you'll never need a reducer. I will say that
> >> based on what we've done at my client's site, we try very hard to avoid
> >> reducers.
> >> [Note, I'm sure I'm going to get hammered on this when I head to NY in
> >> Nov. :-)   ]
> >>
> >>> Date: Fri, 16 Sep 2011 23:00:49 +0530
> >>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> >>> From: sonalgoyal4@gmail.com
> >>> To: user@hbase.apache.org
> >>>
> >>> Hi Michael,
> >>>
> >>> Yes, thanks, I understand the fact that reducers can be expensive with
> >>> all the shuffling and the sorting, and you may not always need them. At
> >>> the same time, there are many cases where reducers are useful, like
> >>> secondary sorting. In many cases, one can have multiple map phases and
> >>> not have a reduce phase at all. Again, there will be many cases where
> >>> one may want a reducer, say trying to count the occurrence of words in
> >>> a particular column.
> >>>
> >>> With this thought chain, I do not feel ready to say that when dealing
> >>> with HBase, I really don't want to use a reducer. Please correct me if
> >>> I am wrong.
> >>>
> >>> Thanks again.
> >>>
> >>> Best Regards,
> >>> Sonal
> >>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> >>> Nube Technologies <http://www.nubetech.co>
> >>>
> >>> <http://in.linkedin.com/in/sonalgoyal>
> >>>
> >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
> >>> <michael_segel@hotmail.com> wrote:
> >>>
> >>>>
> >>>> Sonal,
> >>>>
> >>>> Just because you have a m/r job doesn't mean that you need to reduce
> >>>> anything. You can have a job that contains only a mapper.
> >>>> Or your job runner can have a series of map jobs in serial.
> >>>>
> >>>> Most if not all of the map/reduce jobs where we pull data from HBase
> >>>> don't require a reducer.
> >>>>
> >>>> To give you a simple example... if I want to determine the table
> >>>> schema where I am storing some sort of structured data...
> >>>> I just write a m/r job which opens a table and scans the table,
> >>>> counting the occurrence of each column name via dynamic counters.
> >>>>
> >>>> There is no need for a reducer.
> >>>>
> >>>> Does that help?
> >>>>
> >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
> >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> >>>>> From: sonalgoyal4@gmail.com
> >>>>> To: user@hbase.apache.org
> >>>>>
> >>>>> Michel,
> >>>>>
> >>>>> Sorry, can you please help me understand what you mean when you say
> >>>>> that when dealing with HBase, you really don't want to use a reducer?
> >>>>> Here, HBase is being used as the input to the MR job.
> >>>>>
> >>>>> Thanks
> >>>>> Sonal
> >>>>>
> >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
> >>>>> <michael_segel@hotmail.com> wrote:
> >>>>>
> >>>>>> I think you need to get a little bit more information.
> >>>>>> Reducers are expensive.
> >>>>>> When Thomas says that he is aggregating data, what exactly does he
> >>>>>> mean?
> >>>>>> When dealing with HBase, you really don't want to use a reducer.
> >>>>>>
> >>>>>> You may want to run two map jobs and it could be that just dumping
> >>>>>> the output via JDBC makes the most sense.
> >>>>>>
> >>>>>> We are starting to see a lot of questions where the OP isn't
> >>>>>> providing enough information, so the recommendation could be wrong...
> >>>>>>
> >>>>>> Sent from a remote device. Please excuse any typos...
> >>>>>>
> >>>>>> Mike Segel
> >>>>>>
> >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <sonalgoyal4@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> There is a DBOutputFormat class in the
> >>>>>>> org.apache.hadoop.mapreduce.lib.db package; you could use that.
> >>>>>>> Or you could write to HDFS and then use something like HIHO[1]
> >>>>>>> to export to the db. I have been working extensively in this
> >>>>>>> area; you can write to me directly if you need any help.
> >>>>>>>
> >>>>>>> 1. https://github.com/sonalgoyal/hiho
> >>>>>>>
> >>>>>>> Best Regards,
> >>>>>>> Sonal
> >>>>>>>
> >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
> >>>>>>> Thomas.Steinmaurer@scch.at> wrote:
> >>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I am writing an MR job to process HBase data and store aggregated
> >>>>>>>> data in Oracle. How would you do that in an MR job?
> >>>>>>>>
> >>>>>>>> Currently, for test purposes, we write the result into an HBase
> >>>>>>>> table again by using a TableReducer. Is there something like an
> >>>>>>>> OracleReducer, RelationalReducer, JDBCReducer or whatever? Or
> >>>>>>>> should one simply use plain JDBC code in the reduce step?
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>> Thomas
> >>>>>>>>

--_5bfff7ae-3469-4fb1-bb2e-d47a25bc4c46_--