hbase-user mailing list archives

From Michael Segel <michael_se...@hotmail.com>
Subject RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
Date Fri, 16 Sep 2011 20:24:36 GMT

Doug and company...

Look, I'm not saying that there aren't m/r jobs where you might need reducers when working
with HBase. What I am saying is that if we look at what you're attempting to do, you may end
up getting better performance if you create a temp table in HBase and let HBase do some of
the heavy lifting where you are currently using a reducer. From the jobs that we run, when
we looked at what we were doing, there wasn't any need for a reducer. I suspect that's true
of other jobs as well.

Remember that HBase is much more than just an HFile format to persist stuff.

Even looking at Sonal's example... you have other ways of doing the record counts, like dynamic
counters or a temp table in HBase, which I believe will give you better performance numbers,
although I haven't benchmarked either against a reducer.
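
For what it's worth, here's a rough, map-only sketch of the dynamic counter idea (the table
name, counter names and scan settings are placeholders, not taken from any of the jobs
discussed in this thread):

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    // Map-only job: every row bumps a dynamic counter; no reducer, no shuffle.
    public class RecordCountMapper extends TableMapper<NullWritable, NullWritable> {

      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // "records"/"rows" is a made-up counter group/name for the example
        context.getCounter("records", "rows").increment(1);
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(HBaseConfiguration.create());
        job.setJarByClass(RecordCountMapper.class);
        Scan scan = new Scan();
        scan.setCaching(500);        // larger scanner caching for MR scans
        scan.setCacheBlocks(false);  // don't churn the block cache
        TableMapReduceUtil.initTableMapperJob("myTable", scan,
            RecordCountMapper.class, NullWritable.class, NullWritable.class, job);
        job.setNumReduceTasks(0);    // the whole point: no reduce phase
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The totals show up in the job's counters when it completes, so there's nothing to shuffle or
sort.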

Does that make sense?

-Mike


> From: doug.meil@explorysmedical.com
> To: user@hbase.apache.org
> Date: Fri, 16 Sep 2011 15:41:44 -0400
> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
> 
> 
> Chris, agreed... There are situations where reducers aren't required, and
> situations where they are useful.  We have both kinds of jobs.
> 
> For others following the thread, I updated the book recently with more MR
> examples (read-only, read-write, read-summary)
> 
> http://hbase.apache.org/book.html#mapreduce.example
> 
> 
> As to the question that started this thread...
> 
> 
> re:  "Store aggregated data in Oracle. "
> 
> To me, that sounds like the "read-summary" example with JDBC-Oracle in
> the reduce step.
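> 
> For anyone following along, a minimal sketch of what "plain JDBC in the reduce
> step" could look like (the JDBC URL, credentials and the summary table/columns
> are invented for the example; the driver wiring and batching are left out):
> 
>     import java.io.IOException;
>     import java.sql.Connection;
>     import java.sql.DriverManager;
>     import java.sql.PreparedStatement;
>     import java.sql.SQLException;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.NullWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Reducer;
> 
>     // Sums the mapper output per key and writes one row per key to Oracle.
>     public class OracleSummaryReducer
>         extends Reducer<Text, LongWritable, NullWritable, NullWritable> {
> 
>       private Connection conn;
>       private PreparedStatement stmt;
> 
>       @Override
>       protected void setup(Context context) throws IOException {
>         try {
>           // one connection per reduce task -- keep the reducer count small
>           conn = DriverManager.getConnection(
>               "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
>           stmt = conn.prepareStatement(
>               "INSERT INTO summary (row_key, total) VALUES (?, ?)");
>         } catch (SQLException e) {
>           throw new IOException(e);
>         }
>       }
> 
>       @Override
>       protected void reduce(Text key, Iterable<LongWritable> values, Context context)
>           throws IOException, InterruptedException {
>         long sum = 0;
>         for (LongWritable v : values) {
>           sum += v.get();
>         }
>         try {
>           stmt.setString(1, key.toString());
>           stmt.setLong(2, sum);
>           stmt.executeUpdate();   // batched inserts would be better in practice
>         } catch (SQLException e) {
>           throw new IOException(e);
>         }
>       }
> 
>       @Override
>       protected void cleanup(Context context) throws IOException {
>         try {
>           stmt.close();
>           conn.close();
>         } catch (SQLException e) {
>           throw new IOException(e);
>         }
>       }
>     }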
> 
> 
> 
> 
> 
> On 9/16/11 2:58 PM, "Chris Tarnas" <cft@email.com> wrote:
> 
> >If only I could make NY in Nov :)
> >
> >We extract out large numbers of DNA sequence reads from HBase, run them
> >through M/R pipelines to analyze and aggregate and then we load the
> >results back in. Definitely specialized usage, but I could see other
> >perfectly valid uses for reducers with HBase.
> >
> >-chris
> > 
> >On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
> >
> >> 
> >> Sonal,
> >> 
> >> You do realize that HBase is a "database", right? ;-)
> >> 
> >> So again, why do you need a reducer?  ;-)
> >> 
> >> Using your example...
> >> "Again, there will be many cases where one may want a reducer, say
> >>trying to count the occurrence of words in a particular column."
> >> 
> >> You can do this one of two ways...
> >> 1) Dynamic Counters in Hadoop.
> >> 2) Use a temp table and auto increment the value in a column which
> >>contains the word count.  (Fat row where rowkey is doc_id and column is
> >>word or rowkey is doc_id|word)
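> >> 
> >> For option 2, a minimal sketch (the family/qualifier names are made up, and
> >> tempTable is assumed to be an HTable the mapper opened in setup()):
> >> 
> >>     // Inside the mapper: bump a per-word cell in the temp table instead of
> >>     // shuffling word counts to a reducer.
> >>     Increment inc = new Increment(Bytes.toBytes(docId + "|" + word));
> >>     inc.addColumn(Bytes.toBytes("counts"), Bytes.toBytes("n"), 1L);
> >>     tempTable.increment(inc);
> >> 
> >>     // or, for a single cell:
> >>     // tempTable.incrementColumnValue(rowKey, Bytes.toBytes("counts"),
> >>     //     Bytes.toBytes("n"), 1L);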
> >> 
> >> I'm sorry but if you go through all of your examples of why you would
> >>want to use a reducer, you end up finding out that writing to an HBase
> >>table would be faster than a reduce job.
> >> (Again we haven't done an exhaustive search, but in all of the HBase
> >>jobs we've run... no reducers were necessary.)
> >> 
> >> The point I'm trying to make is that you want to avoid using a reducer
> >>whenever possible and if you think about your problem... you can
> >>probably come up with a solution that avoids the reducer...
> >> 
> >> 
> >> HTH
> >> 
> >> -Mike
> >> PS. I haven't looked at *all* of the potential use cases of HBase which
> >>is why I don't want to say you'll never need a reducer. I will say that
> >>based on what we've done at my client's site, we try very hard to avoid
> >>reducers.
> >> [Note, I'm sure I'm going to get hammered on this when I head to NY in
> >>Nov. :-)   ]
> >> 
> >> 
> >>> Date: Fri, 16 Sep 2011 23:00:49 +0530
> >>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer
> >>>...
> >>> From: sonalgoyal4@gmail.com
> >>> To: user@hbase.apache.org
> >>> 
> >>> Hi Michael,
> >>> 
> >>> Yes, thanks, I understand that reducers can be expensive, with all the
> >>> shuffling and sorting, and that you may not always need them. At the
> >>> same time, there are many cases where reducers are useful, like
> >>> secondary sorting. In many cases, one can have multiple map phases and
> >>> not have a reduce phase at all. Again, there will be many cases where
> >>> one may want a reducer, say trying to count the occurrences of words in
> >>> a particular column.
> >>> 
> >>> 
> >>> With this thought chain, I do not feel ready to say that when dealing
> >>> with HBase, I really don't want to use a reducer. Please correct me if
> >>> I am wrong.
> >>> 
> >>> Thanks again.
> >>> 
> >>> Best Regards,
> >>> Sonal
> >>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> >>> Nube Technologies <http://www.nubetech.co>
> >>> 
> >>> <http://in.linkedin.com/in/sonalgoyal>
> >>> 
> >>> 
> >>> 
> >>> 
> >>> 
> >>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
> >>> <michael_segel@hotmail.com>wrote:
> >>> 
> >>>> 
> >>>> Sonal,
> >>>> 
> >>>> Just because you have an m/r job doesn't mean that you need to reduce
> >>>> anything. You can have a job that contains only a mapper.
> >>>> Or your job runner can run a series of map jobs in serial.
> >>>> 
> >>>> Most, if not all, of the map/reduce jobs where we pull data from HBase
> >>>> don't require a reducer.
> >>>> 
> >>>> To give you a simple example... if I want to determine the table
> >>>> schema where I am storing some sort of structured data, I just write
> >>>> an m/r job which opens a table and scans it, counting the occurrence
> >>>> of each column name via dynamic counters.
> >>>> 
> >>>> There is no need for a reducer.
> >>>> 
> >>>> Does that help?
> >>>> 
> >>>> 
> >>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
> >>>>> Subject: Re: Writing MR-Job: Something like OracleReducer,
> >>>>>JDBCReducer
> >>>> ...
> >>>>> From: sonalgoyal4@gmail.com
> >>>>> To: user@hbase.apache.org
> >>>>> 
> >>>>> Michel,
> >>>>> 
> >>>>> Sorry, can you please help me understand what you mean when you say
> >>>>> that when dealing with HBase, you really don't want to use a reducer?
> >>>>> Here, HBase is being used as the input to the MR job.
> >>>>> 
> >>>>> Thanks
> >>>>> Sonal
> >>>>> 
> >>>>> 
> >>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel
> >>>>><michael_segel@hotmail.com
> >>>>> wrote:
> >>>>> 
> >>>>>> I think you need to get a little bit more information.
> >>>>>> Reducers are expensive.
> >>>>>> When Thomas says that he is aggregating data, what exactly does he
> >>>>>> mean?
> >>>>>> When dealing with HBase, you really don't want to use a reducer.
> >>>>>> 
> >>>>>> You may want to run two map jobs, and it could be that just dumping
> >>>>>> the output via JDBC makes the most sense.
> >>>>>> 
> >>>>>> We are starting to see a lot of questions where the OP isn't
> >>>>>> providing enough information, so the recommendation could be wrong...
> >>>>>> 
> >>>>>> 
> >>>>>> Sent from a remote device. Please excuse any typos...
> >>>>>> 
> >>>>>> Mike Segel
> >>>>>> 
> >>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <sonalgoyal4@gmail.com>
> >>>> wrote:
> >>>>>> 
> >>>>>>> There is a DBOutputFormat class in the
> >>>>>>> org.apache.hadoop.mapreduce.lib.db package, you could use that. Or
> >>>>>>> you could write to HDFS and then use something like HIHO[1] to
> >>>>>>> export to the db. I have been working extensively in this area;
> >>>>>>> you can write to me directly if you need any help.
> >>>>>>> 
> >>>>>>> 1. https://github.com/sonalgoyal/hiho
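> >>>>>>> 
> >>>>>>> (A rough sketch of the DBOutputFormat wiring, with an invented Oracle
> >>>>>>> URL and table/field names; the reducer's output key class would need
> >>>>>>> to implement DBWritable:)
> >>>>>>> 
> >>>>>>>     // in the job driver
> >>>>>>>     DBConfiguration.configureDB(job.getConfiguration(),
> >>>>>>>         "oracle.jdbc.driver.OracleDriver",
> >>>>>>>         "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "password");
> >>>>>>>     // setOutput also registers DBOutputFormat as the job's output format
> >>>>>>>     DBOutputFormat.setOutput(job, "summary", "row_key", "total");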
> >>>>>>> 
> >>>>>>> Best Regards,
> >>>>>>> Sonal
> >>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
> >>>>>>> Nube Technologies <http://www.nubetech.co>
> >>>>>>> 
> >>>>>>> <http://in.linkedin.com/in/sonalgoyal>
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> 
> >>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
> >>>>>>> Thomas.Steinmaurer@scch.at> wrote:
> >>>>>>> 
> >>>>>>>> Hello,
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> We're writing an MR-Job to process HBase data and store aggregated
> >>>>>>>> data in Oracle. How would you do that in an MR-Job?
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Currently, for test purposes, we write the result into an HBase
> >>>>>>>> table again by using a TableReducer. Is there something like an
> >>>>>>>> OracleReducer, RelationalReducer, JDBCReducer or whatever? Or
> >>>>>>>> should one simply use plain JDBC code in the reduce step?
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Thanks!
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> Thomas
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>>>> 
> >>>>>> 
> >>>> 
> >>>> 
> >>
> >
> 