hbase-user mailing list archives

From: Chris Tarnas <...@email.com>
Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
Date: Fri, 16 Sep 2011 22:34:00 GMT
But - if I may follow up on myself - I'll definitely keep my eyes open for times when
we really don't need a reducer. I can see what you are saying, and that people should think
a bit more laterally and use HBase for different and potentially more efficient workflows.

-chris

On Sep 16, 2011, at 2:54 PM, Chris Tarnas wrote:

> Hi Mike,
> 
> It's analysis* and aggregation, not just aggregation, so it's a bit more complex. Each
> row in the input generates at least one new row of data when we are done.
> 
> For our data sizes (~1 billion 2-3 KB rows per job now, and growing) we originally did
> normal inserts, but then we switched to bulk imports - it was much faster and a lot less
> stress on the regionservers. Bulk importing uses a reducer, so even if we went through and
> changed our M/R pipelines to use a temp table for organized intermediate data, the most
> efficient way to populate the temp table would be via the bulk loader - using a reducer anyway.
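> 
> A rough sketch of that bulk-load flow, assuming the HFileOutputFormat /
> LoadIncrementalHFiles APIs of that era (table name, paths, and mapper class are
> illustrative; the mapper is expected to emit ImmutableBytesWritable/KeyValue pairs,
> and imports from org.apache.hadoop.hbase.mapreduce are omitted):
> 
>   // configureIncrementalLoad() wires in the TotalOrderPartitioner and a reducer
>   // that sorts KeyValues per region; the resulting HFiles are then bulk-loaded.
>   Configuration conf = HBaseConfiguration.create();
>   Job job = new Job(conf, "prepare-hfiles");
>   job.setJarByClass(BulkLoadDriver.class);            // illustrative driver class
>   job.setMapperClass(KeyValueEmittingMapper.class);   // illustrative mapper
>   HTable table = new HTable(conf, "results");         // illustrative target table
>   HFileOutputFormat.configureIncrementalLoad(job, table);
>   FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));
>   if (job.waitForCompletion(true)) {
>       // hand the generated HFiles over to the regionservers
>       new LoadIncrementalHFiles(conf).doBulkLoad(new Path("/tmp/hfiles"), table);
>   }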
> 
> -chris
> 
> * Sorry to be vague, but for business reasons I can't talk too much about the analysis
> details.
> 
> 
> On Sep 16, 2011, at 1:11 PM, Michael Segel wrote:
> 
>> 
>> Chris,
>> 
>> I don't know what sort of aggregation you are doing, but again, why not write to
>> a temp table instead of using a reducer?
>> 
>> 
>> 
>> 
>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>> From: cft@email.com
>>> Date: Fri, 16 Sep 2011 11:58:05 -0700
>>> To: user@hbase.apache.org
>>> 
>>> If only I could make NY in Nov :)
>>> 
>>> We extract large numbers of DNA sequence reads from HBase, run them through
>>> M/R pipelines to analyze and aggregate, and then load the results back in. Definitely
>>> specialized usage, but I could see other perfectly valid uses for reducers with HBase.
>>> 
>>> -chris
>>> 
>>> On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>> 
>>>> 
>>>> Sonal,
>>>> 
>>>> You do realize that HBase is a "database", right? ;-)
>>>> 
>>>> So again, why do you need a reducer?  ;-)
>>>> 
>>>> Using your example...
>>>> "Again, there will be many cases where one may want a reducer, say trying
to count the occurrence of words in a particular column."
>>>> 
>>>> You can do this one of two ways...
>>>> 1) Dynamic Counters in Hadoop.
>>>> 2) Use a temp table and auto-increment the value in a column which contains
>>>> the word count. (Fat row where rowkey is doc_id and column is word, or rowkey is doc_id|word.)
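>>>> 
>>>> A minimal sketch of option 2, assuming a temp table named "doc_words" with a
>>>> "counts" family (all names illustrative; conf, docId and word come from the mapper):
>>>> 
>>>>   // Atomic server-side increment -- no shuffle, no reducer.
>>>>   HTable wordCounts = new HTable(conf, "doc_words");
>>>>   byte[] rowKey = Bytes.toBytes(docId + "|" + word);   // rowkey is doc_id|word
>>>>   wordCounts.incrementColumnValue(rowKey,
>>>>       Bytes.toBytes("counts"), Bytes.toBytes("n"), 1L);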
>>>> 
>>>> I'm sorry, but if you go through all of your examples of why you would want
>>>> to use a reducer, you end up finding that writing to an HBase table would be
>>>> faster than a reduce job.
>>>> (Again, we haven't done an exhaustive search, but in all of the HBase jobs
>>>> we've run... no reducers were necessary.)
>>>> 
>>>> The point I'm trying to make is that you want to avoid using a reducer whenever
>>>> possible, and if you think about your problem... you can probably come up with
>>>> a solution that avoids the reducer...
>>>> 
>>>> 
>>>> HTH
>>>> 
>>>> -Mike
>>>> PS. I haven't looked at *all* of the potential use cases of HBase, which is
>>>> why I don't want to say you'll never need a reducer. I will say that, based on
>>>> what we've done at my client's site, we try very hard to avoid reducers.
>>>> [Note: I'm sure I'm going to get hammered on this when I head to NY in Nov. :-)]
>>>> 
>>>> 
>>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>> From: sonalgoyal4@gmail.com
>>>>> To: user@hbase.apache.org
>>>>> 
>>>>> Hi Michael,
>>>>> 
>>>>> Yes, thanks, I understand that reducers can be expensive with all
>>>>> the shuffling and sorting, and that you may not always need them. At the
>>>>> same time, there are many cases where reducers are useful, like secondary
>>>>> sorting. In many cases, one can have multiple map phases and not have a
>>>>> reduce phase at all. Again, there will be many cases where one may want a
>>>>> reducer, say trying to count the occurrence of words in a particular column.
>>>>> 
>>>>> 
>>>>> With this thought chain, I do not feel ready to say that when dealing with
>>>>> HBase, I really don't want to use a reducer. Please correct me if I am
>>>>> wrong.
>>>>> 
>>>>> Thanks again.
>>>>> 
>>>>> Best Regards,
>>>>> Sonal
>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>> Nube Technologies <http://www.nubetech.co>
>>>>> 
>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel
>>>>> <michael_segel@hotmail.com> wrote:
>>>>> 
>>>>>> 
>>>>>> Sonal,
>>>>>> 
>>>>>> Just because you have an m/r job doesn't mean that you need to reduce
>>>>>> anything. You can have a job that contains only a mapper.
>>>>>> Or your job runner can have a series of map jobs in serial.
>>>>>> 
>>>>>> Most, if not all, of the map/reduce jobs where we pull data from HBase
>>>>>> don't require a reducer.
>>>>>> 
>>>>>> To give you a simple example... if I want to determine the table schema
>>>>>> where I am storing some sort of structured data, I just write an m/r job
>>>>>> which opens a table and scans it, counting the occurrence of each column
>>>>>> name via dynamic counters.
>>>>>> 
>>>>>> There is no need for a reducer.
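>>>>>> 
>>>>>> For illustration, a minimal sketch of such a map-only job (class, table and
>>>>>> counter-group names are made up):
>>>>>> 
>>>>>>   import org.apache.hadoop.hbase.KeyValue;
>>>>>>   import org.apache.hadoop.hbase.client.Result;
>>>>>>   import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>>>>>>   import org.apache.hadoop.hbase.mapreduce.TableMapper;
>>>>>>   import org.apache.hadoop.hbase.util.Bytes;
>>>>>> 
>>>>>>   public class ColumnCounterMapper
>>>>>>       extends TableMapper<ImmutableBytesWritable, Result> {
>>>>>>     @Override
>>>>>>     protected void map(ImmutableBytesWritable row, Result values, Context context) {
>>>>>>       for (KeyValue kv : values.raw()) {
>>>>>>         // one dynamic counter per column qualifier seen in the table
>>>>>>         context.getCounter("columns", Bytes.toString(kv.getQualifier())).increment(1);
>>>>>>       }
>>>>>>     }
>>>>>>   }
>>>>>> 
>>>>>>   // Driver side: note setNumReduceTasks(0) -- no reducer, no shuffle.
>>>>>>   // TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
>>>>>>   //     ColumnCounterMapper.class, null, null, job);
>>>>>>   // job.setNumReduceTasks(0);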
>>>>>> 
>>>>>> Does that help?
>>>>>> 
>>>>>> 
>>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>>> From: sonalgoyal4@gmail.com
>>>>>>> To: user@hbase.apache.org
>>>>>>> 
>>>>>>> Michel,
>>>>>>> 
>>>>>>> Sorry, can you please help me understand what you mean when you say that
>>>>>>> when dealing with HBase, you really don't want to use a reducer? Here,
>>>>>>> HBase is being used as the input to the MR job.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Sonal
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel <michael_segel@hotmail.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> I think you need to get a little bit more information.
>>>>>>>> Reducers are expensive.
>>>>>>>> When Thomas says that he is aggregating data, what exactly does he mean?
>>>>>>>> When dealing with HBase, you really don't want to use a reducer.
>>>>>>>> 
>>>>>>>> You may want to run two map jobs, and it could be that just dumping the
>>>>>>>> output via JDBC makes the most sense.
>>>>>>>> 
>>>>>>>> We are starting to see a lot of questions where the OP isn't providing
>>>>>>>> enough information, so the recommendation could be wrong...
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>> 
>>>>>>>> Mike Segel
>>>>>>>> 
>>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal <sonalgoyal4@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> There is a DBOutputFormat class in the org.apache.hadoop.mapreduce.lib.db
>>>>>>>>> package, you could use that. Or you could write to HDFS and then use
>>>>>>>>> something like HIHO[1] to export to the db. I have been working extensively
>>>>>>>>> in this area, you can write to me directly if you need any help.
>>>>>>>>> 
>>>>>>>>> 1. https://github.com/sonalgoyal/hiho
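>>>>>>>>> 
>>>>>>>>> A rough sketch of wiring up that DBOutputFormat (driver class, connection
>>>>>>>>> string, table and column names are all placeholders; the reducer's output
>>>>>>>>> value class must implement DBWritable):
>>>>>>>>> 
>>>>>>>>>   import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
>>>>>>>>>   import org.apache.hadoop.mapreduce.lib.db.DBOutputFormat;
>>>>>>>>> 
>>>>>>>>>   DBConfiguration.configureDB(job.getConfiguration(),
>>>>>>>>>       "oracle.jdbc.driver.OracleDriver",        // placeholder driver
>>>>>>>>>       "jdbc:oracle:thin:@dbhost:1521:ORCL",     // placeholder URL
>>>>>>>>>       "user", "password");
>>>>>>>>>   DBOutputFormat.setOutput(job, "agg_results", "row_key", "agg_value");
>>>>>>>>>   job.setOutputFormatClass(DBOutputFormat.class);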
>>>>>>>>> 
>>>>>>>>> Best Regards,
>>>>>>>>> Sonal
>>>>>>>>> Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
>>>>>>>>> Nube Technologies <http://www.nubetech.co>
>>>>>>>>> 
>>>>>>>>> <http://in.linkedin.com/in/sonalgoyal>
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas
>>>>>>>>> <Thomas.Steinmaurer@scch.at> wrote:
>>>>>>>>> 
>>>>>>>>>> Hello,
>>>>>>>>>> 
>>>>>>>>>> We are writing an MR job to process HBase data and store aggregated data in
>>>>>>>>>> Oracle. How would you do that in an MR job?
>>>>>>>>>> 
>>>>>>>>>> Currently, for test purposes, we write the result into an HBase table
>>>>>>>>>> again by using a TableReducer. Is there something like an OracleReducer,
>>>>>>>>>> RelationalReducer, JDBCReducer or whatever? Or should one simply use
>>>>>>>>>> plain JDBC code in the reduce step?
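>>>>>>>>>> 
>>>>>>>>>> For reference, a bare-bones sketch of the TableReducer setup described above
>>>>>>>>>> (key/value types, family and qualifier names are only illustrative; imports
>>>>>>>>>> from org.apache.hadoop.hbase and org.apache.hadoop.io are omitted):
>>>>>>>>>> 
>>>>>>>>>>   public class AggReducer
>>>>>>>>>>       extends TableReducer<Text, LongWritable, ImmutableBytesWritable> {
>>>>>>>>>>     @Override
>>>>>>>>>>     protected void reduce(Text key, Iterable<LongWritable> values, Context context)
>>>>>>>>>>         throws IOException, InterruptedException {
>>>>>>>>>>       long sum = 0;
>>>>>>>>>>       for (LongWritable v : values) {
>>>>>>>>>>         sum += v.get();
>>>>>>>>>>       }
>>>>>>>>>>       // write the aggregate back to HBase as a Put
>>>>>>>>>>       Put put = new Put(Bytes.toBytes(key.toString()));
>>>>>>>>>>       put.add(Bytes.toBytes("agg"), Bytes.toBytes("sum"), Bytes.toBytes(sum));
>>>>>>>>>>       context.write(new ImmutableBytesWritable(put.getRow()), put);
>>>>>>>>>>     }
>>>>>>>>>>   }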
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thanks!
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thomas
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>
>>> 
>>
> 

