From: "Steinmaurer Thomas" <Thomas.Steinmaurer@scch.at>
To: user@hbase.apache.org
Subject: RE: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
Date: Mon, 19 Sep 2011 15:44:21 +0200
Message-ID: <84B5E4309B3B9F4ABFF7664C3CD7698302D0DD6D@kairo.scch.at>
References: <84B5E4309B3B9F4ABFF7664C3CD7698302D0DD69@kairo.scch.at>

Hi Doug,

I know. The re-raised generic IOException is a bit unlucky, because it
could mean that the JDBC driver class can't be found or that preparing
the statement failed.

I now took pretty much the same code as in DBOutputFormat.getRecordWriter
and tried it in my implemented ToolRunner.run method. Loading the JDBC
driver class and preparing the generated statement, based on the table
and field names set by DBOutputFormat.setOutput(...), worked fine there,
so I guess the IOException isn't caused by a missing JDBC library etc.
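In sketch form (the connection URL, credentials, and the table/column
names below are placeholders for our real ones), the check was basically:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Standalone check, mirroring what DBOutputFormat.getRecordWriter does:
// load the JDBC driver, open a connection, and prepare the INSERT that
// corresponds to the table/fields given to DBOutputFormat.setOutput(...).
public class JdbcSanityCheck {
  public static void main(String[] args) throws Exception {
    Class.forName("oracle.jdbc.driver.OracleDriver");
    Connection conn = DriverManager.getConnection(
        "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "password"); // placeholders

    // Same statement shape as setOutput(job, "AGG_RESULT", "ROW_KEY", "TOTAL")
    // would lead to; the table and columns are placeholders.
    String sql = "INSERT INTO AGG_RESULT (ROW_KEY, TOTAL) VALUES (?, ?)";
    PreparedStatement stmt = conn.prepareStatement(sql);
    System.out.println("Prepared OK: " + sql);

    stmt.close();
    conn.close();
  }
}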
Any further ideas?

Btw: I'm using the Cloudera distribution available as a VMware image.

Thanks!

Thomas

-----Original Message-----
From: Doug Meil [mailto:doug.meil@explorysmedical.com]
Sent: Monday, 19 September 2011 15:35
To: user@hbase.apache.org
Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...

Those were all from the 'mapreduce' packages, not 'mapred'. This seems
like it's an issue with DBOutputFormat...

org.apache.hadoop.mapreduce.lib.db.DBOutputFormat.getRecordWriter(DBOutputFormat.java:180)


On 9/19/11 1:41 AM, "Steinmaurer Thomas" wrote:

>Hi Doug,
>
>I looked at your example, and it is pretty much what we have done in our
>proof-of-concept implementation, writing back to another HBase table by
>using a TableReducer. This works fine. We want to change that so that
>the final result is written to Oracle.
>
>When doing that, we end up with the following exception in the reduce
>step (see also my post "MR-Job: Exception in DBOutputFormat"):
>
>java.io.IOException
>    at org.apache.hadoop.mapreduce.lib.db.DBOutputFormat.getRecordWriter(DBOutputFormat.java:180)
>    at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:559)
>    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
>    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
>    at java.security.AccessController.doPrivileged(Native Method)
>    at javax.security.auth.Subject.doAs(Subject.java:396)
>    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
>    at org.apache.hadoop.mapred.Child.main(Child.java:264)
>
>Your examples are very welcome, because they are based on the mapreduce
>package, right? Pretty much all examples out there are based on mapred,
>which is AFAIK the "old" way to write MR jobs.
>
>Regards,
>Thomas
>
>-----Original Message-----
>From: Doug Meil [mailto:doug.meil@explorysmedical.com]
>Sent: Friday, 16 September 2011 21:42
>To: user@hbase.apache.org
>Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>
>Chris, agreed... There are times when reducers aren't required, and
>situations where they are useful. We have both kinds of jobs.
>
>For others following the thread, I updated the book recently with more
>MR examples (read-only, read-write, read-summary):
>
>http://hbase.apache.org/book.html#mapreduce.example
>
>As to the question that started this thread...
>
>re: "Store aggregated data in Oracle."
>
>To me, that sounds like the "read-summary" example with JDBC-Oracle in
>the reduce step.
>
>On 9/16/11 2:58 PM, "Chris Tarnas" wrote:
>
>>If only I could make NY in Nov :)
>>
>>We extract large numbers of DNA sequence reads from HBase, run them
>>through M/R pipelines to analyze and aggregate, and then we load the
>>results back in. Definitely specialized usage, but I could see other
>>perfectly valid uses for reducers with HBase.
>>
>>-chris
>>
>>On Sep 16, 2011, at 11:43 AM, Michael Segel wrote:
>>
>>> Sonal,
>>>
>>> You do realize that HBase is a "database", right? ;-)
>>>
>>> So again, why do you need a reducer? ;-)
>>>
>>> Using your example...
>>> "Again, there will be many cases where one may want a reducer, say
>>> trying to count the occurrence of words in a particular column."
>>>
>>> You can do this one of two ways...
>>> 1) Dynamic counters in Hadoop.
>>> 2) Use a temp table and auto-increment the value in a column which
>>> contains the word count. (Fat row where the rowkey is doc_id and the
>>> column is the word, or the rowkey is doc_id|word.)
>>>
>>> I'm sorry, but if you go through all of your examples of why you would
>>> want to use a reducer, you end up finding that writing to an HBase
>>> table would be faster than a reduce job.
>>> (Again, we haven't done an exhaustive search, but in all of the HBase
>>> jobs we've run... no reducers were necessary.)
>>>
>>> The point I'm trying to make is that you want to avoid using a reducer
>>> whenever possible, and if you think about your problem... you can
>>> probably come up with a solution that avoids the reducer...
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> PS. I haven't looked at *all* of the potential use cases of HBase,
>>> which is why I don't want to say you'll never need a reducer. I will
>>> say that, based on what we've done at my client's site, we try very
>>> hard to avoid reducers.
>>> [Note, I'm sure I'm going to get hammered on this when I head to NY in
>>> Nov. :-) ]
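As a concrete illustration, option (2) above might look roughly like the
following map-only sketch; the "wordcount" temp table, the "doc:text"
source column and the "wc" counter family are made-up names:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

// Map-only word count: for every word found in the doc:text column,
// bump a counter cell in a "wordcount" temp table instead of reducing.
public class WordCountIncrementMapper
    extends TableMapper<NullWritable, NullWritable> {

  private HTable countTable;

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = HBaseConfiguration.create(context.getConfiguration());
    countTable = new HTable(conf, "wordcount"); // made-up temp table name
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    byte[] text = value.getValue(Bytes.toBytes("doc"), Bytes.toBytes("text"));
    if (text == null) {
      return;
    }
    // Fat row: rowkey is the doc id, one counter column per word.
    for (String word : Bytes.toString(text).split("\\s+")) {
      countTable.incrementColumnValue(
          row.get(), Bytes.toBytes("wc"), Bytes.toBytes(word), 1L);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    countTable.close();
  }
}

The driver would simply use TableMapReduceUtil.initTableMapperJob(...)
over the source table and set the number of reduce tasks to 0.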
>>>
>>>> Date: Fri, 16 Sep 2011 23:00:49 +0530
>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>> From: sonalgoyal4@gmail.com
>>>> To: user@hbase.apache.org
>>>>
>>>> Hi Michael,
>>>>
>>>> Yes, thanks, I understand that reducers can be expensive, with all
>>>> the shuffling and the sorting, and that you may not always need them.
>>>> At the same time, there are many cases where reducers are useful,
>>>> like secondary sorting. In many cases, one can have multiple map
>>>> phases and no reduce phase at all. Again, there will be many cases
>>>> where one may want a reducer, say trying to count the occurrence of
>>>> words in a particular column.
>>>>
>>>> With this thought chain, I do not feel ready to say that when dealing
>>>> with HBase, I really don't want to use a reducer. Please correct me
>>>> if I am wrong.
>>>>
>>>> Thanks again.
>>>>
>>>> Best Regards,
>>>> Sonal
>>>> Crux: Reporting for HBase
>>>> Nube Technologies
>>>>
>>>> On Fri, Sep 16, 2011 at 10:35 PM, Michael Segel wrote:
>>>>
>>>>> Sonal,
>>>>>
>>>>> Just because you have a m/r job doesn't mean that you need to reduce
>>>>> anything. You can have a job that contains only a mapper, or your
>>>>> job runner can have a series of map jobs in serial.
>>>>>
>>>>> Most, if not all, of the map/reduce jobs where we pull data from
>>>>> HBase don't require a reducer.
>>>>>
>>>>> To give you a simple example... if I want to determine the table
>>>>> schema where I am storing some sort of structured data, I just write
>>>>> a m/r job which opens a table and scans it, counting the occurrence
>>>>> of each column name via dynamic counters.
>>>>>
>>>>> There is no need for a reducer.
>>>>>
>>>>> Does that help?
>>>>>
>>>>>> Date: Fri, 16 Sep 2011 21:41:01 +0530
>>>>>> Subject: Re: Writing MR-Job: Something like OracleReducer, JDBCReducer ...
>>>>>> From: sonalgoyal4@gmail.com
>>>>>> To: user@hbase.apache.org
>>>>>>
>>>>>> Michel,
>>>>>>
>>>>>> Sorry, can you please help me understand what you mean when you say
>>>>>> that when dealing with HBase, you really don't want to use a
>>>>>> reducer? Here, HBase is being used as the input to the MR job.
>>>>>>
>>>>>> Thanks
>>>>>> Sonal
>>>>>>
>>>>>> On Fri, Sep 16, 2011 at 2:35 PM, Michel Segel wrote:
>>>>>>
>>>>>>> I think you need to get a little bit more information.
>>>>>>> Reducers are expensive.
>>>>>>> When Thomas says that he is aggregating data, what exactly does he
>>>>>>> mean?
>>>>>>> When dealing with HBase, you really don't want to use a reducer.
>>>>>>>
>>>>>>> You may want to run two map jobs, and it could be that just
>>>>>>> dumping the output via JDBC makes the most sense.
>>>>>>>
>>>>>>> We are starting to see a lot of questions where the OP isn't
>>>>>>> providing enough information, so the recommendation could be
>>>>>>> wrong...
>>>>>>>
>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>
>>>>>>> Mike Segel
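The "count the occurrence of each column name via dynamic counters" job
Michael describes above is about as small as an HBase MR job gets; a
rough sketch (the "columns" counter group name is arbitrary):

import java.io.IOException;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

// Map-only schema survey: one dynamic counter per family:qualifier seen.
// The totals are read from the job counters afterwards; no reducer needed.
public class ColumnNameCounterMapper
    extends TableMapper<NullWritable, NullWritable> {

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    for (KeyValue kv : value.raw()) {
      String column = Bytes.toString(kv.getFamily()) + ":"
          + Bytes.toString(kv.getQualifier());
      context.getCounter("columns", column).increment(1);
    }
  }
}

Wire it up with TableMapReduceUtil.initTableMapperJob over the table, set
job.setNumReduceTasks(0), and read the per-column totals from the job
counters when the job completes.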
>>>>>>>
>>>>>>> On Sep 16, 2011, at 2:22 AM, Sonal Goyal wrote:
>>>>>>>
>>>>>>>> There is a DBOutputFormat class in the
>>>>>>>> org.apache.hadoop.mapreduce.lib.db package, you could use that.
>>>>>>>> Or you could write to HDFS and then use something like HIHO [1]
>>>>>>>> to export to the db. I have been working extensively in this
>>>>>>>> area; you can write to me directly if you need any help.
>>>>>>>>
>>>>>>>> 1. https://github.com/sonalgoyal/hiho
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Sonal
>>>>>>>> Crux: Reporting for HBase
>>>>>>>> Nube Technologies
>>>>>>>>
>>>>>>>> On Fri, Sep 16, 2011 at 10:55 AM, Steinmaurer Thomas <
>>>>>>>> Thomas.Steinmaurer@scch.at> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> we are writing an MR job to process HBase data and store
>>>>>>>>> aggregated data in Oracle. How would you do that in an MR job?
>>>>>>>>>
>>>>>>>>> Currently, for test purposes, we write the result into an HBase
>>>>>>>>> table again by using a TableReducer. Is there something like an
>>>>>>>>> OracleReducer, RelationalReducer, JDBCReducer or whatever? Or
>>>>>>>>> should one simply use plain JDBC code in the reduce step?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Thomas
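For completeness, the "plain JDBC code in the reduce step" approach that
the thread converges on might look roughly like the sketch below; the
Oracle driver/connection details and the AGG_RESULT table with its
ROW_KEY/TOTAL columns are placeholders, not anyone's actual schema:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the values per key and inserts one aggregated row per key into
// Oracle via plain JDBC -- no special "OracleReducer" is required.
public class JdbcSummaryReducer
    extends Reducer<Text, LongWritable, NullWritable, NullWritable> {

  private Connection conn;
  private PreparedStatement insert;

  @Override
  protected void setup(Context context) throws IOException {
    try {
      Class.forName("oracle.jdbc.driver.OracleDriver");
      conn = DriverManager.getConnection(
          "jdbc:oracle:thin:@dbhost:1521:orcl", "user", "password"); // placeholders
      insert = conn.prepareStatement(
          "INSERT INTO AGG_RESULT (ROW_KEY, TOTAL) VALUES (?, ?)");  // placeholder table
    } catch (Exception e) {
      throw new IOException("Could not set up the JDBC connection", e);
    }
  }

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException {
    long total = 0;
    for (LongWritable v : values) {
      total += v.get();
    }
    try {
      insert.setString(1, key.toString());
      insert.setLong(2, total);
      insert.executeUpdate();
    } catch (SQLException e) {
      throw new IOException("Insert failed for key " + key, e);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    try {
      if (insert != null) insert.close();
      if (conn != null) conn.close();
    } catch (SQLException e) {
      throw new IOException(e);
    }
  }
}

A TableMapper would feed this reducer the per-key values; DBOutputFormat
is not needed in this variant, which sidesteps the
DBOutputFormat.getRecordWriter path that raised the IOException discussed
at the top of the thread.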