hadoop-general mailing list archives

From "Kilbride, James P." <James.Kilbr...@gd-ais.com>
Subject RE: MapReduce HBASE examples
Date Tue, 06 Jul 2010 17:50:55 GMT
I'm assuming the rows being pulled back are a small subset of the full row set of the entire
database, say the 10-out-of-2B case. But each row has a column family whose 'columns' are
actually rowIds of other rows in the database (basically my one-to-many relationship mapping).
I'm not trying to use MR for the initial get of 10 rows, but rather for the fact that each of
those 10 initial rows generates potentially hundreds or thousands of other calls.
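
To make that concrete, here is a rough sketch of what following those rowId links looks like
for a single starting row, using the plain HBase client API with no MR involved. The table
name ('mytable') and the link column family ('links') are placeholders standing in for my
schema, so treat this as an illustration rather than working code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class FollowLinks {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");      // placeholder table name
    byte[] linkFamily = Bytes.toBytes("links");      // family whose qualifiers are rowIds

    // Fetch one of the ~10 starting rows by its key (passed on the command line).
    Result start = table.get(new Get(Bytes.toBytes(args[0])));

    // Every qualifier in the link family is itself a rowId, so each one turns into
    // another Get; for a real row this could be hundreds or thousands of extra calls,
    // which is the part I want to parallelize.
    NavigableMap<byte[], byte[]> links = start.getFamilyMap(linkFamily);
    List<Result> related = new ArrayList<Result>();
    if (links != null) {
      for (byte[] linkedRowId : links.keySet()) {
        related.add(table.get(new Get(linkedRowId)));
      }
    }
    System.out.println("fetched " + related.size() + " linked rows");
    table.close();
  }
}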

I am trying to do this for a real-time user request, but I expect the total processing to
take some time, so it's more of a user-initiated call. There may also be dozens of users
making the request at any given time, so I want to farm this out into the MR world so that
multiple instances of the job can be running (with completely different starting rows) at
any given time.

I could do this with a serial local process, but I explicitly want some of my processing,
which could take some time, to happen out in the MapReduce world to take advantage of spare
cycles elsewhere as well as potential data locality. The fact that it is a parallelizable
problem also seems to imply that M/R would be a logical way to do it.

James Kilbride

-----Original Message-----
From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-Daniel Cryans
Sent: Tuesday, July 06, 2010 1:12 PM
To: general@hadoop.apache.org
Subject: Re: MapReduce HBASE examples

That won't be very efficient either... are you trying to do this for a real-time
user request? If so, it really isn't the way you want to go.

If you are in a batch-processing situation, I'd say it depends on how many rows you have
vs. how many you need to retrieve, e.g. scanning 2B rows only to find 10 rows really doesn't
make sense. How do you determine which users you need to process? How big is your dataset?
I understand that you wish to use the MR-provided functionality of grouping and such, but
simply issuing a bunch of Gets in parallel may just be easier to write and maintain.
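
For illustration, a minimal sketch of the "bunch of Gets in parallel" idea outside of
MapReduce; the table name, the list of row ids, and the thread count are placeholders,
and the exact client API can differ between HBase versions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

public class ParallelGets {
  public static List<Result> fetch(final Configuration conf, final String tableName,
                                   List<byte[]> rowIds, int threads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<List<Result>>> futures = new ArrayList<Future<List<Result>>>();

    // Hand each worker its own chunk of row ids and its own HTable,
    // since HTable instances are not thread-safe.
    int chunkSize = (rowIds.size() + threads - 1) / threads;
    for (int i = 0; i < rowIds.size(); i += chunkSize) {
      final List<byte[]> chunk = rowIds.subList(i, Math.min(i + chunkSize, rowIds.size()));
      futures.add(pool.submit(new Callable<List<Result>>() {
        public List<Result> call() throws IOException {
          HTable table = new HTable(conf, tableName);
          try {
            List<Result> out = new ArrayList<Result>();
            for (byte[] rowId : chunk) {
              out.add(table.get(new Get(rowId)));
            }
            return out;
          } finally {
            table.close();
          }
        }
      }));
    }

    List<Result> results = new ArrayList<Result>();
    for (Future<List<Result>> f : futures) {
      results.addAll(f.get());
    }
    pool.shutdown();
    return results;
  }
}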

J-D

On Tue, Jul 6, 2010 at 10:02 AM, Kilbride, James P. <
James.Kilbride@gd-ais.com> wrote:

> So, if that's the case, and your argument makes sense given how scan versus get
> works, I'd have to write a custom InputFormat class that looks like the
> TableInputFormat class, but uses a get (or series of gets) rather than a scan
> object, as the current table mapper does?
>
> James Kilbride
>
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> Jean-Daniel Cryans
> Sent: Tuesday, July 06, 2010 12:53 PM
> To: general@hadoop.apache.org
> Subject: Re: MapReduce HBASE examples
>
> >
> >
> > Does this make any sense?
> >
> >
> Not in a MapReduce context; what you want to do is a LIKE with a bunch of
> values, right? Since a mapper will always read all the input it's given
> (minus some filters, like you can do with HBase), whatever you do will always
> end up being a full table scan. You "could" solve your problem by configuring
> your Scan object with a RowFilter that knows about the names you are looking
> for, but that still ends up being a full scan on the region server side, so
> it will be slow and will generate a lot of IO.
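>
> For example, a rough sketch (untested, with placeholder row keys) of attaching such
> a filter to a Scan could look like this:
>
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.filter.BinaryComparator;
> import org.apache.hadoop.hbase.filter.CompareFilter;
> import org.apache.hadoop.hbase.filter.FilterList;
> import org.apache.hadoop.hbase.filter.RowFilter;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class FilteredScanExample {
>   // Build a Scan that only returns rows whose key matches one of the wanted keys.
>   // The region servers still read every row; the filter only keeps unwanted rows
>   // from being shipped back to the client.
>   public static Scan buildScan(String[] wantedRowKeys) {
>     FilterList anyOf = new FilterList(FilterList.Operator.MUST_PASS_ONE);
>     for (String key : wantedRowKeys) {
>       anyOf.addFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
>           new BinaryComparator(Bytes.toBytes(key))));
>     }
>     Scan scan = new Scan();
>     scan.setFilter(anyOf);
>     return scan;
>   }
> }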
>
> WRT examples, HBase ships with a couple of utility classes that can also be
> used as examples. The Export class has the Scan configuration stuff:
>
> http://github.com/apache/hbase/blob/0.20/src/java/org/apache/hadoop/hbase/mapreduce/Export.java
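>
> In the same spirit, a minimal sketch of handing a configured Scan to a table-mapper
> job, the way Export does; the table name is a placeholder and IdentityTableMapper
> simply passes rows through:
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
> import org.apache.hadoop.hbase.mapreduce.IdentityTableMapper;
> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
>
> public class ScanJobExample {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     Job job = new Job(conf, "filtered-scan");
>     job.setJarByClass(ScanJobExample.class);
>
>     Scan scan = new Scan();
>     scan.setCaching(500);   // scanner caching: fetch rows in batches, not one RPC per row
>     // scan.setFilter(...) could carry the RowFilter list from the earlier sketch
>
>     TableMapReduceUtil.initTableMapperJob("mytable", scan, IdentityTableMapper.class,
>         ImmutableBytesWritable.class, Result.class, job);
>     job.setNumReduceTasks(0);
>     job.setOutputFormatClass(NullOutputFormat.class);
>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>   }
> }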
>
> J-D
>
