hadoop-general mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: MapReduce HBASE examples
Date Tue, 06 Jul 2010 21:39:14 GMT
(moving the thread to the HBase user mailing list, on reply please remove
the general@ since this is not a general question)

It is indeed a parallelizable problem that could use a job management
system, but in your case I don't think MR is the right solution. You will
have to do all sorts of weird tweaks, and in the end you won't get much out of
it since you basically want to process a tiny portion of the whole dataset.
You also mention possible data locality, but I don't see that being
a particularly strong argument in what you describe. Yes, you could start
one mapper per region that contains some of the rows you are looking for,
but the cost of starting and managing those JVMs is high compared to just
starting one that does the work (since it can be done easily in a single
process that can be multi-threaded).
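
As a rough illustration of the single multi-threaded process idea, here is a minimal sketch in plain Java. The class name ParallelRowFetcher and the `lookup` function are hypothetical and not part of HBase; `lookup` stands in for whatever performs the real point read (for example, code that issues an HBase Get for one row id). The thread count is an assumption you would tune.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of HBase: runs point lookups on a thread pool.
final class ParallelRowFetcher {

    // Looks up every row id in parallel and returns the results keyed by id,
    // in the order the ids were given. `lookup` is a stand-in for the real read.
    static <R> Map<String, R> fetchAll(List<String> rowIds,
                                       Function<String, R> lookup,
                                       int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per row id.
            List<Callable<R>> tasks = new ArrayList<>();
            for (String id : rowIds) {
                tasks.add(() -> lookup.apply(id));
            }
            // invokeAll blocks until all tasks finish and keeps task order.
            List<Future<R>> futures = pool.invokeAll(tasks);
            Map<String, R> results = new LinkedHashMap<>();
            for (int i = 0; i < rowIds.size(); i++) {
                results.put(rowIds.get(i), futures.get(i).get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while fetching rows", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("row lookup failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

For the one-to-many case described in the quoted message, one could call fetchAll once for the initial rows, collect the row ids stored as column names, and call it again for those ids, all inside one JVM.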

To sum up, using MR on a small dataset basically gives you all the
disadvantages and almost none of the advantages.

Instead you could look into running Gearman (or similar) on those machines
and that would give you exactly what you need IMHO.


On Tue, Jul 6, 2010 at 10:50 AM, Kilbride, James P. <
James.Kilbride@gd-ais.com> wrote:

> I'm assuming the rows being pulled back are smaller than the full row set
> of the entire database. So say the 10 out of 2B case. But, each row has a
> column family whose 'columns' are actually rowIds in the database.
> (basically my one to many relationship mapping). I'm not trying to use MR
> for the initial get of 10 columns, but rather the fact that each of those 10
> initial rows generates potentially hundreds or thousands of other calls.
> I am trying to do this for a real time user request, but I expect the total
> processing to take some time so it's more of a user initiated call. There
> also may be dozens of users making the request at any given time so I want
> to farm this out into the MR world so that multiple instances of the job can
> be running (with completely different starting rows) at any given time.
> I could do this using a serialized local process but I explicitly want some
> of my processing, which could take some time, happening out in the map
> reduce world to take advantage of spare cycles elsewhere, as well as
> potential data locality and the fact that it is a parallelizable problem
> seems to imply that M/R would be a logical way to do it.
> James Kilbride
> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> Jean-Daniel Cryans
> Sent: Tuesday, July 06, 2010 1:12 PM
> To: general@hadoop.apache.org
> Subject: Re: MapReduce HBASE examples
> That won't be very efficient either... are you trying to do this for a real
> time user request? If so, it really isn't the way you want to go.
> If you are in a batch processing situation, I'd say it depends on how many
> rows you have vs. how many you need to retrieve, e.g. scanning 2B rows only to
> find 10 rows really doesn't make sense. How do you determine which users you
> need to process? How big is your dataset? I understand that you wish to use
> the MR-provided functionalities of grouping and such, but simply issuing a
> bunch of Gets in parallel may just be easier to write and maintain.
> J-D
> On Tue, Jul 6, 2010 at 10:02 AM, Kilbride, James P. <
> James.Kilbride@gd-ais.com> wrote:
> > So, if that's the case, and your argument makes sense given how scan
> > versus get works, I'd have to write a custom InputFormat class that looks
> > like the TableInputFormat class, but uses a get (or series of gets) rather
> > than a scan object as the current table mapper does?
> >
> > James Kilbride
> >
> > -----Original Message-----
> > From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> > Jean-Daniel Cryans
> > Sent: Tuesday, July 06, 2010 12:53 PM
> > To: general@hadoop.apache.org
> > Subject: Re: MapReduce HBASE examples
> >
> > >
> > >
> > > Does this make any sense?
> > >
> > >
> > Not in a MapReduce context. What you want to do is a LIKE with a bunch of
> > values, right? Since a mapper will always read all the input that it's given
> > (minus some filters like you can do with HBase), whatever you do will always
> > end up being a full table scan. You "could" solve your problem by
> > configuring your Scan object with a RowFilter that knows about the names you
> > are looking for, but that still ends up being a full scan on the region
> > server side, so it will be slow and will generate a lot of IO.
> >
> > WRT examples, HBase ships with a couple of utility classes that can also be
> > used as examples. The Export class has the Scan configuration stuff:
> >
> > http://github.com/apache/hbase/blob/0.20/src/java/org/apache/hadoop/hbase/mapreduce/Export.java
> >
> > J-D
> >
