hbase-user mailing list archives

From Raghava Mutharaju <m.vijayaragh...@gmail.com>
Subject Re: multiple reads from a Map - optimization question
Date Tue, 22 Jun 2010 17:51:09 GMT
>>> This is not super clear, some comments inline.
I will try & explain better this time.

The overall objective is to obtain, from the complete dataset, a subset to work on. This subset is obtained by applying 2-3 conditions (filters), and setting up each filter depends on the output of the previous one. It works as follows:

Filter-1: set up with the scan that is used for the map (rough sketch below).
Filter-2: from the row that comes into the map, extract some fields and create a
ColumnFilter/ValueFilter from them. A row is a delimited set of values.
Filter-3: apply Filter-2 and, from its output, extract the required fields and
do some processing. Then write the result back to an HBase table.
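
To make Filter-1 concrete, the setup looks roughly like the fragment below, inside the usual driver code (imports omitted). The table, column family, qualifier and value names are placeholders for the example, not my actual schema, and SubsetMapper is sketched further down:

// Filter-1 is set on the scan that feeds the mappers.
Scan scan = new Scan();
scan.setFilter(new SingleColumnValueFilter(
    Bytes.toBytes("cf"), Bytes.toBytes("cond1"),            // placeholder family/qualifier
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("someValue")));

TableMapReduceUtil.initTableMapperJob("source_table", scan,
    SubsetMapper.class, ImmutableBytesWritable.class, Put.class, job);
TableMapReduceUtil.initTableReducerJob("result_table", null, job);  // Puts emitted by the map go straight to this table
job.setNumReduceTasks(0);                                           // map-only, no reduce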

Filters 2 and 3 are used within the map, so I am issuing 1-2 Gets per row that
the map receives. I cannot apply all the filters beforehand because each
subsequent filter has to be created from the previous filter's output. A
stripped-down sketch of the map() follows.
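
Again with made-up names; the delimiter and the process() method are stand-ins for the real field extraction and processing:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;

public class SubsetMapper extends TableMapper<ImmutableBytesWritable, Put> {

  private HTable table;   // handle used for the per-row Get that carries Filter-2

  @Override
  protected void setup(Context context) throws IOException {
    table = new HTable("source_table");
  }

  @Override
  public void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // The incoming row has already passed Filter-1 and is a delimited set of
    // values; '|' is a placeholder delimiter here.
    String[] fields = Bytes.toString(
        value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("data"))).split("\\|");

    // Filter-2: built from fields of the incoming row and applied through a Get.
    Get get = new Get(Bytes.toBytes(fields[0]));
    get.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
        new BinaryComparator(Bytes.toBytes(fields[1]))));
    Result filtered = table.get(get);

    if (!filtered.isEmpty()) {
      // Filter-3: extract the required fields from Filter-2's output, process
      // them, and emit a Put that TableOutputFormat writes to result_table.
      Put put = new Put(row.get());
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("result"), process(filtered));
      context.write(row, put);
    }
  }

  // Stand-in for the actual processing done for Filter-3.
  private byte[] process(Result r) {
    return r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("needed"));
  }
}

(The HTable is opened once in setup() rather than on every call to map(), but the per-row Get is still there.)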

Yes, there will be more data. But currently I am testing on data that occupies
only a single region, so only 1 map runs on the cluster and it takes in all the
data.

This approach is slow, and it shows in the results. Is there any way this can
be achieved with much better performance?

Thank you.

Regards,
Raghava.

On Tue, Jun 22, 2010 at 12:57 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> This is not super clear, some comments inline.
>
> J-D
>
> On Tue, Jun 22, 2010 at 12:49 AM, Raghava Mutharaju
> <m.vijayaraghava@gmail.com> wrote:
> > Hello all,
> >
> >      In the data, I have to check for multiple conditions and then work
> > with the data that satisfies all the conditions. I am doing this as an MR
> > job with no reduce and the conditions are translated to a set of filters.
> > Among the multiple conditions (2 or 3 max), data that satisfies one of
> them
> > would come as input to the Map (initial filter is set in the scan to the
> > mappers). Now, from among the dataset that comes through to each map, I
> > would check for other conditions (1 or 2 remaining conditions). Since
> map()
> > is called for each row of data, it would mean 1 or 2 read calls (with
> > filter) to HBase tables. This setup, even for small data (data would fit
> in
>
> Here you talk about checking 1-2 conditions... are they checked on
> the row that was mapped? Else that means that you are doing 1-2 Get
> per row? If so, this is definitely going to be slow!
>
> > a region and so only 1 map is taking in all the data) is very slow.
>
> What do you mean? That currently your test is done on 1 region but you
> expect more? If not, then don't use MR since that would give you
> nothing more than more code to write and more processing time.
>
> >
> > Here, note that, I shouldn't be filtering the incoming data to map but
> based
> > on that data, next set of filtering conditions would be formed.
>
> Can you give an example?
>
> >
> > Can this be improved? Would constructing secondary indexes help (would
> need
> > a dramatic improvement actually)? Or is this type of problem not suitable
> > for HBase?
> >
> > Thank you.
> >
> > Regards,
> > Raghava.
> >
>
