hbase-user mailing list archives

From Raghava Mutharaju <m.vijayaragh...@gmail.com>
Subject Re: multiple reads from a Map - optimization question
Date Wed, 23 Jun 2010 15:27:51 GMT
Any advice on this one?

Raghava.

On Tue, Jun 22, 2010 at 1:51 PM, Raghava Mutharaju <
m.vijayaraghava@gmail.com> wrote:

> >>> This is not super clear, some comments inline.
> I will try to explain it better this time.
>
> The overall objective is to obtain, from the complete dataset, a subset to
> work on. This subset is obtained by applying 2-3 conditions (filters), where
> setting up one filter depends on the output of the previous filter. It works
> as follows:
>
> Filter-1: Set up on the scan that is used for the map (sketched below).
> Filter-2: From the row coming into the map, extract some fields and create a
> ColumnFilter/ValueFilter from them. The row is a delimited set of values.
> Filter-3: Apply Filter-2, extract the required fields from its output, do
> some processing, and write the result back to the HBase table.
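>
> To make it concrete, Filter-1 goes on the scan that drives the job, roughly
> like this (the table, family and qualifier names below are just placeholders
> for illustration):
>
>     // Filter-1: restrict what the mappers see in the first place
>     Scan scan = new Scan();
>     scan.setFilter(new SingleColumnValueFilter(Bytes.toBytes("cf"),
>         Bytes.toBytes("type"), CompareFilter.CompareOp.EQUAL,
>         Bytes.toBytes("condition1")));
>     TableMapReduceUtil.initTableMapperJob("mytable", scan, SubsetMapper.class,
>         ImmutableBytesWritable.class, Put.class, job);
>     // no reduce phase; the Puts emitted by the map go straight to the table
>     TableMapReduceUtil.initTableReducerJob("mytable", null, job);
>     job.setNumReduceTasks(0);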
>
> Filters 2 and 3 are used within the map, so I am issuing 1-2 Gets per row
> that the map receives. I cannot apply all the filters beforehand because
> each subsequent filter has to be created from the previous filter's output.
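>
> The mapper then does roughly the following (again the names are placeholders,
> and process() is just a stand-in for the field extraction/processing I do):
>
>     public class SubsetMapper extends TableMapper<ImmutableBytesWritable, Put> {
>       private HTable table;
>
>       protected void setup(Context context) throws IOException {
>         table = new HTable("mytable");   // the table I read with Gets below
>       }
>
>       public void map(ImmutableBytesWritable key, Result row, Context context)
>           throws IOException, InterruptedException {
>         // Filter-2: build a Get + ValueFilter out of fields of the incoming row
>         byte[] refKey = row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("ref"));
>         Get get = new Get(refKey);
>         get.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
>             new BinaryComparator(row.getValue(Bytes.toBytes("cf"),
>                 Bytes.toBytes("val")))));
>         Result related = table.get(get);   // one extra RPC per mapped row
>
>         // Filter-3: extract the required fields, process, and write back
>         if (related != null && !related.isEmpty()) {
>           Put put = new Put(row.getRow());
>           // process() = my own processing of the fetched fields
>           put.add(Bytes.toBytes("cf"), Bytes.toBytes("out"), process(related));
>           context.write(key, put);
>         }
>       }
>     }
>
> That table.get() call is the extra round trip made for every row the map
> receives.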
>
> Yes, there will be more data eventually. But currently I am testing on data
> that occupies only a single region, so only 1 map runs on the cluster and it
> takes in all the data.
>
> This approach is slow and it shows in the results. Is there any way to
> achieve this with much better performance?
>
> Thank you.
>
> Regards,
> Raghava.
>
>
> On Tue, Jun 22, 2010 at 12:57 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:
>
>> This is not super clear, some comments inline.
>>
>> J-D
>>
>> On Tue, Jun 22, 2010 at 12:49 AM, Raghava Mutharaju
>> <m.vijayaraghava@gmail.com> wrote:
>> > Hello all,
>> >
>> >      In the data, I have to check for multiple conditions and then work
>> > with the data that satisfies all the conditions. I am doing this as an MR
>> > job with no reduce, and the conditions are translated to a set of filters.
>> > Of the multiple conditions (2 or 3 max), data that satisfies the first one
>> > comes as input to the Map (the initial filter is set in the scan given to
>> > the mappers). Now, from the dataset that comes through to each map, I
>> > check for the other conditions (1 or 2 remaining conditions). Since map()
>> > is called for each row of data, this means 1 or 2 read calls (with a
>> > filter) to HBase tables. This setup, even for small data (data would fit in
>>
>> Here you talk about checking 1-2 conditions... are they checked on
>> the row that was mapped? Or does that mean you are doing 1-2 Gets
>> per row? If so, this is definitely going to be slow!
>>
>> > a region and so only 1 map is taking in all the data) is very slow.
>>
>> What do you mean? That currently your test is done on 1 region but you
>> expect more? If not, then don't use MR since that would give you
>> nothing more than more code to write and more processing time.
>>
>> >
>> > Here, note that I shouldn't be filtering the incoming data to the map;
>> > rather, based on that data, the next set of filtering conditions is formed.
>>
>> Can you give an example?
>>
>> >
>> > Can this be improved? Would constructing secondary indexes help (I would
>> > need a dramatic improvement, actually)? Or is this type of problem not
>> > suitable for HBase?
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Raghava.
>> >
>>
>
>
