hbase-user mailing list archives

From Raghava Mutharaju <m.vijayaragh...@gmail.com>
Subject Re: multiple reads from a Map - optimization question
Date Wed, 23 Jun 2010 17:16:51 GMT
Hello JD,

Thank you for the response.

>>> Are the Gets done on the same row that is mapped? Or on the same table?
Or another table?
    By "the row that is mapped", do you mean the row that is given to the map()
method as a <K,V> pair? If so, then no: the data from this row is used to construct
a filter which is applied to another table. So the Gets are not on the same table
that this row came from.

>>> Can you give a real example of what you are trying to achieve?
It is similar to a rule engine. I have to take in the data, apply some rules
to it and generate new data. These rules can be thought of as "if..then"
statements with multiple conditions in the "if" part. I have to check which subset of
the data satisfies these conditions in order to apply the "then" part.
Eg: Transitive property: if (A < B and B < C and C < D) then A < D
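To make the rule concrete, here is a minimal in-memory sketch (plain Java, no HBase; the Pair record, the class name, and the sample facts are all made up for illustration) of applying the transitive "then" part to a small set of ordered pairs:

```java
import java.util.*;

public class TransitiveRule {
    // A pair (x, y) records the fact "x < y".
    record Pair(String x, String y) {}

    // Derive (a, d) whenever the chain a<b, b<c, c<d exists in the input facts.
    static Set<Pair> applyRule(Set<Pair> facts) {
        Set<Pair> derived = new HashSet<>();
        for (Pair ab : facts)
            for (Pair bc : facts)
                if (ab.y().equals(bc.x()))          // join a<b with b<c
                    for (Pair cd : facts)
                        if (bc.y().equals(cd.x()))  // join with c<d
                            derived.add(new Pair(ab.x(), cd.y())); // conclude a<d
        return derived;
    }

    public static void main(String[] args) {
        Set<Pair> facts = new HashSet<>(List.of(
            new Pair("A", "B"), new Pair("B", "C"), new Pair("C", "D")));
        System.out.println(applyRule(facts)); // the single derived fact A < D
    }
}
```

In the MR setup described below, each of the three joins in this nested loop corresponds to one filtered read against the table, which is where the per-row read cost comes from.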

To implement this, I am using multiple filters. For the initial scan,
which forms the InputSplit for the maps, I put in the first filter (say,
something like "get all the values which are > A"). Then in the map, for each
value taken in (say B), I have to apply 2 more filters:
Filter-1: Find all values (say C) which are greater than the B from the above
step.
Filter-2: For each value C obtained as output of Filter-1, find the values
greater than C (designated as D).
The output of Filter-2 is then written out to a table.
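The chain of scans above can be sketched in memory as follows (plain Java; scanGreaterThan is a hypothetical stand-in for one filtered HBase Scan/Get, and the sample table is invented). Each call to the inner stand-in represents one round trip to the second table, which is why the cost multiplies per mapped row:

```java
import java.util.*;
import java.util.stream.*;

public class FilterChain {
    // Stand-in for a filtered scan: all values strictly greater than floor.
    static List<Integer> scanGreaterThan(List<Integer> table, int floor) {
        return table.stream().filter(v -> v > floor).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> table = List.of(1, 3, 5, 7);
        int a = 1;
        List<int[]> output = new ArrayList<>();
        // Initial scan (the map input): every B > A.
        for (int b : scanGreaterThan(table, a))
            // Filter-1: one read per B, returning every C > B.
            for (int c : scanGreaterThan(table, b))
                // Filter-2: one more read per C, returning every D > C.
                for (int d : scanGreaterThan(table, c))
                    output.add(new int[]{a, d}); // the "then" part: A < D
        System.out.println(output.size() + " derived pair(s)");
    }
}
```

For this toy table the three nested passes issue a read for every B and every C, even when the final D set is empty; the same shape in the real job means 1-2 extra Gets/Scans for every row the map receives.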

Since there are multiple reads involved for each row received by the map,
it is slow. Is there any way to improve the speed? Or is this type of
problem not suitable for HBase/Hadoop?

Regards,
Raghava.

On Wed, Jun 23, 2010 at 12:40 PM, Jean-Daniel Cryans <jdcryans@apache.org> wrote:

> I'm still confused by 2 things:
>
>  - Are the Gets done on the same row that is mapped? Or on the same
> table? Or another table?
>  - Can you give a real example of what you are trying to achieve?
> (with fake data)
>
> Thx
>
> J-D
>
> On Tue, Jun 22, 2010 at 10:51 AM, Raghava Mutharaju
> <m.vijayaraghava@gmail.com> wrote:
> >>>> This is not super clear, some comments inline.
> > I will try & explain better this time.
> >
> > The overall objective -- from the complete dataset, obtain a subset of it
> to
> > work on. Now this subset would be obtained by making use of the 2-3
> > conditions (filters). The setting up of one filter depends on the output
> of
> > the previous filter. It is as follows
> >
> > Filter-1: Setup with the scan that is used for the map.
> > Filter-2: From the row that is coming into the map, extract some fields
> and
> > create a ColumnFilter/ValueFilter out of it. Row would be a delimited set
> of
> > values.
> > Filter-3: Apply filter-2 and from its output, extract the required fields
> > and do some processing. Then write it back to HBase table.
> >
> > Filters-2, 3 are used within the map. So I am using 1-2 Gets per row that
> > map receives. I cannot apply all the filters beforehand because the
> > subsequent filters have to be created based on previous filter's output.
> >
> > Yes, there would be more data. But currently, I am testing on data which
> > occupied only a single region. So only 1 map would be running on the
> cluster
> > and it is taking in all the data.
> >
> > This approach is slow and it shows in the results. Is there any way in
> > which this can be achieved with much improved performance?
> >
> > Thank you.
> >
> > Regards,
> > Raghava.
> >
> > On Tue, Jun 22, 2010 at 12:57 PM, Jean-Daniel Cryans <
> jdcryans@apache.org>wrote:
> >
> >> This is not super clear, some comments inline.
> >>
> >> J-D
> >>
> >> On Tue, Jun 22, 2010 at 12:49 AM, Raghava Mutharaju
> >> <m.vijayaraghava@gmail.com> wrote:
> >> > Hello all,
> >> >
> >> >      In the data, I have to check for multiple conditions and then
> work
> >> > with the data that satisfies all the conditions. I am doing this as an
> MR
> >> > job with no reduce and the conditions are translated to a set of
> filters.
> >> > Among the multiple conditions (2 or 3 max), data that satisfies one of
> >> them
> >> > would come as input to the Map (initial filter is set in the scan to
> the
> >> > mappers). Now, from among the dataset that comes through to each map,
> I
> >> > would check for other conditions (1 or 2 remaining conditions). Since
> >> map()
> >> > is called for each row of data, it would mean 1 or 2 read calls (with
> >> > filter) to HBase tables. This setup, even for small data (data would
> fit
> >> in
> >>
> >> Here you talk about checking 1-2 conditions... are they checked on
> >> the row that was mapped? Else that means that you are doing 1-2 Get
> >> per row? If so, this is definitely going to be slow!
> >>
> >> > a region and so only 1 map is taking in all the data) is very slow.
> >>
> >> What do you mean? That currently your test is done on 1 region but you
> >> expect more? If not, then don't use MR since that would give you
> >> nothing more than more code to write and more processing time.
> >>
> >> >
> >> > Here, note that, I shouldn't be filtering the incoming data to map but
> >> based
> >> > on that data, next set of filtering conditions would be formed.
> >>
> >> Can you give an example?
> >>
> >> >
> >> > Can this be improved? Would constructing secondary indexes help (would
> >> need
> >> > a dramatic improvement actually)? Or is this type of problem not
> suitable
> >> > for HBase?
> >> >
> >> > Thank you.
> >> >
> >> > Regards,
> >> > Raghava.
> >> >
> >>
> >
>
