Subject: Re: multiple reads from a Map - optimization question
From: Jean-Daniel Cryans
To: user@hbase.apache.org
Date: Wed, 23 Jun 2010 09:40:14 -0700

I'm still confused by 2 things:

- Are the Gets done on the same row that is mapped? Or on the same
table? Or another table?
- Can you give a real example of what you are trying to achieve? (with
fake data)

Thx

J-D

On Tue, Jun 22, 2010 at 10:51 AM, Raghava Mutharaju wrote:
>>>> This is not super clear, some comments inline.
> I will try & explain better this time.
>
> The overall objective -- from the complete dataset, obtain a subset of it to
> work on. This subset would be obtained by making use of the 2-3
> conditions (filters). The setting up of one filter depends on the output of
> the previous filter. It is as follows:
>
> Filter-1: Set up with the scan that is used for the map.
> Filter-2: From the row that is coming into the map, extract some fields and
> create a ColumnFilter/ValueFilter out of it. The row would be a delimited set
> of values.
> Filter-3: Apply Filter-2 and, from its output, extract the required fields
> and do some processing. Then write it back to an HBase table.
>
> Filters 2 and 3 are used within the map. So I am using 1-2 Gets per row that
> the map receives. I cannot apply all the filters beforehand because the
> subsequent filters have to be created based on the previous filter's output.
>
> Yes, there would be more data. But currently, I am testing on data which
> occupies only a single region. So only 1 map would be running on the cluster
> and it is taking in all the data.
>
> This approach is slow and it shows in the results. Is there any way in which
> this can be achieved with much improved performance?
>
> Thank you.
>
> Regards,
> Raghava.
>
> On Tue, Jun 22, 2010 at 12:57 PM, Jean-Daniel Cryans wrote:
>
>> This is not super clear, some comments inline.
>>
>> J-D
>>
>> On Tue, Jun 22, 2010 at 12:49 AM, Raghava Mutharaju wrote:
>> > Hello all,
>> >
>> > In the data, I have to check for multiple conditions and then work
>> > with the data that satisfies all the conditions. I am doing this as an MR
>> > job with no reduce, and the conditions are translated to a set of filters.
>> > Among the multiple conditions (2 or 3 max), data that satisfies one of them
>> > would come as input to the Map (the initial filter is set in the scan given
>> > to the mappers). Now, from among the dataset that comes through to each map,
>> > I would check for the other conditions (1 or 2 remaining conditions). Since
>> > map() is called for each row of data, it would mean 1 or 2 read calls (with
>> > a filter) to HBase tables. This setup, even for small data (data would fit in
>>
>> Here you talk about checking 1-2 conditions... are they checked on
>> the row that was mapped? Else that means that you are doing 1-2 Gets
>> per row? If so, this is definitely going to be slow!
>>
>> > a region, and so only 1 map is taking in all the data) is very slow.
>>
>> What do you mean? That currently your test is done on 1 region but you
>> expect more? If not, then don't use MR since that would give you
>> nothing more than more code to write and more processing time.
>>
>> >
>> > Here, note that I shouldn't be filtering the incoming data to the map, but
>> > based on that data, the next set of filtering conditions would be formed.
>>
>> Can you give an example?
>>
>> >
>> > Can this be improved? Would constructing secondary indexes help (it would
>> > need to be a dramatic improvement, actually)? Or is this type of problem
>> > not suitable for HBase?
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Raghava.
>> >
>>
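For concreteness, here is a minimal sketch of the kind of job setup being described, written against the old-style HBase mapreduce client API that was current around the time of this thread. None of this is code from the thread: the table, column family, qualifier, and value names ("source_table", "cf", "cond1", "x") are invented, and exact class and method names may differ in other HBase versions. Filter-1 is attached to the Scan that feeds the mappers, so only rows that already pass the first condition ever reach map(); the SubsetMapper class it refers to is sketched after this.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class SubsetJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();  // newer clients: HBaseConfiguration.create()
    Job job = new Job(conf, "subset-extraction");
    job.setJarByClass(SubsetJob.class);

    // Filter-1: only rows whose cf:cond1 column equals "x" ever reach map().
    Scan scan = new Scan();
    scan.setCaching(500);  // fetch more rows per RPC while scanning
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("cf"), Bytes.toBytes("cond1"),
        CompareOp.EQUAL, Bytes.toBytes("x")));

    // Map-only job: Filter-2 and Filter-3 happen inside the mapper.
    TableMapReduceUtil.initTableMapperJob(
        "source_table", scan, SubsetMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}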
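And a sketch of the mapper itself, showing the Filter-2/Filter-3 pattern under discussion: extract fields from the mapped row, build a Get with a ValueFilter from them, and write the processed result back to HBase. Again, every table, family, qualifier, and delimiter here is invented for illustration.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;

public class SubsetMapper extends TableMapper<NullWritable, NullWritable> {

  private static final byte[] CF = Bytes.toBytes("cf");

  private HTable lookupTable;  // table the per-row Gets go against
  private HTable outputTable;  // table the processed result is written to

  @Override
  protected void setup(Context context) throws IOException {
    // Open the tables once per task, not once per map() call.
    lookupTable = new HTable("lookup_table");
    outputTable = new HTable("output_table");
  }

  @Override
  protected void map(ImmutableBytesWritable row, Result value, Context context)
      throws IOException, InterruptedException {
    // Filter-2: pull fields out of the mapped row (a delimited string here)
    // and build the next lookup from them.
    String delimited = Bytes.toString(value.getValue(CF, Bytes.toBytes("data")));
    if (delimited == null) {
      return;  // mapped row has no cf:data cell, nothing to look up
    }
    String[] fields = delimited.split(",");
    if (fields.length < 2) {
      return;  // not enough fields to build the second filter
    }

    // One Get per mapped row: a synchronous RPC for every row the mapper sees.
    // The ValueFilter only trims what the Get returns, it does not avoid the trip.
    Get get = new Get(Bytes.toBytes(fields[0]));
    get.setFilter(new ValueFilter(CompareOp.EQUAL,
        new BinaryComparator(Bytes.toBytes(fields[1]))));
    Result lookup = lookupTable.get(get);

    // Filter-3: process whatever survived and write it back to HBase.
    if (!lookup.isEmpty()) {
      Put put = new Put(row.get());
      put.add(CF, Bytes.toBytes("derived"),
          lookup.getValue(CF, Bytes.toBytes("qual")));
      outputTable.put(put);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    outputTable.flushCommits();  // push any buffered Puts before the task exits
  }
}

The thing to notice is that the Get inside map() is a synchronous round trip to a region server for every mapped row, no matter what filter is attached to it, and with the data in a single region there is only one mapper, so all of those round trips run serially in one task. That combination, rather than the filters themselves, is what makes the job slow.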