hbase-dev mailing list archives

From Andrew Purtell <apurt...@apache.org>
Subject Re: Hbase-Hive integration performance issues
Date Fri, 30 Sep 2011 16:49:34 GMT
I believe this is the latest status:

    https://issues.apache.org/jira/browse/HIVE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Suggest following up to dev@hive.apache.org and/or user@hive.apache.org.

Best regards,


   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)


>________________________________
>From: Matthew Tovbin <matthew@tovbin.com>
>To: HBase User <user@hbase.apache.org>
>Cc: Hbase Dev <dev@hbase.apache.org>
>Sent: Friday, September 30, 2011 5:49 AM
>Subject: Re: Hbase-Hive integration performance issues
>
>Hello guys,
>
>Any updates on the issue? Anyone?! ;))
>
>Best regards,
>   Matthew Tovbin =)
>
>
>
>On Tue, Sep 20, 2011 at 09:41, Matthew Tovbin <matthew@tovbin.com> wrote:
>
>>  Thanks Jean and Sandy.
>>
>>    I have Hive 0.7.1, and according to this patch
>> https://issues.apache.org/jira/browse/HIVE-1226 at least exact-match
>> queries like "...where id = '12345-123'" or partial pushdown like "...where id
>> like '12345%'" should work, but I didn't notice any improvement.
>>
>> Matthew.
>>
>>
>>
>> On Mon, Sep 19, 2011 at 20:37, Sandy Pratt <prattrs@adobe.com> wrote:
>>
>>> I suffered the same let down a little while ago.  I believe this is the
>>> relevant JIRA:
>>>
>>> https://issues.apache.org/jira/browse/HIVE-1643
>>>
>>> I'd also like to see Hive be able to limit scans to particular HBase
>>> version ranges, but I don't know if that's even planned.
>>>
>>> Sandy
>>>
>>> > -----Original Message-----
>>> > From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-
>>> > Daniel Cryans
>>> > Sent: Monday, September 19, 2011 09:58
>>> > To: user@hbase.apache.org
>>> > Subject: Re: Hbase-Hive integration performance issues
>>> >
>>> > (replying to user@, dev@ in BCC)
>>> >
>>> > AFAIK the HBase handler doesn't have the wits to understand that you are
>>> > doing a prefix scan and thus limit the scan to only the required rows.
>>> > There's a bunch of optimizations like that that need to be done.
>>> >
>>> > I'm pretty sure Pig does the same thing, but don't take my word on it.
>>> >
>>> > J-D
>>> >
>>> > On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin <matthew@tovbin.com>
>>> > wrote:
>>> > > Hi guys,
>>> > >
>>> > > I've got a table in HBase, let's say "tbl", and I would like to query it
>>> > > using Hive. Therefore I mapped the table to Hive as follows:
>>> > >
>>> > > CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
>>> > > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>> > > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
>>> > > TBLPROPERTIES("hbase.table.name" = "tbl");
>>> > >
>>> > > Queries like: "select * from tbl", "select id from tbl", "select id,
>>> > > data from tbl" are really fast.
>>> > > But queries like "select id from tbl where substr(id, 0, 5) = "12345""
>>> > > or "select id from tbl where data["777"] IS NOT NULL" are incredibly slow.
>>> > >
>>> > > On the contrary, when running from the HBase shell, "scan 'tbl', {
>>> > > COLUMNS=>'data', STARTROW=>'12345', ENDROW=>'12346'}" or "scan 'tbl', {
>>> > > COLUMNS=>'data', "FILTER" =>
>>> > > FilterList.new([qualifierFilter('777')])}"
>>> > > is lightning fast!
>>> > >
>>> > > When I looked into the mapred job generated by Hive on the jobtracker, I
>>> > > discovered that "map.input.records" counts ALL the items in the HBase
>>> > > table, meaning the job makes a full table scan before it even starts any
>>> > > mappers!!
>>> > > Moreover, I suspect it copies all the data from the HBase table to HDFS,
>>> > > into the mapper tmp input folder, before execution.
>>> > >
>>> > > So, my questions are: Why does the HBase storage handler for Hive not
>>> > > translate Hive queries into appropriate HBase functions? Why does it scan
>>> > > all the records and then slice them using the "where" clause? How can it
>>> > > be improved? Is Pig's integration better in this case?
>>> > >
>>> > >
>>> > > Some additional information about the tables:
>>> > > Table description in Hbase:
>>> > > jruby-1.6.2 :011 >   describe 'tbl'
>>> > > DESCRIPTION                                                    ENABLED
>>> > >  {NAME => 'users', FAMILIES => [{NAME => 'data', BLOOMFILTER   true
>>> > >  => 'ROWCOL', REPLICATION_SCOPE => '0', COMPRESSION => 'LZO',
>>> > >  VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536',
>>> > >  IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
>>> > >
>>> > > Table description in Hive:
>>> > > hive> describe tbl;
>>> > > OK
>>> > > id      string  from deserializer
>>> > > data    map<string,string>      from deserializer
>>> > > Time taken: 0.08 seconds
>>> > >
>>> > > Best regards,
>>> > >   Matthew Tovbin =)
>>> > >
>>>
>>
>>
>
>
>
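
The fast HBase shell scan quoted in this thread (start row '12345', end row '12346') works because HBase keeps row keys sorted as byte strings, so a key prefix maps to one contiguous range. Below is a minimal sketch of that mapping in plain Python, with no HBase or Hive client involved. The names prefix_to_range and range_scan are hypothetical, and the sorted list only stands in for a table; this is not how the Hive storage handler is implemented.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def prefix_to_range(prefix: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (start_row, stop_row) covering every key that starts with prefix.

    stop_row is exclusive; None means "scan to the end of the table".
    Hypothetical helper for illustration only.
    """
    # Trailing 0xFF bytes cannot be incremented without a carry, so drop them.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        # Empty prefix or all 0xFF: there is no finite upper bound.
        return prefix, None
    # Incrementing the last remaining byte gives the smallest key
    # that sorts after every key carrying the prefix.
    stop = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, stop


def range_scan(sorted_keys: List[bytes], start: bytes,
               stop: Optional[bytes]) -> List[bytes]:
    """Return the keys in [start, stop) from a sorted list via binary search."""
    lo = bisect_left(sorted_keys, start)
    hi = len(sorted_keys) if stop is None else bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


# Toy "table": row keys in byte order, as HBase stores them.
keys = sorted([b"12344-9", b"12345", b"12345-123", b"123456", b"12346", b"2"])
start, stop = prefix_to_range(b"12345")
matches = range_scan(keys, start, stop)
```

With such a range, a scan reads only the matching slice of rows instead of reading every row and filtering afterwards, which is the behaviour the full "map.input.records" count above points to.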