hbase-dev mailing list archives

From Jean-Daniel Cryans <jdcry...@apache.org>
Subject Re: Hbase-Hive integration performance issues
Date Mon, 19 Sep 2011 16:57:48 GMT
(replying to user@, dev@ in BCC)

AFAIK the HBase handler doesn't have the smarts to recognize that you
are doing a prefix scan, so it won't limit the scan to only the required
rows. There's a bunch of optimizations like that that still need to be done.
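
To illustrate, this is roughly the bounded scan the handler would have to
generate for the prefix query quoted below (a sketch against the 0.90-era
Java client API; the table name "tbl", family "data" and the "12345" prefix
are taken from the examples in the quoted mail, the class name is just for
illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScanSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "tbl");

    // Equivalent of: scan 'tbl', { COLUMNS => 'data',
    //                              STARTROW => '12345', ENDROW => '12346' }
    // Only the regions covering [12345, 12346) are touched, instead of
    // every row in the table.
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data"));
    scan.setStartRow(Bytes.toBytes("12345"));
    scan.setStopRow(Bytes.toBytes("12346"));  // stop row is exclusive

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        System.out.println(Bytes.toString(r.getRow()));
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

Until those optimizations land, the handler just hands MapReduce an
unbounded scan over the whole table and Hive applies the WHERE clause
afterwards.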

I'm pretty sure Pig does the same thing, but don't take my word on it.

J-D

On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin <matthew@tovbin.com> wrote:
> Hi guys,
>
> I've got a table in HBase, let's say "tbl", and I would like to query it using
> Hive, so I mapped the table to Hive as follows:
>
> CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
> TBLPROPERTIES("hbase.table.name" = "tbl");
>
> Queries like "select * from tbl", "select id from tbl", "select id, data
> from tbl" are really fast.
> But queries like "select id from tbl where substr(id, 0, 5) = '12345'"
> or "select id from tbl where data['777'] IS NOT NULL" are incredibly slow.
>
> On the contrary, when running from the HBase shell: "scan 'tbl', {
> COLUMNS => 'data', STARTROW => '12345', ENDROW => '12346' }" or
> "scan 'tbl', { COLUMNS => 'data', "FILTER" =>
> FilterList.new([qualifierFilter('777')])}"
> it is lightning fast!
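>
> For reference, the second shell scan corresponds roughly to the following
> with the plain HBase Java client (a sketch against the 0.90-era API; the
> filter is evaluated on the region servers, which is why only matching cells
> come back):
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.filter.BinaryComparator;
> import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
> import org.apache.hadoop.hbase.filter.FilterList;
> import org.apache.hadoop.hbase.filter.QualifierFilter;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class QualifierScanSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     HTable table = new HTable(conf, "tbl");
>
>     // Java equivalent of the FilterList/QualifierFilter scan above.
>     Scan scan = new Scan();
>     scan.addFamily(Bytes.toBytes("data"));
>     FilterList filters = new FilterList();
>     filters.addFilter(new QualifierFilter(CompareOp.EQUAL,
>         new BinaryComparator(Bytes.toBytes("777"))));
>     scan.setFilter(filters);
>
>     ResultScanner scanner = table.getScanner(scan);
>     try {
>       for (Result r : scanner) {
>         System.out.println(Bytes.toString(r.getRow()));
>       }
>     } finally {
>       scanner.close();
>       table.close();
>     }
>   }
> }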
>
> When I looked into the mapred job generated by Hive on the jobtracker I
> discovered that "map.input.records" counts ALL the items in the HBase table,
> meaning the job makes a full table scan before it even starts any mappers!!
> Moreover, I suspect it copies all the data from the HBase table to HDFS, into
> the mapper's tmp input folder, before execution.
>
> So, my questions are: Why does the HBase storage handler for Hive not translate
> Hive queries into the appropriate HBase operations? Why does it scan all the
> records and only then filter them with the "where" clause? How can this be
> improved? Is Pig's integration better in this case?
>
>
> Some additional information about the tables:
> Table description in HBase:
> jruby-1.6.2 :011 > describe 'tbl'
> DESCRIPTION                                                               ENABLED
>  {NAME => 'users', FAMILIES => [{NAME => 'data', BLOOMFILTER => 'ROWCOL',  true
>  REPLICATION_SCOPE => '0', COMPRESSION => 'LZO', VERSIONS => '3',
>  TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
>  BLOCKCACHE => 'true'}]}
>
> Table description in Hive:
> hive> describe tbl;
> OK
> id string from deserializer
> data map<string,string> from deserializer
> Time taken: 0.08 seconds
>
> Best regards,
>   Matthew Tovbin =)
>
