Mailing-List: contact dev-help@hbase.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hbase.apache.org
Received-SPF: pass (athena.apache.org: domain of tovbinm@gmail.com designates
 209.85.210.169 as permitted sender)
MIME-Version: 1.0
Sender: tovbinm@gmail.com
From: Matthew Tovbin <matthew@tovbin.com>
Date: Sun, 18 Sep 2011 14:09:34 +0300
Message-ID: 
 <CADSCZ4RoJrs_66mT_Jj3PdE1PSdvpKtroHxDFWqknBxxP8WnLQ@mail.gmail.com>
Subject: Hbase-Hive integration performance issues
To: user@hbase.apache.org, dev@hbase.apache.org
Content-Type: multipart/alternative; boundary=00151774158617535504ad3544de

--00151774158617535504ad3544de
Content-Type: text/plain; charset=ISO-8859-1

Hi guys,

I've got a table in Hbase let's say "tbl" and I would like to query it using
Hive. Therefore I mapped a table to hive as follows:

CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
TBLPROPERTIES("hbase.table.name" = "tbl");

Queries like: "select * from tbl", "select id from tbl", "select id, data
from tbl" are really fast.
But queries like "select id from tbl where substr(id, 0, 5) = "12345""
or "select id from tbl where data["777"] IS NOT NULL" are incredibly slow.

In the contrary when running from Hbase shell: "scan 'tbl', {
COLUMNS=>'data', STARTROW='12345', ENDROW='12346'}" or
"scan 'tbl', { COLUMNS=>'data', "FILTER" =>
FilterList.new([qualifierFilter('777')])}"
it is lightning fast!

When I looked into the mapred job generated by hive on jobtracker I
discovered that "map.input.records" counts ALL the items in Hbase table,
meaning the job makes a full table scan before it even starts any mappers!!
Moreover, I suspect it copies all the data from Hbase table to hdfs to
mapper tmp input folder before executuion.

So, my questions are - Why hbase storage handler for hive does not translate
hive queries into appropriate hbase functions? Why it scans all the records
and then slices them using "where" clause? How can it be improved? Is Pig's
integration better in this case?


Some additional information about the tables:
Table description in Hbase:
jruby-1.6.2 :011 >   describe 'tbl'
DESCRIPTION
              ENABLED
 {NAME => 'users', FAMILIES => [{NAME => 'data', BLOOMFILTER => 'ROWCOL',
REPLICATIO true
 N_SCOPE => '0', COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647',
BLOCKSIZE =>
  '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

Table desciption in Hive:
hive> describe tbl;
OK
id string from deserializer
data map<string,string> from deserializer
Time taken: 0.08 seconds


Best regards,
   Matthew Tovbin =)

--00151774158617535504ad3544de--