hbase-dev mailing list archives

From Matthew Tovbin <matt...@tovbin.com>
Subject Re: Hbase-Hive integration performance issues
Date Tue, 20 Sep 2011 06:41:02 GMT
Thanks Jean and Sandy.

I have Hive 0.7.1, and according to this patch
https://issues.apache.org/jira/browse/HIVE-1226 at least exact-match queries
like "... where id = '12345-123'" or partial pushdown for "... where id like
'12345%'" should work, but I didn't see any such improvement.

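For reference, a minimal sketch (plain Ruby, since the HBase shell is JRuby)
of the row range that a prefix predicate such as id like '12345%' would have
to be turned into for pushdown to help. The helper name prefix_stop_row is
hypothetical and not part of Hive or HBase; it only computes the exclusive
stop key for a given prefix, i.e. the STARTROW/ENDROW pair used in the shell
scan quoted below.

```ruby
# Hypothetical helper (not an HBase or Hive API): compute the exclusive stop
# row for a prefix scan by incrementing the last byte that is not 0xFF.
def prefix_stop_row(prefix)
  bytes = prefix.bytes
  # Trailing 0xFF bytes cannot be incremented, so drop them first.
  bytes.pop while !bytes.empty? && bytes.last == 0xFF
  # Empty or all-0xFF prefix: there is no upper bound, scan to the table end.
  return nil if bytes.empty?
  bytes[-1] += 1
  bytes.pack('C*')
end

start_row = '12345'
stop_row  = prefix_stop_row(start_row)
puts "STARTROW => #{start_row.inspect}, ENDROW => #{stop_row.inspect}"
```

For the prefix '12345' this gives the stop row '12346', so a scan bounded
this way covers only keys that start with the prefix.
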
Matthew.



On Mon, Sep 19, 2011 at 20:37, Sandy Pratt <prattrs@adobe.com> wrote:

> I suffered the same let down a little while ago.  I believe this is the
> relevant JIRA:
>
> https://issues.apache.org/jira/browse/HIVE-1643
>
> I'd also like to see Hive be able to limit scans to particular HBase
> version ranges, but I don't know if that's even planned.
>
> Sandy
>
> > -----Original Message-----
> > From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-
> > Daniel Cryans
> > Sent: Monday, September 19, 2011 09:58
> > To: user@hbase.apache.org
> > Subject: Re: Hbase-Hive integration performance issues
> >
> > (replying to user@, dev@ in BCC)
> >
> > AFAIK the HBase handler doesn't have the wits to understand that you are
> > doing a prefix scan and thus limit the scan to only the required rows.
> > There's a bunch of optimizations like that that need to be done.
> >
> > I'm pretty sure Pig does the same thing, but don't take my word on it.
> >
> > J-D
> >
> > On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin <matthew@tovbin.com>
> > wrote:
> > > Hi guys,
> > >
> > > > I've got a table in HBase, let's say "tbl", and I would like to query
> > > > it using Hive. Therefore I mapped the table to Hive as follows:
> > >
> > > > CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
> > > > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> > > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
> > > TBLPROPERTIES("hbase.table.name" = "tbl");
> > >
> > > Queries like: "select * from tbl", "select id from tbl", "select id,
> > > data from tbl" are really fast.
> > > > But queries like "select id from tbl where substr(id, 0, 5) = '12345'"
> > > > or "select id from tbl where data['777'] IS NOT NULL" are incredibly
> > > > slow.
> > >
> > > > In contrast, when running from the HBase shell: "scan 'tbl', {
> > > > COLUMNS => 'data', STARTROW => '12345', ENDROW => '12346'}" or
> > > > "scan 'tbl', { COLUMNS => 'data', "FILTER" =>
> > > > FilterList.new([qualifierFilter('777')])}"
> > > > it is lightning fast!
> > >
> > > > When I looked into the mapred job generated by Hive on the jobtracker I
> > > > discovered that "map.input.records" counts ALL the items in the HBase
> > > > table, meaning the job makes a full table scan before it even starts
> > > > any mappers!!
> > > > Moreover, I suspect it copies all the data from the HBase table to HDFS,
> > > > into the mapper tmp input folder, before execution.
> > >
> > > > So, my questions are: Why doesn't the HBase storage handler for Hive
> > > > translate Hive queries into the appropriate HBase functions? Why does
> > > > it scan all the records and then slice them using the "where" clause?
> > > > How can it be improved? Is Pig's integration better in this case?
> > >
> > >
> > > Some additional information about the tables:
> > > Table description in Hbase:
> > > > jruby-1.6.2 :011 >   describe 'tbl'
> > > > DESCRIPTION                                                ENABLED
> > > >  {NAME => 'users', FAMILIES => [{NAME => 'data',           true
> > > >  BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0',
> > > >  COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647',
> > > >  BLOCKSIZE => '65536', IN_MEMORY => 'false',
> > > >  BLOCKCACHE => 'true'}]}
> > >
> > > > Table description in Hive:
> > > > hive> describe tbl;
> > > > OK
> > > > id string from deserializer
> > > > data map<string,string> from deserializer
> > > > Time taken: 0.08 seconds
> > >
> > > Best regards,
> > >   Matthew Tovbin =)
> > >
>

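A toy model of the "map.input.records" observation quoted above. This is not
HBase code: a sorted Ruby array stands in for the table, and all names are
made up for illustration. It counts how many rows are read when filtering
after a full scan versus seeking to a start row and stopping at a stop row.

```ruby
# Toy model only: a sorted array of row keys stands in for an HBase table.
# Zero-padded keys sort lexicographically in the same order as numerically.
ROW_KEYS = (0...100_000).map { |i| format('%05d', i) }.freeze

# Full scan, then filter: every row is read before the predicate is applied.
def full_scan_with_filter(keys, prefix)
  rows_read = 0
  matches = keys.select do |key|
    rows_read += 1
    key.start_with?(prefix)
  end
  [matches, rows_read]
end

# Range scan: binary-search to the first key >= start_row, then read rows
# until a key >= stop_row (exclusive) is reached.
def range_scan(keys, start_row, stop_row)
  first = keys.bsearch_index { |key| key >= start_row } || keys.length
  rows_read = 0
  matches = []
  keys[first..-1].each do |key|
    break if key >= stop_row
    rows_read += 1
    matches << key
  end
  [matches, rows_read]
end

full_matches, full_read   = full_scan_with_filter(ROW_KEYS, '123')
range_matches, range_read = range_scan(ROW_KEYS, '123', '124')
puts "full scan:  #{full_matches.length} matches, #{full_read} rows read"
puts "range scan: #{range_matches.length} matches, #{range_read} rows read"
```

With 100,000 five-digit keys and the prefix '123', both approaches return the
same 100 rows, but the full scan reads all 100,000 rows while the range scan
reads only 100.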