hbase-user mailing list archives

From Sandy Pratt <prat...@adobe.com>
Subject RE: Hbase-Hive integration performance issues
Date Mon, 03 Oct 2011 20:36:58 GMT
I've been working on this issue lately.  I am beginning to deploy a modified version of the
stock HBase serde to my own clusters.  For one thing, it contains the code to push down scan
ranges to HBase (see the Jira), and I've also adapted it to read my single-cell protobuf records
via reflection.  Once I've tested it on larger datasets, I'll see about getting something
together that I can submit back to Hive.  But for now, the patch I posted on Jira should
apply to trunk (and also, I think, to cdh3u0, which I use) and allow range scans on the rowkey
to be pushed down (if it doesn't, please let me know ;) ).
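
For anyone following along, here is a minimal sketch of the range computation
behind a rowkey prefix pushdown. It is not the code from the Jira patch. The
function name prefix_to_scan_range is made up for illustration, and the sketch
only assumes that HBase orders rowkeys as raw bytes and treats the stop row of
a scan as exclusive.

```python
from typing import Optional, Tuple


def prefix_to_scan_range(prefix: bytes) -> Tuple[bytes, Optional[bytes]]:
    """Return (start_row, stop_row) covering every rowkey that begins with prefix.

    stop_row is exclusive. None means "scan to the end of the table".
    Hypothetical helper for illustration; not part of Hive or HBase.
    """
    start_row = prefix
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        # Empty prefix, or a prefix made only of 0xFF bytes: no upper bound.
        return start_row, None
    # Increment the last remaining byte to get the smallest key that is
    # greater than every key starting with the prefix.
    stop_row = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return start_row, stop_row


if __name__ == "__main__":
    print(prefix_to_scan_range(b"12345"))  # (b'12345', b'12346')
```

For the prefix '12345' this gives the same bounds as the shell scan quoted
below (STARTROW '12345', ENDROW '12346'), which is the range a pushdown would
hand to HBase instead of scanning the whole table.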

Sandy

> -----Original Message-----
> From: Andrew Purtell [mailto:apurtell@apache.org]
> Sent: Friday, September 30, 2011 09:50
> To: dev@hbase.apache.org; HBase User
> Subject: Re: Hbase-Hive integration performance issues
> 
> I believe this is the latest status:
> 
>     https://issues.apache.org/jira/browse/HIVE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> 
> Suggest following up to dev@hive.apache.org and/or user@hive.apache.org.
> 
> Best regards,
> 
> 
>    - Andy
> 
> Problems worthy of attack prove their worth by hitting back. - Piet Hein (via
> Tom White)
> 
> 
> >________________________________
> >From: Matthew Tovbin <matthew@tovbin.com>
> >To: HBase User <user@hbase.apache.org>
> >Cc: Hbase Dev <dev@hbase.apache.org>
> >Sent: Friday, September 30, 2011 5:49 AM
> >Subject: Re: Hbase-Hive integration performance issues
> >
> >Hello guys,
> >
> >Any updates on the issue? Anyone?! ;))
> >
> >Best regards,
> >   Matthew Tovbin =)
> >
> >
> >
> >On Tue, Sep 20, 2011 at 09:41, Matthew Tovbin <matthew@tovbin.com> wrote:
> >
> >>  Thanks Jean and Sandy.
> >>
> >>    I have hive 0.7.1, and according to this patch
> >> https://issues.apache.org/jira/browse/HIVE-1226 at least exact match
> >>queries like  "...where id = '12345'-123' " or partial pushdown
> >>"...where id  like "12345%" should work, but I didn't notice it.
> >>
> >> Matthew.
> >>
> >>
> >>
> >> On Mon, Sep 19, 2011 at 20:37, Sandy Pratt <prattrs@adobe.com> wrote:
> >>
> >>> I suffered the same let down a little while ago.  I believe this is
> >>> the relevant JIRA:
> >>>
> >>> https://issues.apache.org/jira/browse/HIVE-1643
> >>>
> >>> I'd also like to see Hive be able to limit scans to particular HBase
> >>> version ranges, but I don't know if that's even planned.
> >>>
> >>> Sandy
> >>>
> >>> > -----Original Message-----
> >>> > From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> >>> > Jean-Daniel Cryans
> >>> > Sent: Monday, September 19, 2011 09:58
> >>> > To: user@hbase.apache.org
> >>> > Subject: Re: Hbase-Hive integration performance issues
> >>> >
> >>> > (replying to user@, dev@ in BCC)
> >>> >
> >>> > AFAIK the HBase handler doesn't have the wits to understand that
> >>> > you are doing a prefix scan and thus limit the scan to only the
> >>> > required rows. There's a bunch of optimizations like that that need
> >>> > to be done.
> >>> >
> >>> > I'm pretty sure Pig does the same thing, but don't take my word on
> >>> > it.
> >>> >
> >>> > J-D
> >>> >
> >>> > On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin
> >>> > <matthew@tovbin.com>
> >>> > wrote:
> >>> > > Hi guys,
> >>> > >
> >>> > > I've got a table in Hbase let's say "tbl" and I would like to
> >>> > > query it using Hive. Therefore I mapped a table to hive as follows:
> >>> > >
> >>> > > CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
> >>> > > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> >>> > > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
> >>> > > TBLPROPERTIES("hbase.table.name" = "tbl");
> >>> > >
> >>> > > Queries like: "select * from tbl", "select id from tbl", "select
> >>> > > id, data from tbl" are really fast.
> >>> > > But queries like "select id from tbl where substr(id, 0, 5) =
> >>> > > "12345"" or "select id from tbl where data["777"] IS NOT NULL"
> >>> > > are incredibly slow.
> >>> > >
> >>> > > On the contrary, when running from Hbase shell: "scan 'tbl', {
> >>> > > COLUMNS=>'data', STARTROW=>'12345', ENDROW=>'12346'}" or "scan
> >>> > > 'tbl', { COLUMNS=>'data', "FILTER" =>
> >>> > > FilterList.new([qualifierFilter('777')])}"
> >>> > > it is lightning fast!
> >>> > >
> >>> > > When I looked into the mapred job generated by hive on
> >>> > > jobtracker I discovered that "map.input.records" counts ALL the
> >>> > > items in Hbase table, meaning the job makes a full table scan
> >>> > > before it even starts any mappers!!
> >>> > > Moreover, I suspect it copies all the data from Hbase table to
> >>> > > hdfs to mapper tmp input folder before execution.
> >>> > > So, my questions are - Why does the hbase storage handler for
> >>> > > hive not translate hive queries into appropriate hbase functions?
> >>> > > Why does it scan all the records and then slice them using the
> >>> > > "where" clause? How can it be improved? Is Pig's integration
> >>> > > better in this case?
> >>> > >
> >>> > >
> >>> > > Some additional information about the tables:
> >>> > > Table description in Hbase:
> >>> > > jruby-1.6.2 :011 >   describe 'tbl'
> >>> > > DESCRIPTION                                                   ENABLED
> >>> > >  {NAME => 'users', FAMILIES => [{NAME => 'data',              true
> >>> > >  BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0',
> >>> > >  COMPRESSION => 'LZO', VERSIONS => '3', TTL => '2147483647',
> >>> > >  BLOCKSIZE => '65536', IN_MEMORY => 'false',
> >>> > >  BLOCKCACHE => 'true'}]}
> >>> > >
> >>> > > Table description in Hive:
> >>> > > hive> describe tbl;
> >>> > > OK
> >>> > > id string from deserializer
> >>> > > data map<string,string> from deserializer
> >>> > > Time taken: 0.08 seconds
> >>> > >
> >>> > > Best regards,
> >>> > >   Matthew Tovbin =)
> >>> > >
> >>>
> >>
> >>
> >
> >
> >
