hbase-dev mailing list archives

From Matthew Tovbin <matt...@tovbin.com>
Subject Re: Hbase-Hive integration performance issues
Date Mon, 03 Oct 2011 20:42:46 GMT
Thanks Sandy, I'll try it too!

Best regards,
   Matthew Tovbin =)



On Mon, Oct 3, 2011 at 22:36, Sandy Pratt <prattrs@adobe.com> wrote:

> I've been working on this issue lately.  I am beginning to deploy a
> modified version of the stock HBase serde to my own clusters.  For one
> thing, it contains the code to push down scan ranges to HBase (see jira),
> and I've also adapted it to read my single-cell protobuf records via
> reflection.  Once I've tested it on larger datasets, I'll see about getting
> something together that I can submit back to Hive.  But for now the patch
> I posted on Jira should apply to trunk (and also cdh3u0, which I use, I
> think) and allow range scans on the rowkey to be pushed down (if it doesn't
> please let me know ;) ).
>
> Sandy
>
> > -----Original Message-----
> > From: Andrew Purtell [mailto:apurtell@apache.org]
> > Sent: Friday, September 30, 2011 09:50
> > To: dev@hbase.apache.org; HBase User
> > Subject: Re: Hbase-Hive integration performance issues
> >
> > I believe this is the latest status:
> >
> >     https://issues.apache.org/jira/browse/HIVE-1643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> >
> > Suggest following up to dev@hive.apache.org and/or user@hive.apache.org.
> >
> > Best regards,
> >
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
> >
> > >________________________________
> > >From: Matthew Tovbin <matthew@tovbin.com>
> > >To: HBase User <user@hbase.apache.org>
> > >Cc: Hbase Dev <dev@hbase.apache.org>
> > >Sent: Friday, September 30, 2011 5:49 AM
> > >Subject: Re: Hbase-Hive integration performance issues
> > >
> > >Hello guys,
> > >
> > >Any updates on the issue? Anyone?! ;))
> > >
> > >Best regards,
> > >   Matthew Tovbin =)
> > >
> > >
> > >
> > >On Tue, Sep 20, 2011 at 09:41, Matthew Tovbin <matthew@tovbin.com> wrote:
> > >
> > >>  Thanks Jean and Sandy.
> > >>
> > >>    I have hive 0.7.1, and according to this patch
> > >> https://issues.apache.org/jira/browse/HIVE-1226 at least exact match
> > >>queries like  "...where id = '12345'-123' " or partial pushdown
> > >>"...where id  like "12345%" should work, but I didn't notice it.
> > >>
> > >> Matthew.
> > >>
> > >>
> > >>
> > >> On Mon, Sep 19, 2011 at 20:37, Sandy Pratt <prattrs@adobe.com> wrote:
> > >>
> > >>> I suffered the same let down a little while ago.  I believe this is
> > >>> the relevant JIRA:
> > >>>
> > >>> https://issues.apache.org/jira/browse/HIVE-1643
> > >>>
> > >>> I'd also like to see Hive be able to limit scans to particular HBase
> > >>> version ranges, but I don't know if that's even planned.
> > >>>
> > >>> Sandy
> > >>>
> > >>> > -----Original Message-----
> > >>> > From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of
> > >>> > Jean-Daniel Cryans
> > >>> > Sent: Monday, September 19, 2011 09:58
> > >>> > To: user@hbase.apache.org
> > >>> > Subject: Re: Hbase-Hive integration performance issues
> > >>> >
> > >>> > (replying to user@, dev@ in BCC)
> > >>> >
> > >>> > AFAIK the HBase handler doesn't have the wits to understand that
> > >>> > you are doing a prefix scan and thus limit the scan to only the
> > >>> > required rows. There's a bunch of optimizations like that that need
> > >>> > to be done.
> > >>> >
> > >>> > I'm pretty sure Pig does the same thing, but don't take my word on it.
> > >>> >
> > >>> > J-D
> > >>> >
> > >>> > On Sun, Sep 18, 2011 at 4:12 AM, Matthew Tovbin
> > >>> > <matthew@tovbin.com>
> > >>> > wrote:
> > >>> > > Hi guys,
> > >>> > >
> > >>> > > I've got a table in HBase, let's say "tbl", and I would like to
> > >>> > > query it using Hive. Therefore I mapped the table to Hive as follows:
> > >>> > >
> > >>> > > CREATE EXTERNAL TABLE tbl(id string, data map<string,string>)
> > >>> > > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> > >>> > > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:")
> > >>> > > TBLPROPERTIES("hbase.table.name" = "tbl");
> > >>> > >
> > >>> > > Queries like: "select * from tbl", "select id from tbl", "select
> > >>> > > id, data from tbl" are really fast.
> > >>> > > But queries like "select id from tbl where substr(id, 0, 5) = "12345""
> > >>> > > or "select id from tbl where data["777"] IS NOT NULL" are
> > >>> > > incredibly slow.
> > >>> > >
> > >>> > > On the contrary, when running from the HBase shell: "scan 'tbl', {
> > >>> > > COLUMNS=>'data', STARTROW=>'12345', ENDROW=>'12346'}" or "scan
> > >>> > > 'tbl', { COLUMNS=>'data', "FILTER" =>
> > >>> > > FilterList.new([qualifierFilter('777')])}"
> > >>> > > it is lightning fast!
> > >>> > >
> > >>> > > When I looked into the mapred job generated by Hive on the
> > >>> > > jobtracker I discovered that "map.input.records" counts ALL the
> > >>> > > items in the HBase table, meaning the job makes a full table scan
> > >>> > > before it even starts any mappers!!
> > >>> > > Moreover, I suspect it copies all the data from the HBase table to
> > >>> > > HDFS, into the mapper tmp input folder, before execution.
> > >>> > >
> > >>> > > So, my questions are: Why does the HBase storage handler for Hive
> > >>> > > not translate Hive queries into appropriate HBase functions? Why
> > >>> > > does it scan all the records and then slice them using the "where"
> > >>> > > clause? How can it be improved? Is Pig's integration better in this
> > >>> > > case?
> > >>> > >
> > >>> > >
> > >>> > > Some additional information about the tables:
> > >>> > > Table description in Hbase:
> > >>> > > jruby-1.6.2 :011 >   describe 'tbl'
> > >>> > > DESCRIPTION                                          ENABLED
> > >>> > >  {NAME => 'users', FAMILIES => [{NAME => 'data',     true
> > >>> > >  BLOOMFILTER => 'ROWCOL', REPLICATION_SCOPE => '0',
> > >>> > >  COMPRESSION => 'LZO', VERSIONS => '3', TTL =>
> > >>> > >  '2147483647', BLOCKSIZE => '65536', IN_MEMORY =>
> > >>> > >  'false', BLOCKCACHE => 'true'}]}
> > >>> > >
> > >>> > > Table description in Hive:
> > >>> > > hive> describe tbl;
> > >>> > > OK
> > >>> > > id      string                  from deserializer
> > >>> > > data    map<string,string>      from deserializer
> > >>> > > Time taken: 0.08 seconds
> > >>> > >
> > >>> > > Best regards,
> > >>> > >   Matthew Tovbin =)
> > >>> > >
> > >>>
> > >>
> > >>
> > >
> > >
> > >
>
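
The prefix scan discussed in this thread can be written as a row-key range, which is the form the HBase shell command in the original message uses (start row '12345', end row '12346'). Below is a minimal sketch in Python of how such a range can be derived from a prefix, assuming row keys are compared as raw bytes. The name prefix_to_range is hypothetical and is not part of HBase or Hive.

```python
from typing import Tuple


def prefix_to_range(prefix: bytes) -> Tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering every key that begins with prefix.

    stop_row is exclusive. An empty stop_row means "no upper bound".
    Hypothetical helper for illustration only.
    """
    stop = bytearray(prefix)
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    while stop and stop[-1] == 0xFF:
        stop.pop()
    # Increment the last remaining byte to get the smallest key
    # that sorts after every key starting with the prefix.
    if stop:
        stop[-1] += 1
    return prefix, bytes(stop)


start_row, stop_row = prefix_to_range(b"12345")
print(start_row, stop_row)  # b'12345' b'12346'
```

On the Hive side, the same range corresponds to a predicate of the form id >= '12345' AND id < '12346'. Whether a given Hive build actually pushes such a predicate down to HBase depends on the patches discussed above (HIVE-1226, HIVE-1643), so treat this as an illustration of the idea rather than a guaranteed optimization.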
