hbase-dev mailing list archives

From wenxing zheng <wenxing.zh...@gmail.com>
Subject Re: Performance issue in the Join query on the HBase tables
Date Sat, 30 Sep 2017 01:56:54 GMT
@Eric: I will take a look at Trafodion.

@Nick: As for Hive/Spark over snapshots, I just tried Hive over HBase
snapshots, and the select count query is much faster than with Hive
directly over HBase. Since the HBase tables are all so big, how can I make
the engine respect data locality?

Thank you very much,



On Fri, Sep 29, 2017 at 10:22 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:

> Have you considered running Hive/Spark over snapshots of your HBase tables?
>
> If you're seeing network saturation over HBase but not hdfs, makes me think
> data locality is not being honored. Might be worth investigating as well.
>
> On Fri, Sep 29, 2017 at 3:26 AM wenxing zheng <wenxing.zheng@gmail.com>
> wrote:
>
> > Dear all,
> >
> > I have 3 big HBase tables, each with millions of rows (the rows are
> > synced from a MySQL DB via the binlog). For each HBase table, we have a
> > corresponding external table in Hive, stored by
> > 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. The advantage is
> > that we always stay in sync with the production DB and get random
> > access by key.
> >
> > Now our business needs to run some analysis on those tables with join
> > queries. What's the best practice for doing this?
> >
> > From my experiments, I found that with Spark SQL on HBase or Hive, the
> > job runs very slowly and saturates the network bandwidth. But Hive SQL
> > works very well when run directly against HDFS files (after making a
> > copy of the data as HDFS files).
> >
> > I would appreciate any advice on what the problem might be here, and on
> > how to optimize the job.
> > Regards, Wenxing
> >
>
