hbase-dev mailing list archives

From Nick Dimiduk <ndimi...@gmail.com>
Subject Re: Performance issue in the Join query on the HBase tables
Date Fri, 29 Sep 2017 14:22:47 GMT
Have you considered running Hive/Spark over snapshots of your HBase tables?

If you're seeing network saturation when reading through HBase but not when
reading from HDFS, that makes me think data locality is not being honored.
That might be worth investigating as well.
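
A minimal sketch of the snapshot approach with Hive, assuming a Hive version
whose HBase storage handler supports the hive.hbase.snapshot.name and
hive.hbase.snapshot.restoredir properties. The table name orders_hbase, the
snapshot name orders_snap, the target table orders_orc, and the restore
directory are hypothetical placeholders, not names from this thread:

```sql
-- Taken beforehand in the HBase shell (not in Hive); names are placeholders:
--   snapshot 'orders', 'orders_snap'

-- In the Hive session, read the HBase-backed table from the snapshot files
-- instead of scanning through the region servers.
SET hive.hbase.snapshot.name=orders_snap;
-- Scratch directory on HDFS where the snapshot is restored for reading (assumed path).
SET hive.hbase.snapshot.restoredir=/tmp/hive_hbase_snapshots;

-- Materialize the snapshot contents as a native Hive table; later joins
-- can then run against plain HDFS files.
CREATE TABLE orders_orc STORED AS ORC AS
SELECT * FROM orders_hbase;
```

Repeating this per table with its own snapshot would leave the join itself
running against native Hive tables, at the cost of reading data that is only
as fresh as the snapshot.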

On Fri, Sep 29, 2017 at 3:26 AM wenxing zheng <wenxing.zheng@gmail.com>
wrote:

> Dear all,
>
> I have 3 big HBase tables, each with millions of rows (the rows are synced
> from a MySQL DB via the binlog). For each HBase table, we have a
> corresponding external table in Hive, stored by
> 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. The advantage is that
> we always stay in sync with the production DB and get random access by
> key.
>
> Now our business needs to run some analysis on those tables with join
> queries. What's the best practice for doing this?
>
> In my experiments, I found that with Spark SQL on HBase or Hive, the job
> ran very slowly and saturated the network bandwidth. But Hive SQL works
> very well when run directly against HDFS files (after making a copy of
> the data as HDFS files).
>
> I would appreciate any advice on what the problem might be here, and on
> how to optimize the job.
> Regards, Wenxing
>
