hive-issues mailing list archives

From "BELUGA BEHR (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-20484) Disable Block Cache By Default With HBase SerDe
Date Thu, 31 Jan 2019 17:19:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-20484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

BELUGA BEHR updated HIVE-20484:
-------------------------------
    Description: 
{quote}
Scan instances can be set to use the block cache in the RegionServer via the setCacheBlocks
method. For input Scans to MapReduce jobs, this should be false. 

https://hbase.apache.org/book.html#perf.hbase.client.blockcache
{quote}

However, from the Hive code, we can see that this is not the case.

{code}
public static final String HBASE_SCAN_CACHEBLOCKS = "hbase.scan.cacheblock";

...

String scanCacheBlocks = tableProperties.getProperty(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS);
if (scanCacheBlocks != null) {
  jobProperties.put(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS, scanCacheBlocks);
}

...

String scanCacheBlocks = jobConf.get(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS);
if (scanCacheBlocks != null) {
  scan.setCacheBlocks(Boolean.parseBoolean(scanCacheBlocks));
}
{code}

In the Hive code, we can see that if {{hbase.scan.cacheblock}} is not specified in the {{SERDEPROPERTIES}}
then {{setCacheBlocks}} is not called and the default value of the HBase {{Scan}} class is
used.

{code:java|title=Scan.java}
  /**
   * Set whether blocks should be cached for this Scan.
   * <p>
   * This is true by default.  When true, default settings of the table and
   * family are used (this will never override caching blocks if the block
   * cache is disabled for that family or entirely).
   *
   * @param cacheBlocks if false, default settings are overridden and blocks
   * will not be cached
   */
  public Scan setCacheBlocks(boolean cacheBlocks) {
    this.cacheBlocks = cacheBlocks;
    return this;
  }
{code}

Hive performs full scans of the table with MapReduce/Spark, so, according to the HBase docs,
blocks should not be cached by default.  Hive should set this value to "false" by default
unless the table {{SERDEPROPERTIES}} override it.
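The difference between the behavior described above and the proposed default can be sketched as follows. This is only an illustration, not the contents of the attached patches: {{StubScan}} is a hypothetical stand-in for the HBase {{Scan}} class that models nothing but the {{cacheBlocks}} flag, and {{configureCurrent}} / {{configureProposed}} are made-up names. The property key is the one quoted from {{HBaseSerDe}}.

```java
import java.util.Properties;

final class CacheBlocksDefaultSketch {
  // Same key as HBaseSerDe.HBASE_SCAN_CACHEBLOCKS quoted in this issue.
  static final String HBASE_SCAN_CACHEBLOCKS = "hbase.scan.cacheblock";

  // Hypothetical stand-in for org.apache.hadoop.hbase.client.Scan.
  // Only the cacheBlocks flag is modeled; it defaults to true, as the
  // Scan.java Javadoc states.
  static final class StubScan {
    private boolean cacheBlocks = true;

    StubScan setCacheBlocks(boolean cacheBlocks) {
      this.cacheBlocks = cacheBlocks;
      return this;
    }

    boolean getCacheBlocks() {
      return cacheBlocks;
    }
  }

  // Behavior described in this issue: the setter is skipped when the
  // property is absent, so the Scan default (true) stays in effect.
  static StubScan configureCurrent(Properties props) {
    StubScan scan = new StubScan();
    String scanCacheBlocks = props.getProperty(HBASE_SCAN_CACHEBLOCKS);
    if (scanCacheBlocks != null) {
      scan.setCacheBlocks(Boolean.parseBoolean(scanCacheBlocks));
    }
    return scan;
  }

  // Proposed behavior (assumed shape): always call the setter and fall
  // back to "false" when the property is absent.
  static StubScan configureProposed(Properties props) {
    String scanCacheBlocks = props.getProperty(HBASE_SCAN_CACHEBLOCKS, "false");
    return new StubScan().setCacheBlocks(Boolean.parseBoolean(scanCacheBlocks));
  }

  public static void main(String[] args) {
    Properties unset = new Properties();
    Properties enabled = new Properties();
    enabled.setProperty(HBASE_SCAN_CACHEBLOCKS, "true");

    System.out.println(configureCurrent(unset).getCacheBlocks());    // true
    System.out.println(configureProposed(unset).getCacheBlocks());   // false
    System.out.println(configureProposed(enabled).getCacheBlocks()); // true
  }
}
```

With the property unset, the first variant leaves caching on and the second turns it off; an explicit "true" in the properties still enables caching in both.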

{code:sql}
-- Commands for HBase
-- create 'test', 't'

CREATE EXTERNAL TABLE test(value map<string,string>, row_key string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "t:,:key",
"hbase.scan.cacheblock" = "true"
);
{code}

  was:
{quote}
Scan instances can be set to use the block cache in the RegionServer via the setCacheBlocks
method. For input Scans to MapReduce jobs, this should be false. 

https://hbase.apache.org/book.html#perf.hbase.client.blockcache
{quote}

However, from the Hive code, we can see that this is not the case.

{code}
public static final String HBASE_SCAN_CACHEBLOCKS = "hbase.scan.cacheblock";

...

String scanCacheBlocks = tableProperties.getProperty(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS);
if (scanCacheBlocks != null) {
  jobProperties.put(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS, scanCacheBlocks);
}

...

String scanCacheBlocks = jobConf.get(HBaseSerDe.HBASE_SCAN_CACHEBLOCKS);
if (scanCacheBlocks != null) {
  scan.setCacheBlocks(Boolean.parseBoolean(scanCacheBlocks));
}
{code}

In the Hive code, we can see that if {{hbase.scan.cacheblock}} is not specified in the {{SERDEPROPERTIES}}
then {{setCacheBlocks}} is not called and the default value of the HBase {{Scan}} class is
used.

{code:java|title=Scan.java}
  /**
   * Set whether blocks should be cached for this Scan.
   * <p>
   * This is true by default.  When true, default settings of the table and
   * family are used (this will never override caching blocks if the block
   * cache is disabled for that family or entirely).
   *
   * @param cacheBlocks if false, default settings are overridden and blocks
   * will not be cached
   */
  public Scan setCacheBlocks(boolean cacheBlocks) {
    this.cacheBlocks = cacheBlocks;
    return this;
  }
{code}

Hive performs full scans of the table with MapReduce/Spark, so, according to the HBase docs,
blocks should not be cached by default.  Hive should set this value to "false" by default
unless the table {{SERDEPROPERTIES}} override it.

{code:sql}
-- Commands for HBase
-- create 'test', 't'

CREATE EXTERNAL TABLE test(value map<string,string>, row_key string) 
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "t:,:key",
"hbase.scan.cacheblock" = "false"
);
{code}


> Disable Block Cache By Default With HBase SerDe
> -----------------------------------------------
>
>                 Key: HIVE-20484
>                 URL: https://issues.apache.org/jira/browse/HIVE-20484
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>    Affects Versions: 1.2.3, 2.4.0, 4.0.0, 3.2.0
>            Reporter: BELUGA BEHR
>            Assignee: BELUGA BEHR
>            Priority: Major
>             Fix For: 4.0.0, 3.2.0
>
>         Attachments: HIVE-20484.1.patch, HIVE-20484.2.patch, HIVE-20484.3.patch, HIVE-20484.4.patch, HIVE-20484.5.patch
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
