hadoop-pig-dev mailing list archives

From "Vincent BARAT (JIRA)" <j...@apache.org>
Subject [jira] Commented: (PIG-1029) HBaseStorage is way too slow to be usable
Date Wed, 10 Feb 2010 17:14:27 GMT

    [ https://issues.apache.org/jira/browse/PIG-1029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832075#action_12832075 ]

Vincent BARAT commented on PIG-1029:

OK, I found the answer: the HBase scanner used to load the HBase table uses the default
HBase caching policy (see HTable.setScannerCaching()).
For me it is set to 1 (and I don't know whether I can change this using HBase config files). If
I set it to, say, 1000 by modifying HBaseSlicer(), the load is about 10 times faster.
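
To illustrate why the caching value matters, here is a rough sketch of how many scanner fetches are needed to read a table at different caching values. It is a simplified model, not HBase code: it assumes each fetch returns at most "caching" rows, and the class name, method name and the row count of 1,000,000 are made up for the example. The observed gain (about 10 times) is much smaller than the reduction in fetches, presumably because other costs remain.

```java
public class ScannerRoundTrips {

    /**
     * Number of fetches needed to stream `rows` rows when each fetch
     * returns at most `caching` rows (simplified model, assumption).
     */
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching < 1) {
            throw new IllegalArgumentException("rows must be >= 0 and caching >= 1");
        }
        // Ceiling division without floating point.
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical table size
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching=" + caching + " -> "
                    + roundTrips(rows, caching) + " fetches");
        }
    }
}
```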

Of course, the appropriate cache size depends on the size of the table rows, so it is not
possible to hard-code a value in HBaseSlicer().

Even if this cache size can be configured globally using configuration files, I think
HBaseStorage() should take an additional (perhaps optional) parameter that sets the cache
size for the scanned table.
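
As a sketch of what such an optional parameter could look like, the helper below parses a caching value from a string argument and falls back to a default when the argument is absent. This is only an illustration, not the actual patch: the class CachingOption, the method parseCaching and the way the argument is passed are all hypothetical. In a real loader the resulting value would be handed to HTable.setScannerCaching().

```java
public class CachingOption {

    /** Default observed in this report ("For me it is set to 1"). */
    static final int DEFAULT_CACHING = 1;

    /**
     * Hypothetical helper: returns the caching value given as a string,
     * or `fallback` when the argument is missing or blank.
     */
    static int parseCaching(String arg, int fallback) {
        if (arg == null || arg.trim().isEmpty()) {
            return fallback; // parameter omitted: keep the default
        }
        int value = Integer.parseInt(arg.trim());
        if (value < 1) {
            throw new IllegalArgumentException("caching must be >= 1, got " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        // Use the first command-line argument if present, else the default.
        String arg = args.length > 0 ? args[0] : null;
        System.out.println("scanner caching = " + parseCaching(arg, DEFAULT_CACHING));
    }
}
```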

If you agree, I propose to write the patch and submit it for integration into Pig.

> HBaseStorage is way too slow to be usable
> -----------------------------------------
>                 Key: PIG-1029
>                 URL: https://issues.apache.org/jira/browse/PIG-1029
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: 0.4.0
>            Reporter: Vincent BARAT
> I have performed a set of benchmarks on the HBaseStorage loader, using PIG 0.4.0 and HBase
> 0.20.0 (using the patch referred to in https://issues.apache.org/jira/browse/PIG-970) and Hadoop
> The HBaseStorage loader is basically 10x slower than the PigStorage loader.
> To bypass this limitation, I had to read my HBase tables, write them to a Hadoop file
> and then use this file as input for my subsequent computations.
> I am reporting this bug for the record; I will try to see if I can optimise this a bit.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
