crunch-user mailing list archives

From Josh Wills <jwi...@cloudera.com>
Subject Re: Question about HBaseSourceTarget
Date Tue, 17 Mar 2015 19:53:28 GMT
Would this help for 0.99+?

https://issues.apache.org/jira/browse/HBASE-10413

On Tue, Mar 17, 2015 at 12:35 PM, Gabriel Reid <gabriel.reid@gmail.com>
wrote:

> That sounds like it would work pretty well, although the situation where a
> custom Scan is used is still problematic.
>
> I think Hannibal [1] does some clever stuff as far as figuring out data
> size as well (I think just via HBase RPC and not by looking at HDFS);
> there could be some useful ideas in there.
>
> - Gabriel
>
> 1. https://github.com/sentric/hannibal
>
>
> On Tue, Mar 17, 2015 at 5:27 PM Micah Whitacre <mkwhitacre@gmail.com>
> wrote:
>
>> Could we make an estimate based on # of regions * hbase.hregion.max.filesize?
>> The case where this would overestimate would be if someone pre-split a
>> table upon creation. Otherwise, as the table fills up over time, in theory
>> each region would grow and split evenly (and possibly hit the max size and
>> therefore split again).
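A rough sketch of the estimate Micah describes (number of regions times hbase.hregion.max.filesize), assuming the HBase 1.0+ client API (Connection/RegionLocator); the class and method names here are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class RegionCountSizeEstimate {
  // size ~= #regions * hbase.hregion.max.filesize; as noted above, this
  // overestimates for pre-split, mostly-empty tables.
  public static long estimateBytes(Configuration conf, String table) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(conf);
         RegionLocator locator = conn.getRegionLocator(TableName.valueOf(table))) {
      int numRegions = locator.getAllRegionLocations().size();
      long maxFileSize = conf.getLong("hbase.hregion.max.filesize",
          10L * 1024 * 1024 * 1024);  // 10GB default assumed here
      return (long) numRegions * maxFileSize;
    }
  }
}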
>>
>> On Tue, Mar 17, 2015 at 11:20 AM, Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Also open to suggestion here-- this has annoyed me for some time (as
>>> Gabriel pointed out), but I don't have a good fix for it.
>>>
>>> On Tue, Mar 17, 2015 at 9:10 AM, Gabriel Reid <gabriel.reid@gmail.com>
>>> wrote:
>>>
>>>> Hi Nithin,
>>>>
>>>> This is a long-standing issue in Crunch (I think it's been present
>>>> since Crunch was originally open-sourced). I'd love to get this fixed
>>>> somehow, although it seems to not be trivial to do -- it can be difficult
>>>> to accurately estimate the size of data that will come from an HBase table,
>>>> especially considering that filters and selections of a subset of columns
>>>> can be done on an HBase table.
>>>>
>>>> One short-term way of working around this is to add a simple identity
>>>> function directly after the HBaseSourceTarget that implements the
>>>> scaleFactor method to manipulate the calculated size of the HBase data, but
>>>> this is just another hack.
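A minimal sketch of the identity-function workaround Gabriel describes, relying on Crunch's DoFn.scaleFactor() hook; the class name and the 50.0f figure are made up for illustration:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;

public class ScaleAdjustFn<T> extends DoFn<T, T> {
  private final float scale;

  public ScaleAdjustFn(float scale) {
    this.scale = scale;  // e.g. 50.0f if the Scan returns far more data than the planner assumes
  }

  @Override
  public void process(T input, Emitter<T> emitter) {
    emitter.emit(input);  // identity: pass records through unchanged
  }

  @Override
  public float scaleFactor() {
    return scale;  // the planner multiplies the upstream size estimate by this factor
  }
}

It would be applied directly after the HBase source, e.g. something like
rows.parallelDo(new ScaleAdjustFn<Result>(50.0f), rows.getPType()).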
>>>>
>>>> Maybe the better solution would be to estimate the size of the HBase
>>>> table based on its size on HDFS when using the HBaseFrom.table(String)
>>>> method, and to overload the HBaseFrom.table(String, Scan) method to also
>>>> take a long value which is the estimated byte size (or perhaps scale
>>>> factor) of the table content that is expected to be returned from the
>>>> given Scan.
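For illustration only, a hypothetical shape of that proposed overload from the caller's side (this signature does not exist in Crunch as of this thread; the table name, column family, and 50GB figure are made up):

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("cf"));
PCollection<Result> rows = pipeline.read(
    HBaseFrom.table("my_table", scan, 50L * 1024 * 1024 * 1024));  // caller's estimate of scanned bytes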
>>>>
>>>> Any thoughts on either of these?
>>>>
>>>> - Gabriel
>>>>
>>>>
>>>> On Tue, Mar 17, 2015 at 1:51 PM Nithin Asokan <anithin19@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>> I came across some unexpected behavior while using HBaseSourceTarget.
>>>>> Suppose I have a job (from MRPipeline) that reads from HBase using
>>>>> HBaseSourceTarget and passes all the data to a reduce phase; the number
>>>>> of reducers set by the planner will be equal to 1. The reason is [1].
>>>>> So it looks like the planner assumes only about 1GB of data is read
>>>>> from the source, and sets the number of reducers accordingly. However,
>>>>> whether my HBase scan returns very little data or huge amounts of data,
>>>>> the planner still assigns 1 reducer (crunch.bytes.per.reduce.task=1GB).
>>>>> What's more interesting is that if there are dependent jobs, the planner
>>>>> will set their number of reducers based on the initially determined size
>>>>> from the HBase source.
>>>>>
>>>>> As a fix for the above problem, I can set the number of reducers on the
>>>>> groupByKey(), but that does not offer much flexibility when dealing with
>>>>> data of varying sizes. The other option is to have a map-only job that
>>>>> reads from HBase and writes to HDFS, followed by a run(); the next job
>>>>> will determine the size correctly, since FileSourceImpl calculates the
>>>>> size on disk.
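A sketch of that first workaround, pinning the reducer count at the grouping step; the class and method names here are made up for illustration:

import org.apache.crunch.PGroupedTable;
import org.apache.crunch.PTable;

public class FixedReducersExample {
  // Explicitly set the number of reduce tasks on groupByKey() so the planner's
  // ~1GB estimate for the HBase source no longer drives the reducer count.
  public static <K, V> PGroupedTable<K, V> groupWithFixedReducers(PTable<K, V> table, int numReducers) {
    return table.groupByKey(numReducers);  // e.g. 200 for a large Scan, 5 for a small one
  }
}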
>>>>>
>>>>> I noticed the comment in HBaseSourceTarget and was wondering if there
>>>>> are any plans to implement it.
>>>>>
>>>>> [1]
>>>>>
>>>>> https://github.com/apache/crunch/blob/apache-crunch-0.8.4/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HBaseSourceTarget.java#L173
>>>>>
>>>>> Thanks
>>>>> Nithin
>>>>>
>>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
