crunch-user mailing list archives

From Josh Wills
Subject Re: Question about HBaseSourceTarget
Date Tue, 17 Mar 2015 16:20:59 GMT
I'm also open to suggestions here. This has annoyed me for some time (as
Gabriel pointed out), but I don't have a good fix for it.

On Tue, Mar 17, 2015 at 9:10 AM, Gabriel Reid wrote:

> Hi Nithin,
>
> This is a long-standing issue in Crunch (I think it's been present since
> Crunch was originally open-sourced). I'd love to get this fixed somehow,
> although it seems to not be trivial to do -- it can be difficult to
> accurately estimate the size of data that will come from an HBase table,
> especially considering that filters and selections of a subset of columns
> can be done on an HBase table.
>
> One short-term way of working around this is to add a simple identity
> function directly after the HBaseSourceTarget that implements the
> scaleFactor method to manipulate the calculated size of the HBase data, but
> this is just another hack.
>
> Maybe the better solution would be to estimate the size of the HBase table
> based on its size on HDFS when using the HBaseFrom.table(String) method,
> and then also overload the HBaseFrom.table(String, Scan) method to also
> take a long value which is the estimated byte size (or perhaps scale
> factor) of the table content that is expected to be returned from the given
> Scan.
>
> Any thoughts on either of these?
>
> - Gabriel
>
> On Tue, Mar 17, 2015 at 1:51 PM Nithin Asokan wrote:
>> Hello,
>>
>> I came across some unusual behavior while using HBaseSourceTarget. Suppose
>> I have a job (from MRPipeline) that reads from HBase using
>> HBaseSourceTarget and passes all the data to a reduce phase. The number of
>> reducers set by the planner will be equal to 1. The reason is [1]. So, it
>> looks like the planner assumes there is only about 1GB of data that's read
>> from the source, and sets the number of reducers accordingly. However,
>> whether my HBase scan returns very little data or huge amounts of data,
>> the planner still assigns 1 reducer (crunch.bytes.per.reduce.task=1GB).
>> What's more interesting is that, if there are dependent jobs, the planner
>> will set the number of reducers based on the initially determined size
>> from the HBase source.
>>
>> As a fix for the above problem, I can set the number of reducers on the
>> groupByKey(), but that does not offer much flexibility when dealing with
>> data of varying sizes. The other option is to have a map-only job that
>> reads from HBase and writes to HDFS, followed by a run(). The next job
>> will determine the size correctly, since FileSourceImpl calculates the
>> size on disk.
>>
>> I noticed the comment on HBaseSourceTarget, and was wondering if there was
>> anything planned to have it implemented.
>>
>> [1]
>>
>> Thanks
>> Nithin

Director of Data Science
Cloudera
Twitter: @josh_wills
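
The sizing arithmetic discussed in this thread can be illustrated with a small standalone model. This is only a sketch and not Crunch's planner code: the class and method names are hypothetical, the ceiling-division rule is an assumption chosen to match the single reducer Nithin reports, and the 1 GB figures are the values quoted in the thread. It shows why a fixed source estimate always yields the same reducer count, and how a scale factor on an identity step (Gabriel's workaround) would change the estimate seen by the planner.

```java
// Hypothetical model of the reducer estimate; none of these names are Crunch APIs.
final class ReducerEstimate {
    // Assumed value of crunch.bytes.per.reduce.task, as quoted in the thread (1 GB).
    static final long BYTES_PER_REDUCE_TASK = 1_000_000_000L;

    // Assumed placeholder size reported by the HBase source, regardless of the real scan size.
    static final long FIXED_SOURCE_ESTIMATE = 1_000_000_000L;

    // Assumed planner rule: one reducer per BYTES_PER_REDUCE_TASK, rounded up, never below one.
    static int recommendedReducers(long estimatedBytes) {
        long n = (estimatedBytes + BYTES_PER_REDUCE_TASK - 1) / BYTES_PER_REDUCE_TASK;
        return (int) Math.max(1L, n);
    }

    // Estimated output size of a step: input size multiplied by that step's scale factor.
    static long scaledSize(long inputBytes, float scaleFactor) {
        return (long) (inputBytes * (double) scaleFactor);
    }

    public static void main(String[] args) {
        // Without a workaround, the placeholder estimate is all the planner sees.
        System.out.println("no scale factor: "
                + recommendedReducers(FIXED_SOURCE_ESTIMATE));

        // An identity step declaring a scale factor of 50 inflates the estimate,
        // for a scan that the user knows returns roughly 50 GB.
        long inflated = scaledSize(FIXED_SOURCE_ESTIMATE, 50.0f);
        System.out.println("scale factor 50: " + recommendedReducers(inflated));
    }
}
```

Under these assumptions the model gives 1 reducer for the placeholder estimate and 50 once the identity step declares a scale factor of 50. The scale factor still has to be chosen by hand for each scan, which is why Gabriel calls the workaround a hack.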
