crunch-user mailing list archives

From Gabriel Reid <gabriel.r...@gmail.com>
Subject Re: Question about HBaseSourceTarget
Date Thu, 19 Mar 2015 19:11:47 GMT
Hi Nithin,

Unfortunately, the HBase classes aren't included in the published API docs.
I just took a look at adding them, but it appears to be more complex than I
would have hoped -- I'll create a JIRA ticket to look into this further,
but I won't be able to get to it right away.

In any case, these HBase classes (HBaseFrom, HBaseTo) are in the
org.apache.crunch.io.hbase package in the crunch-hbase module.
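
For reference, here's a rough sketch of how a read through those classes might
look from an MRPipeline. The return type and the placeholder names (MyJob,
conf, "my_table") are assumptions on my part; check the crunch-hbase sources
for the exact signatures.

    import org.apache.crunch.MRPipeline;
    import org.apache.crunch.PTable;
    import org.apache.crunch.io.hbase.HBaseFrom;          // crunch-hbase module
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

    MRPipeline pipeline = new MRPipeline(MyJob.class, conf);
    // Read the whole table; HBaseFrom.table(String, Scan) takes a custom Scan.
    PTable<ImmutableBytesWritable, Result> rows =
        pipeline.read(HBaseFrom.table("my_table"));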

- Gabriel


On Wed, Mar 18, 2015 at 2:16 AM Nithin Asokan <anithin19@gmail.com> wrote:

> Thanks for looking at this everyone.
>
> I can try the suggestion Gabriel posted here. I'm not familiar with the
> HBaseFrom.table(String) API and couldn't find much by searching online. It
> would be really helpful if someone could point me to the API docs.
>
> Thanks everyone!
>
> On Tue, Mar 17, 2015 at 3:34 PM Gabriel Reid <gabriel.reid@gmail.com>
> wrote:
>
>> Yep, that looks like it could be pretty handy -- according to that ticket
>> it's in 0.98.1 as well.
>>
>>
>> On Tue, Mar 17, 2015 at 8:54 PM Josh Wills <jwills@cloudera.com> wrote:
>>
>>> Would this help for 0.99+?
>>>
>>> https://issues.apache.org/jira/browse/HBASE-10413
>>>
>>> On Tue, Mar 17, 2015 at 12:35 PM, Gabriel Reid <gabriel.reid@gmail.com>
>>> wrote:
>>>
>>>> That sounds like it would work pretty well, although the situation
>>>> where a custom Scan is used is still problematic.
>>>>
>>>> I think Hannibal [1] does some clever stuff as far as figuring out data
>>>> size as well (I think just via HBase RPC and not by looking at HDFS);
>>>> there could be some useful ideas in there.
>>>>
>>>> - Gabriel
>>>>
>>>> 1. https://github.com/sentric/hannibal
>>>>
>>>>
>>>> On Tue, Mar 17, 2015 at 5:27 PM Micah Whitacre <mkwhitacre@gmail.com>
>>>> wrote:
>>>>
>>>>> Could we make an estimate based on # of regions *
>>>>> hbase.hregion.max.filesize? The case where this would overestimate
>>>>> would be if someone pre-split a table upon creation. Otherwise, as the
>>>>> table fills up over time, in theory each region would grow and split
>>>>> evenly (and possibly hit max size and therefore split again).
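
A rough sketch of what that estimate could look like against the HBase 1.x
client API (method names vary across versions, and "my_table" is just a
placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf)) {
      // regions * hbase.hregion.max.filesize, as suggested above
      int regionCount =
          conn.getAdmin().getTableRegions(TableName.valueOf("my_table")).size();
      long maxFileSize = conf.getLong("hbase.hregion.max.filesize",
          10L * 1024 * 1024 * 1024);   // default differs by HBase version
      long estimatedBytes = (long) regionCount * maxFileSize;
    }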
>>>>>
>>>>> On Tue, Mar 17, 2015 at 11:20 AM, Josh Wills <jwills@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> Also open to suggestion here -- this has annoyed me for some time (as
>>>>>> Gabriel pointed out), but I don't have a good fix for it.
>>>>>>
>>>>>> On Tue, Mar 17, 2015 at 9:10 AM, Gabriel Reid <gabriel.reid@gmail.com
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi Nithin,
>>>>>>>
>>>>>>> This is a long-standing issue in Crunch (I think it's been present
>>>>>>> since Crunch was originally open-sourced). I'd love to get this fixed
>>>>>>> somehow, although it seems to not be trivial to do -- it can be
>>>>>>> difficult to accurately estimate the size of data that will come from
>>>>>>> an HBase table, especially considering that filters and selections of
>>>>>>> a subset of columns can be done on an HBase table.
>>>>>>>
>>>>>>> One short-term way of working around this is to add a simple identity
>>>>>>> function directly after the HBaseSourceTarget that implements the
>>>>>>> scaleFactor method to manipulate the calculated size of the HBase
>>>>>>> data, but this is just another hack.
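
A rough sketch of what that identity-function hack could look like, using a
Crunch MapFn that overrides scaleFactor() (the class name and the 50x figure
are illustrative, not part of the Crunch API):

    import org.apache.crunch.MapFn;

    // Identity function whose only job is to nudge the planner's size estimate.
    public class SizeHintFn<T> extends MapFn<T, T> {
      private final float scale;

      public SizeHintFn(float scale) {
        this.scale = scale;
      }

      @Override
      public T map(T input) {
        return input;         // records pass through unchanged
      }

      @Override
      public float scaleFactor() {
        return scale;         // e.g. 50.0f => "expect ~50x the source size"
      }
    }

    // Usage: hbaseRows.parallelDo(new SizeHintFn<...>(50.0f), hbaseRows.getPType())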
>>>>>>>
>>>>>>> Maybe the better solution would be to estimate the size of the HBase
>>>>>>> table based on its size on HDFS when using the HBaseFrom.table(String)
>>>>>>> method, and then also overload the HBaseFrom.table(String, Scan)
>>>>>>> method to also take a long value which is the estimated byte size (or
>>>>>>> perhaps scale factor) of the table content that is expected to be
>>>>>>> returned from the given Scan.
>>>>>>>
>>>>>>> Any thoughts on either of these?
>>>>>>>
>>>>>>> - Gabriel
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 17, 2015 at 1:51 PM Nithin Asokan <anithin19@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>> I came across a unique behavior while using HBaseSourceTarget.
>>>>>>>> Suppose I have a job (from MRPipeline) that reads from HBase using
>>>>>>>> HBaseSourceTarget and passes all the data to a reduce phase; the
>>>>>>>> number of reducers set by the planner will be equal to 1. The reason
>>>>>>>> is [1]. So, it looks like the planner assumes there is only about
>>>>>>>> 1GB of data that's read from the source, and sets the number of
>>>>>>>> reducers accordingly. However, let's say my HBase scan is returning
>>>>>>>> very little data or huge amounts of data. The planner still assigns
>>>>>>>> 1 reducer (crunch.bytes.per.reduce.task=1GB). What's more
>>>>>>>> interesting is, if there are dependent jobs, the planner will set
>>>>>>>> the number of reducers based on the initially determined size from
>>>>>>>> the HBase source.
>>>>>>>> As a fix for the above problem, I can set the number of reducers on
>>>>>>>> the groupByKey(), but that does not offer much flexibility when
>>>>>>>> dealing with data of varying sizes. The other option is to have a
>>>>>>>> map-only job that reads from HBase and writes to HDFS, and have a
>>>>>>>> run(). The next job will determine the size right, since
>>>>>>>> FileSourceImpl calculates the size on disk.
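
For reference, the first workaround boils down to something like the sketch
below (the helper name, key/value types, and the reducer count of 50 are all
illustrative):

    import org.apache.crunch.PGroupedTable;
    import org.apache.crunch.PTable;

    static <K, V> PGroupedTable<K, V> groupWithFixedReducers(PTable<K, V> table) {
      // Pin the reducer count explicitly instead of relying on the planner's
      // (underestimated) size for the HBase source.
      return table.groupByKey(50);
    }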
>>>>>>>>
>>>>>>>> I noticed the comment on HBaseSourceTarget, and was wondering if
>>>>>>>> there was anything planned to have it implemented.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>>
>>>>>>>> https://github.com/apache/crunch/blob/apache-crunch-0.8.4/crunch-hbase/src/main/java/org/apache/crunch/io/hbase/HBaseSourceTarget.java#L173
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Nithin
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera <http://www.cloudera.com>
>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>>>
>>>>>
>>>>>
>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
