apex-dev mailing list archives

From Bhupesh Chawda <bhup...@datatorrent.com>
Subject Re: Adding features to HBase Input Operators in Malhar-contrib
Date Thu, 24 Mar 2016 12:40:27 GMT
Dear Community,

Can anyone help review the pull request:
https://github.com/apache/incubator-apex-malhar/pull/212

Thanks.

~Bhupesh

On Thu, Mar 17, 2016 at 4:16 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
wrote:

> Hi,
>
> I have opened a pull request for the changes as described in the previous
> emails. Here is the pull request:
> https://github.com/apache/incubator-apex-malhar/pull/212
>
> Here is a short description of the changes:
>
> HBaseInputOperator - Takes care of HBaseStore and its connection. Got rid
> of HBaseOperatorBase.
> HBaseScanOperator - Takes care of scanning the table in a non-blocking
> manner. Exposes operationScan() and getTuple() as before.
> HBasePOJOInputOperator - Implements operationScan() and getTuple() and
> outputs a POJO on the output port.
>
> Please help review these changes.
>
> Thanks
> ~Bhupesh
>
> On Fri, Mar 11, 2016 at 4:42 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
> wrote:
>
>> Hi All,
>>
>> In the current design of HBase input and output operators, the row key is
>> hard-coded to be of String type.
>> I foresee the following issue:
>>
>>    - In case of numeric keys which are cast to String, *incremental
>>    read* is problematic. For example, after reading key = 9, we may not
>>    be able to read any record with, say, key = 8888: even though numerically
>>    8888 > 9, lexicographically "9" > "8888".
>>    - This is the case only when data is being written to HBase and being
>>    read from it simultaneously.
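
The ordering problem described above can be reproduced in a few lines. This is a minimal sketch, assuming row keys are compared as unsigned bytes from left to right, which is how HBase orders row keys. The class and helper names (RowKeyOrderSketch, stringKey, longKey) are hypothetical and not part of Malhar. The fixed-width big-endian encoding is shown only as one possible alternative; it keeps non-negative keys in numeric order.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class RowKeyOrderSketch {
    // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        // Equal common prefix: the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    }

    // Numeric key stored as its decimal string (the current hard-coded approach).
    static byte[] stringKey(long v) {
        return Long.toString(v).getBytes(StandardCharsets.UTF_8);
    }

    // Numeric key stored as 8 big-endian bytes (hypothetical alternative).
    // Note: negative values would sort after positive ones with this encoding.
    static byte[] longKey(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    public static void main(String[] args) {
        // "9" sorts after "8888", so a reader positioned at 9 never sees 8888.
        System.out.println(compareUnsigned(stringKey(9), stringKey(8888)) > 0);
        // With the fixed-width encoding, 9 sorts before 8888 as expected.
        System.out.println(compareUnsigned(longKey(9), longKey(8888)) < 0);
    }
}
```

Both lines print true: the string encoding breaks numeric order, while the fixed-width encoding preserves it for non-negative values.
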
>>
>> My suggestion is to parametrize the type of row key in the HBase input
>> and output operators, and let the user instantiate the required type for
>> row key. We can have default implementations for String and/or Long. By
>> parametrizing the row key type, the user can even use complex row keys
>> which are a combination of multiple fields.
>>
>> Thoughts?
>>
>> PS: I understand that there is a performance concern in making a
>> monotonically increasing key as the row key. Given that, how do we address
>> the incremental read scenario?
>>
>> Thanks
>>
>> -Bhupesh
>>
>> On Wed, Dec 30, 2015 at 7:49 PM, Sandeep Deshmukh <
>> sandeep@datatorrent.com> wrote:
>>
>>> Looks fine to me.
>>>
>>> Regards,
>>> Sandeep
>>>
>>> On Wed, Dec 30, 2015 at 7:34 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
>>> wrote:
>>>
>>> > Here is the final hierarchy I am considering:
>>> >
>>> > HBaseInputOperator - Takes care of HBaseStore and its connection. Got
>>> > rid of HBaseOperatorBase.
>>> >     HBaseScanOperator - Takes care of scanning the table in a
>>> >     non-blocking manner. Exposes operationScan() and getTuple() as before.
>>> >         HBasePOJOInputOperator - Implements operationScan() and
>>> >         getTuple() and outputs a POJO on the output port.
>>> >
>>> > Comments?
>>> >
>>> > -Bhupesh
>>> >
>>> >
>>> > On Wed, Dec 30, 2015 at 2:52 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
>>> > wrote:
>>> >
>>> > > The class HBaseInputOperator seems to be quite old. HBaseStore seems to
>>> > > have all the functionality provided by HBaseInputOperator and even more
>>> > > (including Kerberos authentication).
>>> > >
>>> > > It would be a good idea to avoid the usage of HBaseInputOperator going
>>> > > forward and use HBaseStore instead.
>>> > >
>>> > > I will also work on abstracting out the HBase input functionality in the
>>> > > HBaseInputOperator, which can be extended by concrete implementations.
>>> > >
>>> > > -Bhupesh
>>> > >
>>> > > On Wed, Dec 23, 2015 at 7:47 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
>>> > > wrote:
>>> > >
>>> > >> Thanks for the inputs.
>>> > >> As an input operator, I am targeting just the Scan operation. Get
>>> > >> operation may be supported better as a generic operator (like a query
>>> > >> operator) which I can take up later.
>>> > >>
>>> > >> -Bhupesh
>>> > >>
>>> > >> On Tue, Dec 22, 2015 at 3:48 PM, Mohit Jotwani <mohit@datatorrent.com>
>>> > >> wrote:
>>> > >>
>>> > >>> +1
>>> > >>>
>>> > >>> Regards,
>>> > >>> Mohit
>>> > >>>
>>> > >>> On Tue, Dec 22, 2015 at 11:21 AM, Chinmay Kolhatkar <chinmay@datatorrent.com>
>>> > >>> wrote:
>>> > >>>
>>> > >>> > +1 for above.
>>> > >>> > I see that there is an HbaseGetOperator, but it is abstract and I
>>> > >>> > cannot find any concrete implementation of it.
>>> > >>> > Are you going to implement that too?
>>> > >>> >
>>> > >>> > Maybe the concrete implementation of HbaseGetOperator should have
>>> > >>> > this.
>>> > >>> >
>>> > >>> > Also, I want to mention one thing about scan from my previous
>>> > >>> > experience with HBase. The HBase client is synchronous.
>>> > >>> > This means that when you fire a scan call, the function blocks until
>>> > >>> > a certain number of records are received at the client end.
>>> > >>> > This causes a lot of problems in the current thread, as it might just
>>> > >>> > get blocked for a long period of time.
>>> > >>> > Plus, there is always network-related latency to add to the problem.
>>> > >>> >
>>> > >>> > Usually the way to deal with this is to fire scan-like queries on a
>>> > >>> > separate thread and then consume the results in the main thread.
>>> > >>> >
>>> > >>> > Please take care of this scenario while implementing the scan
>>> > >>> > operator.
>>> > >>> > -Chinmay.
>>> > >>> >
>>> > >>> >
>>> > >>> > ~ Chinmay.
>>> > >>> >
>>> > >>> > On Tue, Dec 22, 2015 at 11:08 AM, Sandeep Deshmukh <
>>> > >>> > sandeep@datatorrent.com>
>>> > >>> > wrote:
>>> > >>> >
>>> > >>> > > +1 for this Bhupesh.
>>> > >>> > >
>>> > >>> > > Additionally, I would suggest adding support for:
>>> > >>> > > 1. Point query
>>> > >>> > > 2. Returning any row version
>>> > >>> > >
>>> > >>> > > The above two are key features of HBase and should be supported.
>>> > >>> > >
>>> > >>> > > Regards,
>>> > >>> > > Sandeep
>>> > >>> > >
>>> > >>> > > On Fri, Dec 18, 2015 at 4:39 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
>>> > >>> > > wrote:
>>> > >>> > >
>>> > >>> > > > Hi All,
>>> > >>> > > >
>>> > >>> > > > The current HBasePOJOInputOperator has the following limitations:
>>> > >>> > > >
>>> > >>> > > >    1. We cannot specify a set of "column family: column" pairs and
>>> > >>> > > >    fetch data only for these columns.
>>> > >>> > > >    2. Output format is currently a POJO. We need to have other output
>>> > >>> > > >    formats such that "columnFamily:column" representation is
>>> > >>> > > >    supported. Map / CSV are some of the options.
>>> > >>> > > >    3. We cannot specify an "end row-key" to stop scanning a table.
>>> > >>> > > >    4. No metrics.
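
Items 1 to 3 above can be illustrated without an HBase cluster. In this rough sketch a TreeMap stands in for a table sorted by row key, and each row is a Map keyed by "columnFamily:column". The class and method names (ScanSpecSketch, parseColumns, scan) are hypothetical. The end row key is treated as exclusive, which matches the default behaviour of an HBase scan stop row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

class ScanSpecSketch {
    // Parses "cf1:a, cf2:b" into normalized "family:qualifier" keys.
    static List<String> parseColumns(String spec) {
        List<String> cols = new ArrayList<>();
        for (String item : spec.split(",")) {
            String[] parts = item.split(":", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Expected family:qualifier, got: " + item);
            }
            String family = parts[0].trim();
            String qualifier = parts[1].trim();
            if (family.isEmpty() || qualifier.isEmpty()) {
                throw new IllegalArgumentException("Expected family:qualifier, got: " + item);
            }
            cols.add(family + ":" + qualifier);
        }
        return cols;
    }

    // Scans rows in [startRow, endRow) and keeps only the requested columns.
    // Each result is a Map keyed by "family:qualifier" (the Map output format).
    static List<Map<String, String>> scan(NavigableMap<String, Map<String, String>> table,
                                          String startRow, String endRow, List<String> cols) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : table.subMap(startRow, true, endRow, false).values()) {
            Map<String, String> projected = new LinkedHashMap<>();
            for (String col : cols) {
                if (row.containsKey(col)) {
                    projected.put(col, row.get(col));
                }
            }
            out.add(projected);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<String, Map<String, String>> table = new TreeMap<>();
        for (String key : new String[] {"row1", "row2", "row3"}) {
            Map<String, String> row = new LinkedHashMap<>();
            row.put("cf:a", key + "-a");
            row.put("cf:b", key + "-b");
            table.put(key, row);
        }
        // Fetch only cf:a, stopping before row3.
        System.out.println(scan(table, "row1", "row3", parseColumns("cf: a")));
    }
}
```
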
>>> > >>> > > >
>>> > >>> > > > I am planning to add the above functionality to the HBase Input
>>> > >>> > > > operators.
>>> > >>> > > > These features may go into the HBaseScanOperator /
>>> > >>> > > > HBasePOJOInputOperator.
>>> > >>> > > >
>>> > >>> > > > Please let me know your comments.
>>> > >>> > > >
>>> > >>> > > > Thanks.
>>> > >>> > > >
>>> > >>> > > > Bhupesh
>>> > >>> > > >
>>> > >>> > >
>>> > >>> >
>>> > >>>
>>> > >>
>>> > >>
>>> > >
>>> >
>>>
>>
>>
>
