apex-dev mailing list archives

From Bhupesh Chawda <bhup...@datatorrent.com>
Subject Re: Adding features to HBase Input Operators in Malhar-contrib
Date Thu, 17 Mar 2016 10:46:02 GMT
Hi,

I have opened a pull request for the changes as described in the previous
emails. Here is the pull request:
https://github.com/apache/incubator-apex-malhar/pull/212

Here is a short description of the changes:

HBaseInputOperator - Takes care of HBaseStore and its connection. Got rid
of HBaseOperatorBase.
HBaseScanOperator - Takes care of scanning the table in a non-blocking
manner. Exposes operationScan() and getTuple() as before.
HBasePOJOInputOperator - Implements operationScan() and getTuple() and
outputs a POJO on the output port.
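
To illustrate what scanning "in a non-blocking manner" can look like, here is a minimal sketch of the general pattern: a worker thread performs the blocking reads and fills a bounded queue, while the operator thread only drains rows that are already buffered. This is not the code in the pull request. The class name BackgroundScanner and its methods are hypothetical, and a plain Iterator stands in for the HBase scanner so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical helper: reads from a (possibly blocking) source on a worker
// thread and buffers rows for the caller. The Iterator is a stand-in for a
// real table scanner.
class BackgroundScanner<T> {
    // Bounded buffer so a slow consumer applies back-pressure to the worker.
    private final BlockingQueue<T> queue = new LinkedBlockingQueue<>(1024);
    private final Thread worker;
    private volatile boolean done = false;

    BackgroundScanner(Iterator<T> source) {
        worker = new Thread(() -> {
            try {
                while (source.hasNext()) {
                    queue.put(source.next()); // blocks only the worker thread
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                done = true; // set after the last row has been queued
            }
        }, "scan-worker");
        worker.setDaemon(true);
    }

    void start() {
        worker.start();
    }

    // Never blocks: returns at most max rows that are already buffered,
    // or an empty list if none are available yet.
    List<T> poll(int max) {
        List<T> batch = new ArrayList<>();
        queue.drainTo(batch, max);
        return batch;
    }

    // True once the worker has stopped and every buffered row was consumed.
    boolean isFinished() {
        return done && queue.isEmpty();
    }

    public static void main(String[] args) {
        BackgroundScanner<String> scanner =
            new BackgroundScanner<>(Arrays.asList("row1", "row2", "row3").iterator());
        scanner.start();
        while (!scanner.isFinished()) {
            for (String row : scanner.poll(2)) {
                System.out.println(row);
            }
        }
    }
}
```

In an operator, the poll() call would sit in the method that emits tuples, so a slow scan only results in fewer tuples in that call instead of a stalled operator thread.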

Please help review these changes.

Thanks
~Bhupesh

On Fri, Mar 11, 2016 at 4:42 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
wrote:

> Hi All,
>
> In the current design of HBase input and output operators, the row key is
> hard-coded to be of String type.
> I foresee the following issue:
>
>    - In case of numeric keys which are cast to String, *incremental
>    read* is problematic. For example, after reading key = 9, we may not
>    be able to read any record with, say, key = 8888: even though numerically
>    8888 > 9, lexicographically "9" > "8888".
>    - This is the case only when data is being written to HBase and being
>    read from simultaneously.
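
The ordering problem above can be reproduced with a small sketch. It assumes that row keys are compared byte by byte as unsigned values, which is how HBase orders rows, and that a numeric key is stored as a fixed-width 8-byte big-endian value. The class name RowKeyOrdering and its helper methods are hypothetical and only serve this illustration. With that encoding, byte order matches numeric order for non-negative keys.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of String row keys versus fixed-width numeric
// row keys under byte-wise (lexicographic) comparison.
class RowKeyOrdering {
    // Encode a long as 8 big-endian bytes (ByteBuffer's default byte order).
    static byte[] encode(long key) {
        return ByteBuffer.allocate(Long.BYTES).putLong(key).array();
    }

    // Lexicographic comparison that treats each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // As strings, "9" sorts after "8888", so a scan that resumes
        // after "9" never sees "8888".
        System.out.println("9".compareTo("8888") > 0);

        // As fixed-width big-endian longs, 9 sorts before 8888.
        System.out.println(compareUnsigned(encode(9L), encode(8888L)) < 0);
    }
}
```

Note that this holds only for non-negative values. Negative longs have the sign bit set and would sort after all positive keys under unsigned byte comparison.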
>
> My suggestion is to parametrize the type of row key in the HBase input and
> output operators, and let the user instantiate the required type for row
> key. We can have default implementations for String and/or Long. By
> parametrizing the row key type, the user can even use complex row keys
> which are a combination of multiple fields.
>
> Thoughts?
>
> PS: I understand that there is a performance concern in making a
> monotonically increasing key as the row key. Given that, how do we address
> the incremental read scenario?
>
> Thanks
>
> -Bhupesh
>
> On Wed, Dec 30, 2015 at 7:49 PM, Sandeep Deshmukh <sandeep@datatorrent.com>
> wrote:
>
>> Looks fine to me.
>>
>> Regards,
>> Sandeep
>>
>> On Wed, Dec 30, 2015 at 7:34 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
>> wrote:
>>
>> > Here is the final hierarchy I am considering:
>> >
>> > HBaseInputOperator - Takes care of HBaseStore and its connection. Got rid
>> > of HBaseOperatorBase.
>> >     HBaseScanOperator - Takes care of scanning the table in a non-blocking
>> > manner. Exposes operationScan() and getTuple() as before.
>> >         HBasePOJOInputOperator - Implements operationScan() and getTuple()
>> > and outputs a POJO on the output port.
>> >
>> > Comments?
>> >
>> > -Bhupesh
>> >
>> >
>> > On Wed, Dec 30, 2015 at 2:52 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
>> > wrote:
>> >
>> > > The class HBaseInputOperator seems to be quite old. HBaseStore seems to
>> > > provide all the functionality of HBaseInputOperator and even more
>> > > (including Kerberos authentication).
>> > >
>> > > It would be a good idea to avoid the usage of HBaseInputOperator going
>> > > forward and use HBaseStore instead.
>> > >
>> > > I will also work on abstracting out the HBase input functionality in the
>> > > HBaseInputOperator, which can be extended by concrete implementations.
>> > >
>> > > -Bhupesh
>> > >
>> > > On Wed, Dec 23, 2015 at 7:47 PM, Bhupesh Chawda <bhupesh@datatorrent.com>
>> > > wrote:
>> > >
>> > >> Thanks for the inputs.
>> > >> As an input operator, I am targeting just the Scan operation. Get
>> > >> operation may be supported better as a generic operator (like a query
>> > >> operator) which I can take up later.
>> > >>
>> > >> -Bhupesh
>> > >>
>> > >> On Tue, Dec 22, 2015 at 3:48 PM, Mohit Jotwani <mohit@datatorrent.com>
>> > >> wrote:
>> > >>
>> > >>> +1
>> > >>>
>> > >>> Regards,
>> > >>> Mohit
>> > >>>
>> > >>> On Tue, Dec 22, 2015 at 11:21 AM, Chinmay Kolhatkar <
>> > >>> chinmay@datatorrent.com> wrote:
>> > >>>
>> > >>> > +1 for above.
>> > >>> > I see that there is an HbaseGetOperator, but it is abstract and I
>> > >>> > can find no concrete implementation of it.
>> > >>> > Are you going to implement that too?
>> > >>> >
>> > >>> > Maybe the concrete implementation of HbaseGetOperator should have
>> > >>> > this.
>> > >>> >
>> > >>> > Also, I want to mention one thing about scan from my previous
>> > >>> > experience of Hbase. The Hbase client is synchronous.
>> > >>> > This means that when you fire a scan call, the function blocks until
>> > >>> > a certain number of records are received at the client end.
>> > >>> > This causes a lot of problems in the current thread, as it might get
>> > >>> > blocked for a long period of time.
>> > >>> > Plus, there is always network-related latency to add to the problem.
>> > >>> >
>> > >>> > Usually the way to deal with this is to fire scan-like queries on a
>> > >>> > separate thread and then consume the results in the main thread.
>> > >>> >
>> > >>> > Please take care of this scenario while implementing the scan
>> > >>> > operator.
>> > >>> >
>> > >>> > -Chinmay.
>> > >>> >
>> > >>> >
>> > >>> > ~ Chinmay.
>> > >>> >
>> > >>> > On Tue, Dec 22, 2015 at 11:08 AM, Sandeep Deshmukh <
>> > >>> > sandeep@datatorrent.com> wrote:
>> > >>> >
>> > >>> > > +1 for this Bhupesh.
>> > >>> > >
>> > >>> > > Additionally, I would suggest adding support for:
>> > >>> > > 1. Point query
>> > >>> > > 2. Returning any row version
>> > >>> > >
>> > >>> > > The above two are key features of HBase and should be supported.
>> > >>> > >
>> > >>> > > Regards,
>> > >>> > > Sandeep
>> > >>> > >
>> > >>> > > On Fri, Dec 18, 2015 at 4:39 PM, Bhupesh Chawda <
>> > >>> > > bhupesh@datatorrent.com> wrote:
>> > >>> > >
>> > >>> > > > Hi All,
>> > >>> > > >
>> > >>> > > > The current HBasePOJOInputOperator has the following limitations:
>> > >>> > > >
>> > >>> > > >    1. There is no way to specify a set of "column family: column"
>> > >>> > > >    pairs and fetch data only for these columns.
>> > >>> > > >    2. The output format is currently a POJO. We need other output
>> > >>> > > >    formats that support the "columnFamily:column" representation.
>> > >>> > > >    Map / CSV are some of the options.
>> > >>> > > >    3. There is no way to specify an "end row-key" to stop scanning
>> > >>> > > >    a table.
>> > >>> > > >    4. There are no metrics.
>> > >>> > > >
>> > >>> > > > I am planning to add the above functionality to the HBase Input
>> > >>> > > > operators. These features may go into the HBaseScanOperator /
>> > >>> > > > HBasePOJOInputOperator.
>> > >>> > > >
>> > >>> > > > Please let me know your comments.
>> > >>> > > >
>> > >>> > > > Thanks.
>> > >>> > > >
>> > >>> > > > Bhupesh
>> > >>> > > >
>> > >>> > >
>> > >>> >
>> > >>>
>> > >>
>> > >>
>> > >
>> >
>>
>
>
