apex-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chandni Singh <chan...@datatorrent.com>
Subject Re: Cassandra Operator Code Review
Date Thu, 10 Dec 2015 23:09:22 GMT
Thanks Russel!

We will make these fixes. Will get back to you in case we have more
questions.

Have created a JIRA:
https://malhar.atlassian.net/browse/MLHR-1935

Thanks
Chandni

On Thu, Dec 10, 2015 at 2:47 PM, Russell Spitzer <russell@datastax.com>
wrote:

> Hi, I'm Russell and Software Engineer at DataStax and I work on the Spark
> Cassandra Connector. I am excited about Apex as a great streaming solution
> so I took a at the integration with C* and I had a few comments
>
> https://github.com/apache/incubator-apex
>
> -malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/cassandra/AbstractCassandraTransactionableOutputOperator.java
>
> This behavior is a bit scary for me, building large batches like this
> (especially non partition specific) can lead to some stability problems
> over time. Hint build up can be a concern since those are stored in C* as
> well Pre C* 3.0. Originally the spark C* connector used batches of 64kb but
> this caused a large amount of problems on clusters with a HI RF or poorly
> provisioned setups. Some method to lock the total batch size down may be
> useful.
>
> The other issue is that the "Atomicity" of the batch is a point of serious
> fights within the C* community. One of the biggest issues being that do to
> the nature of repair and entropy in the system the Atomicity of a batch
> cannot be guaranteed in a traditional database sort of way. The guarantee
> breaks completely in Multi-DC environments for example.
>
> All this said, it is probably sufficient from a Data Loss perspective if
> the CL is high enough and the batches are small enough.
>
> https://issues.apache.org/jira/browse/CASSANDRA-10701
>
> There are some other "Caveats" to batches that you should also be aware of.
> For example a batch containing INSERT ( 1, 2, 1) and  INSERT (1 ,1, 2) will
> treat these inserts as having occurred at the same timestamp (unless they
> are manually adjusted). Which will end up with a row (1 , 2, 2) based on
> the greatest value of colliding rows.
>
> I also don't see a way to adjust Consistency Level here?
>
> https://github.com/apache/incubator-apex
>
> -malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/cassandra/CassandraPOJOInputOperator.java
>
> The metadata for any give table can be retrieved without running a query
> via the driver's Cluster's metadata object. May be better for future
> proofing?
>
> This class may also want to allow the pushdown of projections to C* to
> limit columns retrieved or if ambitious, pushdown of clustering column
> predicates.
>
>
>
> https://github.com/apache/incubator-apex
>
> -malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/cassandra/CassandraPOJOOutputOperator.java
>
> Same metadata comment as with Input and CL comment
>
> https://github.com/apache/incubator-apex
>
> -malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/cassandra/CassandraStore.java
>
>
> https://github.com/apache/incubator-apex
>
> -malhar/blob/master/contrib/src/main/java/com/datatorrent/contrib/cassandra/CassandraTransactionalStore.java
>
>
> I still haven't gotten to the Transaction Store but hopefully I can take a
> good read later.
>
> Thanks for your time,
> Russ
> --
> http://datastax.com/all/images/cs_logo_color_sm.png
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message