nifi-commits mailing list archives

From "Benjamin Janssen (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NIFI-901) Create processors to get/put data with Apache Cassandra
Date Wed, 14 Oct 2015 04:31:05 GMT

    [ https://issues.apache.org/jira/browse/NIFI-901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14956226#comment-14956226 ]

Benjamin Janssen edited comment on NIFI-901 at 10/14/15 4:30 AM:
-----------------------------------------------------------------

Been brushing up on CQL and I'm starting to foresee some difficulties.

First is the issue that, with CQL, Cassandra gives up a lot of the schema-less flexibility of NoSQL
databases.  There is no longer a way (from what I've been able to gather) to refer to a row by
row name + column name.  Instead, each table must have a schema assigned to it, with the row and
column names constructed from the fields that make up the "primary key" of the SQL-like language.
This makes it difficult to build a simple generic processor that reads the row and column from
FlowFile attributes and dumps the content into the cell; it requires that the processor somehow
be schema-aware.
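
To make that concrete, here's a rough CQL sketch; the table and column names are invented purely
for illustration:

    -- Hypothetical table; names are made up for illustration only.
    CREATE TABLE sensor_readings (
        sensor_id  text,       -- first primary key field: the legacy row key (partition key)
        reading_ts timestamp,  -- remaining primary key field: the legacy column key (clustering column)
        value      double,     -- a regular column: lives inside the stored cell
        PRIMARY KEY (sensor_id, reading_ts)
    );

    -- A write has to name the schema's columns explicitly; there is no generic
    -- "put this value at (row, column)" statement a schema-unaware processor could issue.
    INSERT INTO sensor_readings (sensor_id, reading_ts, value)
    VALUES ('sensor-42', '2015-10-14 04:30:00', 21.5);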

For batching purposes on the Put side of things, the CQL3 documentation seems to imply that
batching should not be used when seeking performance improvements (http://docs.datastax.com/en/cql/3.1/cql/cql_using/useBatch.html),
but that guidance seems to be directed mostly at the BATCH construct.  I think it would be fine
to batch (without using the BATCH keyword) by buffering updates to a single primary key (note
that "primary key" in CQL refers to the combination of fields that defines the row AND column
that will be written to).  I'm not sure this level of buffering is worth doing.
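
Roughly, the two options look like this (reusing the hypothetical sensor_readings table from the
sketch above; whether the buffering variant buys us anything is exactly what I'm unsure about):

    -- The BATCH construct the DataStax page cautions against using purely for throughput:
    BEGIN BATCH
        INSERT INTO sensor_readings (sensor_id, reading_ts, value)
            VALUES ('sensor-42', '2015-10-14 04:30:00', 21.5);
        INSERT INTO sensor_readings (sensor_id, reading_ts, value)
            VALUES ('sensor-42', '2015-10-14 04:30:05', 21.7);
    APPLY BATCH;

    -- The buffering idea: no BATCH keyword, just group statements that target the same
    -- partition (same first primary key field) and issue them together from the processor.
    INSERT INTO sensor_readings (sensor_id, reading_ts, value)
        VALUES ('sensor-42', '2015-10-14 04:30:00', 21.5);
    INSERT INTO sensor_readings (sensor_id, reading_ts, value)
        VALUES ('sensor-42', '2015-10-14 04:30:05', 21.7);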

Combining these two issues, I'm wondering whether FlowFiles should be structured so that they have
no content and the information to insert is contained solely within the attributes, or whether the
content should instead be required to be in a JSON-style format that defines the relevant
information for the update.  I think both of these approaches would limit the overall size of the
entry that could be inserted, but I'm not sure we want to be loading particularly huge objects
into Cassandra anyway.
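
As a strawman for the JSON option, the FlowFile content might look something like the following;
the field names are hypothetical and nothing here is settled:

    {
      "table": "sensor_readings",
      "values": {
        "sensor_id": "sensor-42",
        "reading_ts": "2015-10-14 04:30:00",
        "value": 21.5
      }
    }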

Thoughts?

Background for those not familiar with Cassandra and CQL:

Cassandra's original data model dealt with keyspaces, column families, row keys, column keys,
and cells.  Its new data model (exposed via the CQL API, which attempts to mimic SQL) essentially
abstracts away all of these underlying constructs.

Column families are replaced by "tables" in CQL.  The row key and column key are both replaced by
the "primary key" concept from SQL.  The first entry in the "primary key" is treated as the legacy
row key, and the other entries are combined to form the legacy column key.  So your typical
SQL-style columns in the CQL language are not necessarily columns at all in the Cassandra backend:
they could be part of the row key, part of the column key, or just one part of the cell.
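
For example, a single CQL row from the hypothetical sensor_readings table above maps onto the
legacy model roughly like this (the internal layout is simplified; exact naming varies by
Cassandra version):

    CQL view:     sensor_id='sensor-42', reading_ts='2015-10-14 04:30:00', value=21.5
    Legacy view:  row key    = 'sensor-42'                       (first primary key field)
                  column key = '2015-10-14 04:30:00' : 'value'   (clustering value + CQL column name)
                  cell value = 21.5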

The big thing is that the concept of a "cell" is really no longer present in the CQL data model,
and writing a processor designed to write the contents of a FlowFile to a single cell does not
really work if we want to use modern Cassandra clients to interact with the cluster.



> Create processors to get/put data with Apache Cassandra
> -------------------------------------------------------
>
>                 Key: NIFI-901
>                 URL: https://issues.apache.org/jira/browse/NIFI-901
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Joseph Witt
>              Labels: beginner
>             Fix For: 0.4.0
>
>
> Develop processors to interact with Apache Cassandra.  The current http processors may
> actually support this as is but such configuration may be too complex to provide the quality
> user experience desired.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
