cassandra-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Proposal: freeze Thrift starting with 2.1.0
Date Wed, 12 Mar 2014 16:42:26 GMT
@Tupshin

I like that approach. Right now I think of that piece as the
"StorageProxy". I agree, over the years people have taken that approach.
Solandra is a good example, and I am guessing DSE Solr works this way.
This says something about the entire "Thrift vs. CQL" thing, as there are
clearly power users writing applications that use neither.

I do feel this vote was called to shoot down any attempt to add a feature
that was non-CQL. However, if you think you can drive something like this
forward, more power to you; I will help out.





On Wed, Mar 12, 2014 at 12:11 PM, Tupshin Harper <tupshin@tupshin.com> wrote:

> I agree that we are way off the initial topic, but I think we are spot on
> the most important topic. As seen in various tickets, including #6704 (wide
> row scanners), #6167 (end-slice termination predicate), the existence
> of intravert-ug (Cassandra interface to intravert), and a number of others,
> there is an increasing desire to do more complicated processing,
> server-side, on a Cassandra cluster.
>
> I very much share those goals, and would like to propose the following
> only partially hand-wavey path forward.
>
> Instead of creating a pluggable interface for Thrift, I'd like to create a
> pluggable interface for arbitrary app-server deep integration.
>
> Inspired by both the existence of intravert-ug and the long history of
> various parties embedding Tomcat or Jetty servlet engines inside
> Cassandra, I'd like to propose the creation of an internal, somewhat
> stable (versioned?) interface that could allow any app server to achieve
> deep integration with Cassandra. As a result, these servers could
> 1) host their own APIs (REST, for example)
> 2) extend core functionality by having limited (see triggers and wide row
> scanners) access to the internals of Cassandra
>
> The hand-wavey part comes because, while I have been mulling this over for
> a while, I have not spent any significant time looking at the actual
> surface area of intravert-ug's integration. But, using it as a model, and
> also keeping in mind the general needs of more traditional
> servlet/J2EE containers, I believe we could come up with a reasonable
> interface to allow any JVM app server to be integrated and maintained in or
> out of the Cassandra tree.
>
> This would satisfy the need that many of us (both Ed and I, for example)
> have for a much greater degree of control over server-side execution, and
> would let us start building much more interestingly (and simply) tiered
> applications.
>
> Anybody interested in working on a coherent proposal with me?
>
> -Tupshin
>
>
> On Wed, Mar 12, 2014 at 10:12 AM, Brian O'Neill <bone@alumni.brown.edu> wrote:
>
>>
>> just when you thought the thread died...
>>
>>
>> First, let me say we are *WAY* off topic.  But that is a good thing.
>> I love this community because there are a ton of passionate, smart
>> people. (often with differing perspectives ;)
>>
>> RE: Reporting against C* (@Peter Lin)
>> We've had the same experience.  Pig + Hadoop is painful.  We are
>> experimenting with Spark/Shark, operating directly against the data.
>> http://brianoneill.blogspot.com/2014/03/spark-on-cassandra-w-calliope.html
>>
>> The Shark layer gives you SQL and caching capabilities that make it easy
>> to use and fast (for smaller data sets).  In front of this, we are going to
>> add dimensional aggregations so we can operate at larger scales.  (then the
>> Hive reports will run against the aggregations)
>>
>> RE: REST Server (@Russel Bradbury)
>> We had moderate success with Virgil, a REST server that we built
>> directly on top of Thrift so that one day it could be easily embedded in
>> the C* server itself. It could be deployed separately, or run an
>> embedded C*. More often than not, we ended up running it separately to
>> separate the layers (just like Titan and Rexster). I've started on a
>> rewrite of Virgil called Memnon that rides on top of CQL. (I'd love some
>> help)
>> https://github.com/boneill42/memnon
>>
>> RE: CQL vs. Thrift
>> We've hitched our wagons to CQL.  CQL != Relational.
>> We've had success translating our "native" schemas into CQL, including
>> all the NoSQL goodness of wide-rows, etc.  You just need a good
>> understanding of how things translate into storage and underlying CFs.  If
>> anything, I think we could add some DESCRIBE information, which would help
>> users with this, along the lines of:
>> (https://issues.apache.org/jira/browse/CASSANDRA-6676)
>>
>> CQL does open up the *opportunity* for users to articulate more complex
>> queries using more familiar syntax.  (including future things such as
>> joins, grouping, etc.)   To me, that is exciting, and again -- one of the
>> reasons we are leaning on it.
>>
>> my two cents,
>> brian
>>
>> ---
>>
>> Brian O'Neill
>>
>> Chief Technology Officer
>>
>>
>> *Health Market Science*
>>
>> *The Science of Better Results*
>>
>> 2700 Horizon Drive * King of Prussia, PA * 19406
>>
>> M: 215.588.6024 * @boneill42 <http://www.twitter.com/boneill42>  *
>>
>> healthmarketscience.com
>>
>>
>>
>> From: Peter Lin <woolfel@gmail.com>
>> Reply-To: <user@cassandra.apache.org>
>> Date: Wednesday, March 12, 2014 at 8:44 AM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: Proposal: freeze Thrift starting with 2.1.0
>>
>>
>> Yes, I was looking at Intravert last night.
>>
>> For the kinds of reports my customers ask us to do, joins and subqueries
>> are important. Having tried to do a simple join in Pig, the level of pain
>> is high. I'm a masochist, so I don't mind breaking a simple join into
>> multiple MR tasks, though I do find myself asking "why the hell does it
>> need to be so painful in Pig?" Many of my friends say "what is this crap!"
>> or "this is better than writing SQL queries to run reports?"
>>
>> Plus, using ETL techniques to extract summaries only works for cases
>> where the data is small enough. Once it gets beyond a certain size, it's
>> not practical, which means we're back to crappy reporting languages that
>> make life painful. Lots of big healthcare companies have thousands of MOLAP
>> cubes on dozens of mainframes. The old OLTP -> DW/OLAP approach creates its
>> own set of management headaches.
>>
>> Being able to report directly on the raw data avoids many of the issues,
>> but that's my biased perspective.
>>
>>
>>
>>
>> On Wed, Mar 12, 2014 at 8:15 AM, DuyHai Doan <doanduyhai@gmail.com> wrote:
>>
>>> "I would love to see Cassandra get to the point where users can define
>>> complex queries with subqueries, like, group by and joins" --> Did you have
>>> a look at Intravert? I think it does union and intersection on the server
>>> side for you. Not sure about joins, though.
>>>
>>>
>>> On Wed, Mar 12, 2014 at 12:44 PM, Peter Lin <woolfel@gmail.com> wrote:
>>>
>>>>
>>>> Hi Ed,
>>>>
>>>> I agree Solr is deeply integrated into DSE. I've looked at Solandra in
>>>> the past and studied the code.
>>>>
>>>> My understanding is DSE uses Cassandra for storage and the user has
>>>> both APIs available. I do think it can be integrated further to make
>>>> moderate to complex queries easier and probably faster. That's why we built
>>>> our own JPA-like object query API. I would love to see Cassandra get to the
>>>> point where users can define complex queries with subqueries, like, group
>>>> by and joins. Clearly lots of people want these features, and even Google
>>>> built their own tools to do these types of queries.
>>>>
>>>> I see lots of people trying to improve this with Presto, Impala, Drill,
>>>> etc. To me, it's a natural progression as NoSQL databases mature. For most
>>>> people, at some point you want to be able to report/analyze the data. Today
>>>> some people use MapReduce to summarize the data and ETL it into a
>>>> relational database or OLAP database for reporting. Even though I don't
>>>> need CAS or atomic batch for what I do in Cassandra today, I'm sure in the
>>>> future it will be handy. From my experience in the financial and insurance
>>>> sector, features like CAS and "select for update" are important for the
>>>> kinds of transactions they handle. I'm biased, but these kinds of features
>>>> are a useful and good addition to Cassandra.
>>>>
>>>> These are interesting times in database land!
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 11, 2014 at 10:57 PM, Edward Capriolo <
>>>> edlinuxguru@gmail.com> wrote:
>>>>
>>>>> Peter,
>>>>> Solr is deeply integrated into DSE. Seemingly this cannot efficiently
>>>>> be done client side (CQL/Thrift, whatever), but the Solandra approach was
>>>>> to embed Solr in Cassandra. I think that is actually the future of client
>>>>> dev: allowing users to embed custom server-side logic into their own API.
>>>>>
>>>>> Things like this take a while. Back in the day no one wanted Cassandra
>>>>> to be heavyweight, and ideas like read-before-write operations were
>>>>> rejected. The common advice was "do them client side". Now, in the case of
>>>>> collections, sometimes they do read-before-write, and it is the "stuff
>>>>> users want".
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 11, 2014 at 10:07 PM, Peter Lin <woolfel@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>> I'll give you a concrete example.
>>>>>>
>>>>>> One of the things we often need to do is a keyword search on
>>>>>> unstructured text. What we did in our tooling is combine Solr with
>>>>>> Cassandra, but we put an Object API in front of it. The API is inspired
>>>>>> by JPA, but designed specifically to fit our needs.
>>>>>>
>>>>>> The user can do queries with like %blah%, and behind the scenes we
>>>>>> issue a query to Solr to find the keys and then query Cassandra for the
>>>>>> records.
>>>>>>
>>>>>> With plain Cassandra, the developer has to manually do all of this
>>>>>> stuff and integrate Solr. Then they have to know which system to query
>>>>>> and in what order. Our tooling lets the user define the schema in a
>>>>>> modeler. Once the model is done, it compiles the classes, configuration
>>>>>> files, data access objects and unit tests.
>>>>>>
>>>>>> When the application makes a call, our query classes handle the
>>>>>> details behind the scenes. I know lots of people would like to see Solr
>>>>>> integrated more deeply into Cassandra and CQL. I hope it happens in the
>>>>>> future. If DataStax accepts my talk, we will be showing our temporal
>>>>>> database and modeler in September.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Mar 11, 2014 at 9:54 PM, Steven A Robenalt <
>>>>>> srobenal@stanford.edu> wrote:
>>>>>>
>>>>>>> I should add that I'm not trying to ignite a flame war. Just trying
>>>>>>> to understand your intentions.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Mar 11, 2014 at 6:50 PM, Steven A Robenalt <
>>>>>>> srobenal@stanford.edu> wrote:
>>>>>>>
>>>>>>>> Okay, I'm officially lost on this thread. If you plan on forking
>>>>>>>> Cassandra to preserve and continue to enhance the Thrift interface, you
>>>>>>>> would also want to add a bunch of relational features to CQL as part of
>>>>>>>> that same fork?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Mar 11, 2014 at 6:20 PM, Edward Capriolo <
>>>>>>>> edlinuxguru@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> "one of the things I'd like to see happen is for Cassandra to
>>>>>>>>> support queries with disjunction, exist, subqueries, joins and like. In
>>>>>>>>> theory CQL could support these features in the future. Cassandra would need
>>>>>>>>> a new query compiler and query planner. I don't see how the current design
>>>>>>>>> could do these things without a significant redesign/enhancement. In a past
>>>>>>>>> life, I implemented an inference rule engine, so I've spent over a decade
>>>>>>>>> studying and implementing query optimizers. All of these things can be
>>>>>>>>> done, it's just a matter of people finding the time to do it."
>>>>>>>>>
>>>>>>>>> I see what you're saying. CQL started as a way to make slice easier,
>>>>>>>>> but it is not even a query language; retrofitting these things is going to
>>>>>>>>> be very hard.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Mar 11, 2014 at 7:45 PM, Peter Lin <woolfel@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I have no problem maintaining my own fork :) or joining others
>>>>>>>>>> forking Cassandra.
>>>>>>>>>>
>>>>>>>>>> I'd be happy to work with you or anyone else to add features to
>>>>>>>>>> Thrift. That's the great thing about open source. Each person can scratch a
>>>>>>>>>> technical itch and do what they love. I see lots of possibilities for
>>>>>>>>>> Cassandra, and many of them involve improving Thrift to make them happen.
>>>>>>>>>> Some of the features in theory "could" be done in CQL, but not with the
>>>>>>>>>> current design.
>>>>>>>>>>
>>>>>>>>>> One of the things I'd like to see happen is for Cassandra to
>>>>>>>>>> support queries with disjunction, exist, subqueries, joins and like. In
>>>>>>>>>> theory CQL could support these features in the future. Cassandra would need
>>>>>>>>>> a new query compiler and query planner. I don't see how the current design
>>>>>>>>>> could do these things without a significant redesign/enhancement. In a past
>>>>>>>>>> life, I implemented an inference rule engine, so I've spent over a decade
>>>>>>>>>> studying and implementing query optimizers. All of these things can be
>>>>>>>>>> done, it's just a matter of people finding the time to do it.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 11, 2014 at 6:17 PM, Edward Capriolo <
>>>>>>>>>> edlinuxguru@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Peter,
>>>>>>>>>>>
>>>>>>>>>>> My advice: do not bother. I have become very active recently in
>>>>>>>>>>> attempting to add features to Thrift. I had 4 open tickets I was actively
>>>>>>>>>>> working on. (I even found two bugs in Cassandra in the process.)
>>>>>>>>>>>
>>>>>>>>>>> People were aware of this and still called this vote. Several
>>>>>>>>>>> committers have voted +1, and my -1 vote is non-binding. It is a
>>>>>>>>>>> clear message: the committers are unwilling to accept new Thrift features
>>>>>>>>>>> even if said features are contributed by others.
>>>>>>>>>>>
>>>>>>>>>>> Edward
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Mar 11, 2014 at 5:51 PM, Peter Lin <woolfel@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> My biased opinion: even though some Cassandra developers want to
>>>>>>>>>>>> abandon Thrift, I see benefits in continuing to improve it.
>>>>>>>>>>>>
>>>>>>>>>>>> The great thing about open source is that as long as some
>>>>>>>>>>>> people want to keep working on it and improve it, it can happen. I plan to
>>>>>>>>>>>> do my best to keep Thrift going, since it gives me the fine-grained control
>>>>>>>>>>>> that I want and need. If the ultimate goal of Cassandra is to be "as close
>>>>>>>>>>>> to SQL" as practical, my biased take is: use a NewSQL database that gives
>>>>>>>>>>>> you the full power of subqueries, like, exists and disjunction.
>>>>>>>>>>>>
>>>>>>>>>>>> When customers ask me which database to choose and they really
>>>>>>>>>>>> want the relational model, I tell them to use NewSQL. I love that Cassandra
>>>>>>>>>>>> sits between NoSQL and NewSQL. There are things I do in Cassandra today that
>>>>>>>>>>>> are much harder in NewSQL or NoSQL document databases. NewSQL databases can
>>>>>>>>>>>> scale to similar sizes, so the "big" part of big data won't be a
>>>>>>>>>>>> significant advantage forever. Looking at some of the recent NewSQL
>>>>>>>>>>>> performance numbers, it's clear the gap is closing.
>>>>>>>>>>>>
>>>>>>>>>>>> peter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 11, 2014 at 3:59 PM, Tyler Hobbs <
>>>>>>>>>>>> tyler@datastax.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Mar 11, 2014 at 2:41 PM, Shao-Chuan Wang <
>>>>>>>>>>>>> shaochuan.wang@bloomreach.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, does anyone know how to do "describing the splits" and
>>>>>>>>>>>>>> "describing the local rings" using the native protocol?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> For a ring description, you would do something like "select
>>>>>>>>>>>>> peer, tokens from system.peers".  I'm not sure about describe_splits().
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also, cqlsh uses the Python client, which talks via the Thrift
>>>>>>>>>>>>>> protocol too. Does that mean it will be migrated to the native protocol
>>>>>>>>>>>>>> soon as well?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes: https://issues.apache.org/jira/browse/CASSANDRA-6307
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Tyler Hobbs
>>>>>>>>>>>>> DataStax <http://datastax.com/>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Steve Robenalt
>>>>>>>> Software Architect
>>>>>>>> HighWire | Stanford University
>>>>>>>> 425 Broadway St, Redwood City, CA 94063
>>>>>>>>
>>>>>>>> srobenal@stanford.edu
>>>>>>>> http://highwire.stanford.edu
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Steve Robenalt
>>>>>>> Software Architect
>>>>>>> HighWire | Stanford University
>>>>>>> 425 Broadway St, Redwood City, CA 94063
>>>>>>>
>>>>>>> srobenal@stanford.edu
>>>>>>> http://highwire.stanford.edu
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
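
As a rough illustration of the ring-description suggestion quoted above
("select peer, tokens from system.peers"), the sketch below turns rows of
that shape into a sorted token ring and looks up which peer holds the
primary range for a token. It assumes integer tokens (as with the Murmur3 or
Random partitioners) and takes plain (peer, tokens) tuples rather than a live
driver result. build_ring and owner_of are hypothetical helper names, not
part of any Cassandra client. Note that system.peers lists only the other
nodes, so the local node's tokens would have to be read separately from
system.local.

```python
import bisect
from typing import Iterable, List, Set, Tuple

# Query suggested in the thread; system.peers has one row per *other* node.
RING_QUERY = "SELECT peer, tokens FROM system.peers"


def build_ring(rows: Iterable[Tuple[str, Set[str]]]) -> List[Tuple[int, str]]:
    """Flatten (peer, tokens) rows into (token, peer) pairs sorted by token.

    Assumption: tokens arrive as decimal strings, which holds for the
    integer-token partitioners but not for byte-ordered ones.
    """
    ring = [
        (int(token), peer)
        for peer, tokens in rows
        for token in (tokens or ())
    ]
    ring.sort()
    return ring


def owner_of(ring: List[Tuple[int, str]], token: int) -> str:
    """Return the peer whose token is the first one >= `token`.

    Assumption: each node owns the range (previous token, own token], and
    lookups past the largest token wrap around to the smallest one.
    """
    if not ring:
        raise ValueError("ring is empty")
    tokens = [t for t, _ in ring]
    idx = bisect.bisect_left(tokens, token)
    return ring[idx % len(ring)][1]


if __name__ == "__main__":
    # Hard-coded sample rows standing in for a real query result.
    sample_rows = [
        ("10.0.0.2", {"0", "100"}),
        ("10.0.0.3", {"-50", "50"}),
    ]
    ring = build_ring(sample_rows)
    for token, peer in ring:
        print(token, peer)
    print("owner of 25:", owner_of(ring, 25))
```

With a real client, the rows would come from executing RING_QUERY against a
node; they are hard-coded here so the logic can be checked offline. This
covers only the ring description, not describe_splits().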
