incubator-cassandra-dev mailing list archives

From Brian O'Neill <b...@alumni.brown.edu>
Subject Re: Bitmap indexes - reviving CASSANDRA-1472
Date Fri, 12 Apr 2013 14:09:42 GMT
@Jason,

I have a lot of experience with SOLR + ES, but mainly for search (i.e.,
finding the most relevant records given a query).
That's been working well, but now we have requirements to support
dashboards.  Those dashboards have aggregations in them (sum, average,
count(s), etc).  I have limited experience using filter functions and
facets to achieve similar things w/ Lucene, but they never seemed to
perform well when the sets were large.

If Lucene/SOLR/ES can support this kind of functionality, we'd gladly use
it instead. (Let me know!)

When we looked around, Druid seemed to fit the bill exactly (and it was
open source):
http://metamarkets.com/2011/druid-part-i-real-time-analytics-at-a-billion-rows-per-second/

BTW, here is more information on the compression that Druid uses:
http://metamarkets.com/2012/druid-bitmap-compression/


To echo Matt's sentiment, we'd love to leverage a C* native capability for
this.
(Acunu provides most of the capability, but it isn't open source)

I think once we have the "conditional write" semantics that are coming, we
could layer this on top of C*. (extending the secondary indexes
functionality)
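
For illustration only, the kind of query raised earlier in this thread
(all users that performed events 1, 2, and 3, but not 4) comes down to a few
bitwise operations once each event has a bitmap of user ids. The sketch below
is a toy, not Druid or C* code: it uses plain Python integers as uncompressed
bitmaps, and the names to_bitmap, to_user_ids and the sample data are all
hypothetical.

```python
# Toy sketch: one bitmap per event, where bit i is set when user i
# performed that event. Python ints act as uncompressed bitmaps; a real
# system would use a compressed format and would not hold them all in memory.

def to_bitmap(user_ids):
    """Build a bitmap (int) with one bit set per user id."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << uid
    return bitmap

def to_user_ids(bitmap):
    """Expand a bitmap back into a sorted list of user ids."""
    ids = []
    uid = 0
    while bitmap:
        if bitmap & 1:
            ids.append(uid)
        bitmap >>= 1
        uid += 1
    return ids

# Hypothetical sample data: event number -> users that performed it.
events = {
    1: to_bitmap([0, 1, 2, 5]),
    2: to_bitmap([1, 2, 5, 7]),
    3: to_bitmap([1, 2, 3, 5]),
    4: to_bitmap([2, 9]),
}

# Intersection of events 1, 2 and 3, then set subtraction of event 4.
matched = events[1] & events[2] & events[3] & ~events[4]
result = to_user_ids(matched)
print(result)
```

Unions work the same way with the | operator, so sub-query results can be
combined without touching the underlying rows again.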

-brian



---
Brian O'Neill
Lead Architect, Software Development
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42>  •
healthmarketscience.com

On 4/12/13 12:46 AM, "Matt Stump" <mrevilgnome@gmail.com> wrote:

>You could embed Lucene, but then you pretty much have DSE search, and there
>are people on this list in a better position than I to describe the
>difficulty in making that scale. By rolling your own you get simplicity
>and control. If you use a uniform index size you can just assign chunks of
>it to the cassandra ring making it easy to distribute queries. I think that
>using Lucene in this way would cause most of the benefit of the library to
>be lost, and add unnecessary complexity. If Lucene were easy, then I think
>given the team's experience with both Lucene and C* it would have been done
>already.
>
>Sorry if it's a fuzzy answer, but I haven't run down every technical angle
>on the integration with C* yet. The idea was still very much at the
>"wouldn't it be very cool if this thing lived in Cassandra" stage. It would
>be the nail in the coffin for Impala, Redshift, et al.
>
>
>On Thu, Apr 11, 2013 at 3:15 PM, Jason Rutherglen <
>jason.rutherglen@gmail.com> wrote:
>
>> What's the advantage over Lucene?
>>
>>
>> On Wed, Apr 10, 2013 at 10:43 PM, Matt Stump <mrevilgnome@gmail.com>
>> wrote:
>>
>> > Druid was our inspiration to layer bitmap indexes on top of Cassandra.
>> > Druid doesn't work for us because our data set is too large. We would need
>> > many hundreds of nodes just for the pre-processed data. What I envisioned
>> > was the ability to perform Druid-style queries (no aggregation) without the
>> > limitations imposed by having the entire dataset in memory. I primarily
>> > need to query whether a user performed some event, but I also intend to add
>> > trigram indexes for LIKE, ILIKE or possibly regex style matching.
>> >
>> > I wasn't aware of CONCISE, thanks for the pointer. We are currently
>> > evaluating fastbit, which is a very similar project:
>> > https://sdm.lbl.gov/fastbit/
>> >
>> >
>> > On Wed, Apr 10, 2013 at 5:49 PM, Brian O'Neill <bone@alumni.brown.edu>
>> > wrote:
>> >
>> > >
>> > > How does this compare with Druid?
>> > > https://github.com/metamx/druid
>> > >
>> > > We're currently evaluating Acunu, Vertica and Druid...
>> > >
>> > >
>> > >
>> > > http://brianoneill.blogspot.com/2013/04/bianalytics-on-big-datacassandra.html
>> > >
>> > > With its bitmapped indexes, Druid appears to have the most potential.
>> > > They boast some pretty impressive stats, especially WRT handling
>> > > "real-time" updates and adding new dimensions.
>> > >
>> > > They also use a compression algorithm, CONCISE, to cut down on the space
>> > > requirements.
>> > > http://ricerca.mat.uniroma3.it/users/colanton/concise.html
>> > >
>> > > I haven't looked too deep into the Druid code, but I've been meaning to
>> > > see if it could be backed by C*.
>> > >
>> > > We'd be game to join the hunt if you pursue such a beast. (with your code,
>> > > or with portions of Druid)
>> > >
>> > > -brian
>> > >
>> > >
>> > > On Apr 10, 2013, at 5:40 PM, mrevilgnome wrote:
>> > >
>> > > > What do you think about set manipulation via indexes in Cassandra? I'm
>> > > > interested in answering queries such as "give me all users that
>> > > > performed events 1, 2, and 3, but not 4". If the answer is yes, then I
>> > > > can make a case for spending my time on C*. The only downside for us
>> > > > would be that our current prototype is in C++, so we would lose some
>> > > > performance and the ability to dedicate an entire machine to
>> > > > caching/performing queries.
>> > > >
>> > > >
>> > > > On Wed, Apr 10, 2013 at 11:57 AM, Jonathan Ellis <jbellis@gmail.com>
>> > > > wrote:
>> > > >
>> > > >> If you mean, "Can someone help me figure out how to get started
>> > > >> updating these old patches to trunk and cleaning out the Avro?" then
>> > > >> yes, I've been knee-deep in indexing code recently.
>> > > >>
>> > > >>
>> > > >> On Wed, Apr 10, 2013 at 11:34 AM, mrevilgnome <mrevilgnome@gmail.com>
>> > > >> wrote:
>> > > >>
>> > > >>> I'm currently building a distributed cluster on top of Cassandra to
>> > > >>> perform fast set manipulation via bitmap indexes. This gives me the
>> > > >>> ability to perform unions, intersections, and set subtraction across
>> > > >>> sub-queries. Currently I'm storing index information for thousands of
>> > > >>> dimensions as Cassandra rows, and my cluster keeps this information
>> > > >>> cached, distributed and replicated in order to answer queries.
>> > > >>>
>> > > >>> Every couple of days I think to myself this should really exist in C*.
>> > > >>> Given all the benefits, would there be any interest in
>> > > >>> reviving CASSANDRA-1472?
>> > > >>>
>> > > >>> Some downsides are that this is very memory intensive, even for sparse
>> > > >>> bitmaps.
>> > > >>>
>> > > >>
>> > > >>
>> > > >>
>> > > >> --
>> > > >> Jonathan Ellis
>> > > >> Project Chair, Apache Cassandra
>> > > >> co-founder, http://www.datastax.com
>> > > >> @spyced
>> > > >>
>> > >
>> > > --
>> > > Brian ONeill
>> > > Lead Architect, Health Market Science (http://healthmarketscience.com)
>> > > mobile:215.588.6024
>> > > blog: http://weblogs.java.net/blog/boneill42/
>> > > blog: http://brianoneill.blogspot.com/
>> > >
>> > >
>> >
>>


