incubator-cassandra-user mailing list archives

From Dan Kuebrich <>
Subject Re: solandra or pig or....?
Date Tue, 21 Jun 2011 19:16:28 GMT
Solandra is indeed distributed search, not distributed number-crunching.  As
a previous poster said, you could imagine structuring the data in a series
of documents with fields containing playername, teamname, position,
location, day, time, inning, at bat, outcome, etc.  Then you could query to
get a slice of the data that matches your predicate and run statistics on
that subset.
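
To make that concrete, here is a rough Python sketch of one such document and of a filter query that pulls back a slice. The field names come from the list above; the sample values, the `build_select_url` helper, and the endpoint URL are assumptions for illustration only, not something taken from a real Solandra setup. `q`, `fq`, `wt` and `rows` are standard Solr query parameters.

```python
from urllib.parse import urlencode

# One at-bat as a flat document. Field names follow the list above;
# the values are made up for illustration.
example_doc = {
    "playername": "Example Player",
    "teamname": "Example Team",
    "position": "1B",
    "location": "Cheektowaga",
    "day": "Tuesday",
    "inning": 7,
    "outcome": "grandslam",
}


def build_select_url(base_url, filters, rows=100):
    """Build a Solr select URL matching every field:value pair in filters.

    Hypothetical helper: values are wrapped in double quotes so spaces stay
    literal, but embedded quotes are not escaped in this sketch.
    """
    params = [("q", "*:*"), ("wt", "json"), ("rows", str(rows))]
    # One fq (filter query) parameter per field, in a stable order.
    for field, value in sorted(filters.items()):
        params.append(("fq", '{}:"{}"'.format(field, value)))
    return base_url + "?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solandra/stats/select",  # hypothetical endpoint
    {"day": "Tuesday", "outcome": "grandslam"},
)
```

Each extra predicate is just another `fq` entry, which is where the search granularity comes from.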

The statistics would have to come from other code (e.g. R), but Solr will
filter the data for you.  This approach only works if the slices are
reasonably small, but it gives you great granularity on search as long as
you put all the info in.  The users of this datastore (or you) must be
willing to write their own simple aggregation functions ("show me only the
unique player names returned by this Solr query", "show me the average of
field X returned by this Solr query", ...).
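
As a rough idea of how small those aggregation functions can be, here is a Python sketch that works on a list of result documents already fetched from a query. The function names, the `rbi` field, and the sample data are all made up for illustration; real documents would come from the query response.

```python
def unique_values(docs, field):
    """Distinct values of `field` across docs, in first-seen order."""
    values = (doc[field] for doc in docs if field in doc)
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(values))


def average(docs, field):
    """Mean of a numeric field; docs lacking the field are skipped.

    Returns None when no document carries the field.
    """
    values = [doc[field] for doc in docs if field in doc]
    return sum(values) / len(values) if values else None


# Stand-in for the list of documents returned by a query (made-up data).
docs = [
    {"playername": "A. Example", "rbi": 4},
    {"playername": "B. Sample", "rbi": 1},
    {"playername": "A. Example", "rbi": 2},
]

players = unique_values(docs, "playername")
mean_rbi = average(docs, "rbi")
```

Both functions hold the whole slice in memory, which is why this only suits reasonably small result sets.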

If the number of results is too large, MapReduce (MR) may be the way to go.

On Tue, Jun 21, 2011 at 3:04 PM, Victor K. <> wrote:

> If I may ask, Sasha, what exactly are you trying to achieve using Solr (or
> Solandra, I guess it's about the same)?
> Because from what I understood of your problem, you need to do statistics on
> your matches, players, etc. Or do you just want to retrieve information
> that has already been computed?
> If it is the first thing you are trying to achieve (data aggregation,
> statistics, etc.), Solr won't be of much use because it is not meant to do
> statistics. If you want to achieve the second, then Solr is just the tool for
> you.
> On 6/21/2011 2:47 PM, Sasha Dolgy wrote:
>> Without getting overly complicated and long winded ... are there
>> practical references / examples I can review that demonstrate the
>> cassandra/solandra benefits....i had a quick look at
>> it wasn't
>> dead obvious to me....
>> On Tue, Jun 21, 2011 at 8:19 PM, Jake Luciani<>  wrote:
>>> Solandra can answer the question you used as an example, and it's more of
>>> a fit for low-latency ad-hoc reporting than Pig.  Pig queries will take
>>> minutes, not seconds.
>>> On Tue, Jun 21, 2011 at 12:12 PM, Sasha Dolgy<>  wrote:
>>>> Folks,
>>>> Simple question ... Assuming my current use case is the ability to log
>>>> lots of trivial and seemingly useless sports statistics ... I want a
>>>> user to be able to query / compare .... For example:
>>>> -->  Show me all baseball players in cheektowaga and ontario,
>>>> california who have hit a grandslam on tuesdays where it was just a
>>>> leap year.
>>>> Each baseball player is represented by a single row in a CF:
>>>> player_uuid, fullname, hometown, game1, game2, game3, game4
>>>> Games are UUIDs that are a reference to another row in the same CF
>>>> that provides information about that game...
>>>> location, final score, date (unix timestamp or ISO format), and
>>>> statistics which are represented as a new column timestamp:player_uuid
>>>> I can use PIG, as I understand, to run a query to generate specific
>>>> information about specific "things" and populate that data back into
>>>> Cassandra in another CF ... similar to the hypothetical search
>>>> the information is structured already, i assume PIG is the
>>>> right tool for the job, but may not be ideal for a web application and
>>>> enabling ad-hoc queries ... it could take anywhere from 2-....?
>>>> seconds for that query to generate, populate, and return to the
>>>> user...?
>>>> On the other hand, I have started to read about Solr / Solandra /
>>>> Lucandra .... can this provide similar functionality or better ?  or
>>>> is it more geared towards full text search and indexing ...
>>>> I don't want to get into the habit of guessing what my potential users
>>>> want to search for ... trying to think of ways to offload this to
>>>> them.
>>>> --
>>>> Sasha Dolgy
>>> --