From: Aaron Turner
Date: Wed, 13 Mar 2013 11:38:20 -0700
Subject: Re: data model to store large volume syslog
To: user@cassandra.apache.org

On Wed, Mar 13, 2013 at 4:23 AM, Mohan L wrote:
>
> On Fri, Mar 8, 2013 at 9:42 PM, aaron morton wrote:
>>
>> > 1). create a column family 'cfrawlog' which stores the raw log as
>> > received. row key could be 'yyyyddmmhh' (a new row is added each
>> > hour or less), each 'column name' is a uuid and the 'value' is the
>> > raw log data. Since we are also going to use this log for forensic
>> > purposes, it will help us to have all raw logs within the column
>> > family without missing any.
>>
>> As Moshe said, there is a chance of hot spotting if you are sending
>> all writes to a certain row.
>> You also need to consider how big the row will get; in general, stay
>> below about 30MB. You can go higher, but there are some implications.
>>
>> > 2). I want to create one more column family which is going to have
>> > the parsed log, so that we will use this column family to query. My
>> > question is: how do I model this CF so that it will answer the
>> > above questions? What would be the row key for this CF?
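As an aside, the hour-bucket row key and the hot-spotting caveat above can be sketched as a small key-building helper. The host prefix and the sub-hour bucket below are hypothetical choices for illustration, not something prescribed in the thread:

```python
from datetime import datetime

def row_key(host, ts, buckets_per_hour=4):
    """Build a time-bucketed row key for a log event.

    A new row per hour matches the 'yyyymmddhh' idea above; prefixing
    the host and adding a sub-hour bucket spreads writes across several
    rows, which helps avoid hot spots and keeps any one row well under
    the ~30MB guideline.
    """
    hour = ts.strftime("%Y%m%d%H")               # hour bucket
    sub = ts.minute // (60 // buckets_per_hour)  # sub-hour bucket, 0..3
    return "%s:%s:%d" % (host, hour, sub)

print(row_key("example.com", datetime(2013, 3, 5, 6, 21, 56)))
# -> example.com:2013030506:1
```

With four buckets per hour, writes for one host land in four different rows instead of one; readers just slice all four buckets for an hour and merge.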
>> Something like:
>>
>> row_key: YYYYMMDD
>> column:
>>
>> Note, I've not considered how to handle duplicate time stamps from
>> the same host.
>
> I have created a standard column family with:
>
> row_key :
> Column_Name :
> Column_Value (as JSON dump) : {"date": "2013-03-05 06:21:56",
> "hostname": "example.com", "error_message": "Starting checkpoint of
> DB.db at Tue Mar 05 2013 06:21"}
>
> I have two questions about the above model:
>
> 1). If the column_name is the same for a given row_key, then Cassandra
> will update the column_value. Is there any way to append the value in
> the same column (say, the first time do an insert and the next time do
> an append)? Does my question make sense?

You can only insert a new value, which overwrites the old
rowkey/column_name pair. The slow way is to do a read followed by a
write. Faster is to keep some kind of in-memory cache of recent
inserts, so you read from memory followed by the write; obviously,
though, that could have scaling issues. Another solution is to write
another column and concat the values on read.

> 2). Is there any way I can search/filter based on column_value? If
> not, what is the workaround to achieve this kind of column_value-based
> search/filter in Cassandra?

You can with indexes, but indexes really only work if your column_names
are known in advance; for your use case that's probably not useful. The
usual solution is to insert the same data multiple times (de-normalize
your data) so that your read queries are efficient. I.e., depending on
the query, you would probably query a different CF. Again, remember to
distribute your writes across multiple rows to avoid hot spots. For
example, if you want to search by priority and facility, you'd want to
encode that in the rowkey or column_name.

> say for example: The query below returns a subrange of the columns in
> a row. It will return all values in the range. What would be the way
> to filter the subrange output based on their column_value?
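As an aside, the de-normalization advice above might look roughly like this. This is a toy in-memory sketch (no Cassandra client); the column family names and key layouts are made up for illustration:

```python
from collections import defaultdict

# Toy stand-ins for two column families, one per query path
# (hypothetical names and key layouts, not from the thread).
events_by_host = defaultdict(dict)      # row key: host + day
events_by_priority = defaultdict(dict)  # row key: priority + day

def insert_event(day, host, priority, ts, message):
    # De-normalize: write the same event once per query path, so each
    # read becomes a single row slice instead of a value-filtered scan.
    events_by_host[host + day][ts] = message
    events_by_priority[priority + day][ts] = message

insert_event("20130305", "example.com", "err", "06:21:56",
             "Starting checkpoint of DB.db")

# Query by host or by priority without ever inspecting column values:
print(events_by_host["example.com20130305"])
print(events_by_priority["err20130305"])
```

The write amplification (one event, N inserts) is the price for reads that never have to filter on column values.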
>
> key = '2013030505example.com'
> result = col_fam.get(key, column_start='2013-03-05 05:02:11example.com',
>                      column_finish='2013-03-05 06:28:27example.com')
>
> Any help and suggestions would be greatly appreciated.

I'd suggest using a TimeUUID for the timestamp in the column name;
probably a lot wiser than rolling your own solution.

One thing I'd add is that there is no reason to duplicate information
like the hostname in both the row key and the column name. You're just
wasting storage at that point. Just put it in the rowkey and be done
with it.

That said, you should think about what other kinds of queries you need
to do. Basically, you won't be able to search for anything in the
value, only by row key and column name. So, for example, if you care
about the facility and priority, then you'll need to somehow encode
that in the row/column name. Otherwise you'll have to filter out
records post-query. So for read performance, chances are you'll have to
insert the information multiple times depending on your search
parameters.

FYI, I could have sworn someone on this list announced, a few months
ago, some kind of C*-powered syslog storage solution they had
developed. You may want to do some searches and see if you can find the
project and learn something from it.

-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin

"carpe diem quam minimum credula postero"