From: Aaron Turner
Date: Wed, 13 Mar 2013 11:38:20 -0700
Subject: Re: data model to store large volume syslog
To: user@cassandra.apache.org

On Wed, Mar 13, 2013 at 4:23 AM, Mohan L wrote:
>
> On Fri, Mar 8, 2013 at 9:42 PM, aaron morton wrote:
>>
>> > 1). create a column family 'cfrawlog' which stores the raw log as
>> > received. row key could be 'yyyyddmmhh' (a new row is added each
>> > hour or less), each 'column name' is a uuid and the 'value' is the
>> > raw log data. Since we are also going to use this log for forensic
>> > purposes, it will help us to have all raw logs within the column
>> > family without missing any.
>>
>> As Moshe said, there is a chance of hot spotting if you are sending
>> all writes to a certain row.
>> You also need to consider how big the row will get; in general, stay
>> below about 30MB. You can go higher, but there are some implications.
>>
>> > 2). I want to create one more column family which is going to have
>> > the parsed log, so that we will use this column family to query. My
>> > question is: how do I model this CF so that it will answer the
>> > above questions? What would be the row key for this CF?
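As an aside, the hour-bucket row key and the hot-spotting caveat above can be sketched as a small key-building helper. The host prefix and the sub-hour bucket below are hypothetical choices for illustration, not something prescribed in the thread:

```python
from datetime import datetime

def row_key(host, ts, buckets_per_hour=4):
    """Build a time-bucketed row key for a log event.

    A new row per hour matches the 'yyyymmddhh' idea above; prefixing
    the host and adding a sub-hour bucket spreads writes across several
    rows, which helps avoid hot spots and keeps any one row well under
    the ~30MB guideline.
    """
    hour = ts.strftime("%Y%m%d%H")               # hour bucket
    sub = ts.minute // (60 // buckets_per_hour)  # sub-hour bucket, 0..3
    return "%s:%s:%d" % (host, hour, sub)

print(row_key("example.com", datetime(2013, 3, 5, 6, 21, 56)))
# -> example.com:2013030506:1
```

With four buckets per hour, writes for one host land in four different rows instead of one; readers just slice all four buckets for an hour and merge.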
>> Something like:
>>
>> row_key: YYYYMMDD
>> column:
>>
>> Note, I've not considered how to handle duplicate time stamps from
>> the same host.
>
> I have created a standard column family with:
>
> row_key :
> Column_Name :
> Column_Value (as JSON dump) : {"date": "2013-03-05 06:21:56",
> "hostname": "example.com", "error_message": "Starting checkpoint of
> DB.db at Tue Mar 05 2013 06:21"}
>
> I have two questions about the above model:
>
> 1). If the column_name is the same for a given row_key, then Cassandra
> will update the column_value. Is there any way to append the value in
> the same column (say, the first time do an insert and the next time do
> an append)? Does my question make sense?

You can only insert a new value, which overwrites the old
rowkey/column_name pair. The slow way is to do a read followed by a
write. Faster is to keep some kind of in-memory cache of recent
inserts, so you read from memory followed by the write; obviously,
though, that could have scaling issues. Another solution is to write
another column and concat the values on read.

> 2). Is there any way I can search/filter based on column_value? If
> not, what is the workaround to achieve this kind of column_value-based
> search/filter in Cassandra?

You can with indexes, but indexes really only work if your column_names
are known in advance; for your use case that's probably not useful. The
usual solution is to insert the same data multiple times (de-normalize
your data) so that your read queries are efficient. I.e., depending on
the query, you would probably query a different CF. Again, remember to
distribute your writes across multiple rows to avoid hot spots. For
example, if you want to search by priority and facility, you'd want to
encode that in the rowkey or column_name.

> say for example: The query below returns a subrange of the columns in
> a row. It will return all values in the range. What would be the way
> to filter the subrange output based on their column_value?
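As an aside, the de-normalization advice above might look roughly like this. This is a toy in-memory sketch (no Cassandra client); the column family names and key layouts are made up for illustration:

```python
from collections import defaultdict

# Toy stand-ins for two column families, one per query path
# (hypothetical names and key layouts, not from the thread).
events_by_host = defaultdict(dict)      # row key: host + day
events_by_priority = defaultdict(dict)  # row key: priority + day

def insert_event(day, host, priority, ts, message):
    # De-normalize: write the same event once per query path, so each
    # read becomes a single row slice instead of a value-filtered scan.
    events_by_host[host + day][ts] = message
    events_by_priority[priority + day][ts] = message

insert_event("20130305", "example.com", "err", "06:21:56",
             "Starting checkpoint of DB.db")

# Query by host or by priority without ever inspecting column values:
print(events_by_host["example.com20130305"])
print(events_by_priority["err20130305"])
```

The write amplification (one event, N inserts) is the price for reads that never have to filter on column values.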
>
> key = '2013030505example.com'
> result = col_fam.get(key, column_start='2013-03-05 05:02:11example.com',
>                      column_finish='2013-03-05 06:28:27example.com')
>
> Any help and suggestions would be greatly appreciated.

I'd suggest using a TimeUUID for the timestamp in the column name;
probably a lot wiser than rolling your own solution.

One thing I'd add is that there is no reason to duplicate information
like the hostname in both the row key and the column name. You're just
wasting storage at that point. Just put it in the rowkey and be done
with it.

That said, you should think about what other kinds of queries you need
to do. Basically, you won't be able to search for anything in the
value, only by row key and column name. So, for example, if you care
about the facility and priority, then you'll need to somehow encode
that in the row/column name. Otherwise you'll have to filter out
records post-query. So for read performance, chances are you'll have to
insert the information multiple times depending on your search
parameters.

FYI, I could have sworn someone on this list announced, a few months
ago, some kind of C*-powered syslog storage solution they had
developed. You may want to do some searches and see if you can find the
project and learn something from it.

-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows

Those who would give up essential Liberty, to purchase a little
temporary Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin

"carpe diem quam minimum credula postero"