incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: data model to store large volume syslog
Date Fri, 08 Mar 2013 16:12:27 GMT
> 1). Create a column family 'cfrawlog' which stores the raw log as received. The row key could
> be 'yyyyddmmhh' (a new row is added every hour or less), and each column name is a UUID whose
> value is the raw log data. Since we are also going to use this log for forensic purposes, it
> will help to keep every raw log entry in the column family, with nothing missing.
As Moshe said, there is a chance of hot spotting if you send all writes to a single
row.
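
One way to spread the writes is to add a bucket suffix to the time-based row key. The sketch below is only an illustration: the function names, the bucket count of 16, and the choice to derive the bucket from a hash of the host name (rather than the sequence counter Moshe suggested) are all assumptions, not something taken from this thread.

```python
import zlib
from datetime import datetime, timezone

NUM_BUCKETS = 16  # assumption: tune to cluster size and write rate


def raw_log_row_key(ts: datetime, host: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Build a row key of the form '<YYYYMMDDHH>:<bucket>'."""
    hour = ts.astimezone(timezone.utc).strftime("%Y%m%d%H")
    # crc32 is stable across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(host.encode("utf-8")) % num_buckets
    return f"{hour}:{bucket:02d}"


def row_keys_for_hour(ts: datetime, num_buckets: int = NUM_BUCKETS) -> list:
    """All row keys a reader has to fetch to cover one hour."""
    hour = ts.astimezone(timezone.utc).strftime("%Y%m%d%H")
    return [f"{hour}:{b:02d}" for b in range(num_buckets)]
```

With this layout a read for one hour has to fetch every bucket and merge the results, while a read for a single host only needs that host's bucket. A very noisy host would still concentrate its writes on one row per hour, so a counter-based bucket may suit that case better.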
You also need to consider how big the row will get; in general, stay below about 30MB. You
can go higher, but there are some implications.
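
To get a feel for the row size, a back-of-the-envelope estimate helps. The sketch below is a rough calculation only: the message rate and average message size are made-up inputs, and it ignores column names and per-column storage overhead, so real rows will be somewhat larger.

```python
import math

ROW_SIZE_LIMIT_BYTES = 30 * 1024 * 1024  # the ~30MB guideline mentioned above


def estimated_row_bytes(msgs_per_second: int, avg_msg_bytes: int, row_seconds: int) -> int:
    """Raw payload written during the time span covered by one row key."""
    return msgs_per_second * avg_msg_bytes * row_seconds


def min_buckets(msgs_per_second: int, avg_msg_bytes: int, row_seconds: int) -> int:
    """Smallest number of rows per time span that keeps each row under the limit."""
    total = estimated_row_bytes(msgs_per_second, avg_msg_bytes, row_seconds)
    return max(1, math.ceil(total / ROW_SIZE_LIMIT_BYTES))


# Hypothetical load: 500 messages/s of about 200 bytes each, one row per hour.
print(estimated_row_bytes(500, 200, 3600))  # 360000000 bytes, well over 30MB
print(min_buckets(500, 200, 3600))          # 12
```

Under those assumed numbers an hourly row would be more than ten times the guideline, so you would either split each hour across at least 12 rows or use a shorter time span per row.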


> 2). I want to create one more column family to hold the parsed log, so that we can use
> it for queries. My question is: how should this CF be modelled so that it can answer the
> questions above? What would the row key be for this CF?
Something like:

row_key: YYYYMMDD
column: <host:timestamp:>

Note: I've not considered how to handle duplicate timestamps from the same host.
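
One possible way to deal with duplicates is to append a unique component to the column name. The sketch below models the columns of one row as a sorted Python list of (host, timestamp, unique id) tuples; it is not a Cassandra client call, and the function names and the use of a random UUID as the tie-breaker are assumptions for illustration.

```python
import bisect
import uuid
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_micros(ts: datetime) -> int:
    """Exact microseconds since the Unix epoch (avoids float rounding)."""
    return (ts - EPOCH) // timedelta(microseconds=1)


def parsed_log_column(host: str, ts: datetime) -> tuple:
    """Column name: host, then timestamp, then a unique tie-breaker."""
    return (host, to_micros(ts), uuid.uuid4().hex)


def slice_for_host(columns: list, host: str, start: datetime, end: datetime) -> list:
    """Columns for one host with start <= timestamp < end.

    `columns` must be sorted, as column names are within a row.
    """
    lo = bisect.bisect_left(columns, (host, to_micros(start), ""))
    hi = bisect.bisect_left(columns, (host, to_micros(end), ""))
    return columns[lo:hi]
```

Because the host comes first, all entries for one host in a time range sit next to each other, and two entries with the same host and timestamp no longer overwrite each other. A time range across all hosts would need one slice per host with this ordering.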

> 3). Does the above data model make sense?
Sort of.
Do some googling for Cassandra and log data, and look at https://github.com/thobbs/logsandra


Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 7/03/2013, at 4:16 AM, moshe.kranc@barclays.com wrote:

> A row key based on the hour will create hot spots for writes: for an entire hour, all the
> writes will go to the same node, i.e., the node where the row resides. You need to come
> up with a row key that distributes writes evenly across all your C* nodes, e.g., time concatenated
> with a sequence counter.
>  
> From: Mohan L [mailto:l.mohanphy@gmail.com] 
> Sent: Thursday, March 07, 2013 2:10 PM
> To: user@cassandra.apache.org
> Subject: data model to store large volume syslog
>  
> 
> Dear All,
> 
> I am looking at Cassandra to store time series data (mostly syslog). The volume of data is
> very large, and many entries share the same timestamp. Each record contains the following
> fields:
>   
> timestamps:host-name:facility:message
> 
> These are the queries that need to be supported:
> 
> 
> 1). Need to get data between time X and Y
> 2). Need to get data between time X and Y for a host-name.
> 3). Need to search a 'pattern' in the message
> 
> The data model I am considering is:
> 
> 1). Create a column family 'cfrawlog' which stores the raw log as received. The row key could
> be 'yyyyddmmhh' (a new row is added every hour or less), and each column name is a UUID whose
> value is the raw log data. Since we are also going to use this log for forensic purposes, it
> will help to keep every raw log entry in the column family, with nothing missing.
> 
> 2). I want to create one more column family to hold the parsed log, so that we can use
> it for queries. My question is: how should this CF be modelled so that it can answer the
> questions above? What would the row key be for this CF?
> 
> 3). Does the above data model make sense?
> 
> Any help or suggestions would be greatly appreciated.
> 
> 
> Thanks
> Mohan L
> 
> 

