Message-ID: <4F2FD3CF.6090004@innotec-data.de>
Date: Mon, 06 Feb 2012 14:21:19 +0100
From: Heiner Bunjes <bunjes@innotec-data.de>
To: user@cassandra.apache.org
Subject: Need database to log and retrieve sensor data

I need a database to log and retrieve sensor data.

Is Cassandra the right solution for this task, and if so, how should I
set it up and which access methods should I use?
If not, which other DB system might be a better fit?


The details are as follows:

######## <requirements version="4">

Glossary

- Node = A computer on which an instance of the database
   is running

- Blip = One data record sent by a sensor

- Blip page = The sorted list of all blips for a specific sensor
   and a specific time range.


The scale is as follows:

(01) 10^6 sensors deliver 1 blip every 100 seconds
      -> Insert rate = 10 kiloblip/s
      -> Insert rate ~ 315 gigablip/year

(02) They have to be stored for ~3 years
      -> Size of database = 1 terablip

(03) Each blip is about 200 bytes in size
      -> Size of database = 200 TB

(04) The system will start with just 10^4 sensors but will
      soon grow to the volume described above.
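
To make the arithmetic behind (01) to (03) easy to check, here is a small
Python sketch. It reads the sensor count in (01) as one million, which is
what the 10 kiloblip/s rate implies. The variable names are only
illustrative, and the 365-day year, the exact 200 bytes per blip and the
absence of replication or storage overhead are simplifying assumptions.

```python
# Back-of-the-envelope sizing for requirements (01)-(03).
# Assumptions: 365-day year, exactly 200 bytes per blip,
# no replication factor and no storage overhead.

SENSORS = 10**6            # (01) number of sensors at full scale
BLIP_INTERVAL_S = 100      # (01) one blip per sensor every 100 seconds
RETENTION_YEARS = 3        # (02) how long blips are kept
BLIP_SIZE_BYTES = 200      # (03) approximate size of one blip

SECONDS_PER_YEAR = 365 * 24 * 3600

blips_per_second = SENSORS // BLIP_INTERVAL_S
blips_per_year = blips_per_second * SECONDS_PER_YEAR
blips_total = blips_per_year * RETENTION_YEARS
bytes_total = blips_total * BLIP_SIZE_BYTES

print(f"insert rate : {blips_per_second / 1e3:.0f} kiloblip/s")
print(f"per year    : {blips_per_year / 1e9:.0f} gigablip")
print(f"retained    : {blips_total / 1e12:.2f} terablip")
print(f"raw size    : {bytes_total / 1e12:.0f} TB")
```

With these inputs the retained volume comes out at roughly 0.95 terablip
and about 189 TB of raw data, which is rounded to 1 terablip and 200 TB
in the list above.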


The main operations on the data are:

(05) Add the new blips to the database
      (written blips are never changed)!

(06) Return all blips for sensor X with a timestamp
      between timestamp_a and timestamp_b!
      In other words: Return a blip page.

(07) Return all the blips specified in (06) ordered
      by timestamp!

(08) Delete all blips older than Y!
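
To pin down what operations (05) to (08) are meant to do, here is a toy
in-memory model in Python. It is only a sketch of the intended semantics
and not a client for any database. The class name BlipStore and its
method names are made up for illustration. It assumes that a blip is
identified by sensor id plus timestamp, that the range in (06) includes
both end points, and that "older than Y" means a timestamp strictly
smaller than Y.

```python
import bisect
from collections import defaultdict


class BlipStore:
    """Toy in-memory model of operations (05)-(08); not a database client."""

    def __init__(self):
        self._timestamps = defaultdict(list)  # sensor_id -> sorted timestamps
        self._payloads = {}                   # (sensor_id, timestamp) -> payload

    def add(self, sensor_id, timestamp, payload):
        # (05) Written blips are never changed, so duplicates are rejected.
        key = (sensor_id, timestamp)
        if key in self._payloads:
            raise KeyError(f"duplicate blip {key}")
        # insort keeps the list sorted even when a blip arrives late.
        bisect.insort(self._timestamps[sensor_id], timestamp)
        self._payloads[key] = payload

    def blip_page(self, sensor_id, timestamp_a, timestamp_b):
        # (06)+(07) Blips with timestamp_a <= timestamp <= timestamp_b,
        # returned in timestamp order (assumed inclusive bounds).
        ts = self._timestamps.get(sensor_id, [])
        lo = bisect.bisect_left(ts, timestamp_a)
        hi = bisect.bisect_right(ts, timestamp_b)
        return [(t, self._payloads[(sensor_id, t)]) for t in ts[lo:hi]]

    def delete_older_than(self, cutoff):
        # (08) Drop every blip whose timestamp is strictly below cutoff.
        for sensor_id, ts in self._timestamps.items():
            cut = bisect.bisect_left(ts, cutoff)
            for t in ts[:cut]:
                del self._payloads[(sensor_id, t)]
            del ts[:cut]


store = BlipStore()
store.add("sensor-42", 300, b"c")
store.add("sensor-42", 100, b"a")
store.add("sensor-42", 200, b"b")   # arrives out of chronological order
page = store.blip_page("sensor-42", 100, 200)
store.delete_older_than(200)
remaining = store.blip_page("sensor-42", 0, 1000)
```

Here page holds the blips at timestamps 100 and 200 in that order, and
after the delete only the blips at 200 and 300 remain. The real system
needs the same behaviour, but distributed over many nodes and at the
volume described above.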


Furthermore, the following holds:

(09) Each added blip is clearly (without ambiguity) identifiable by
      sensor_id+timestamp.

(10) 99.9% of the blips are inserted in
      chronological order; the rest are not.

(11) The database system MUST be free and open source.

(12) The DB SHOULD be easy to administer.

(13) All data MUST remain writable and readable as long as fewer
      than a configurable number N of nodes are down (unexpectedly).

(14) The mechanisms to distribute the data to the available
      nodes SHOULD be handled by the database.
      This means that the database SHOULD automatically
      redistribute the data when nodes are added or removed.

(15) The project is mainly implemented in Erlang, so there must be
      a stable Erlang interface for database access.

######## </requirements>


Many thanks in advance
Heiner