On Sat, Apr 21, 2012 at 1:05 AM, Jake Luciani <jakers@gmail.com> wrote:
What other solutions are you considering? Any OLTP-style access to 200TB of data will require substantial IO.

We currently use a database written in-house because, when we first started our system, there was nothing that handled our problem economically. We would like to use something more off the shelf to reduce maintenance and development costs.

We've been looking at Hadoop for the computational component. However, it looks like HDFS does not map well to our storage patterns, as the latency is quite high. In addition, the resilience model of the NameNode is a concern in our environment.

We were thinking through whether using Cassandra as the Hadoop data store is viable for us; however, we've come to the conclusion that it doesn't map well in this case.
 

Do you know how big your working dataset will be?  

The system is batch; jobs could range from very small up to a moderate percentage of the data set. It's even possible that we could need to read the entire data set. How much we keep resident is a cost/performance trade-off we need to make.

cheers
 

-Jake


On Fri, Apr 20, 2012 at 3:30 AM, Franc Carter <franc.carter@sirca.org.au> wrote:
On Fri, Apr 20, 2012 at 6:27 AM, aaron morton <aaron@thelastpickle.com> wrote:
Couple of ideas:

* is there repetition in the binary data ? Can you save space by implementing content addressable storage ? 
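
Content addressable storage here just means keying each chunk by a digest of its bytes, so identical chunks collapse to a single stored copy. A rough sketch of the idea (the dict-backed store below is only a stand-in for a real column family, and the function names are made up for illustration):

    import hashlib

    # Content-addressable storage sketch: a chunk's key is the SHA-256 of its
    # bytes, so writing the same chunk twice stores it only once.
    # 'store' stands in for whatever key/value backend holds the chunks
    # (e.g. a Cassandra column family) -- purely illustrative.

    def put_chunk(store, chunk):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:       # duplicate chunks collapse to one entry
            store[key] = chunk
        return key                 # callers keep the digest as the reference

    def get_chunk(store, key):
        return store[key]

    store = {}
    ref = put_chunk(store, b"some binary payload")
    assert get_chunk(store, ref) == b"some binary payload"

Of course this only saves space if the same chunk actually turns up more than once.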

The data is already very highly space-optimised. We've come to the conclusion that Cassandra is probably not the right fit for this use case this time.

cheers
 
 
Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton

On 20/04/2012, at 12:55 AM, Dave Brosius wrote:

I think your math is 'relatively' correct. If that node count is prohibitive, it would seem to me you should focus on how you can reduce the amount of storage you are using per item, if at all possible.

On 04/19/2012 07:12 AM, Franc Carter wrote:

Hi,

One of the projects I am working on is going to need to store about 200TB of data - generally in manageable binary chunks. However, after doing some rough calculations based on rules of thumb I have seen for how much storage should be on each node, I'm worried.

  200TB with RF=3 is 600TB = 600,000GB
  Which is 1000 nodes at 600GB per node

I'm hoping I've missed something as 1000 nodes is not viable for us.
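
For concreteness, the arithmetic spelled out, with the per-node figure pulled out as a parameter since that rule of thumb is the assumption doing all the work (the numbers are just the ones above):

    # Back-of-the-envelope cluster sizing from the figures above.
    raw_tb = 200.0        # usable data
    rf = 3                # replication factor
    per_node_tb = 0.6     # rule-of-thumb ~600GB of data per node

    total_tb = raw_tb * rf             # 600 TB on disk across the cluster
    nodes = total_tb / per_node_tb     # 1000 nodes
    print(total_tb, nodes)             # 600.0 1000.0

    # The node count scales directly with per-item size and inversely with
    # per-node density, so halving one or doubling the other halves the count.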

cheers

--
Franc Carter | Systems architect | Sirca Ltd
Level 9, 80 Clarence St, Sydney NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215


--
Franc Carter | Systems architect | Sirca Ltd
franc.carter@sirca.org.au | www.sirca.org.au
Tel: +61 2 9236 9118
Level 9, 80 Clarence St, Sydney NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215


--
http://twitter.com/tjake


--
Franc Carter | Systems architect | Sirca Ltd
franc.carter@sirca.org.au | www.sirca.org.au
Tel: +61 2 9236 9118
Level 9, 80 Clarence St, Sydney NSW 2000
PO Box H58, Australia Square, Sydney NSW 1215