cassandra-user mailing list archives

From Bill <>
Subject Re: cassandra read performance on large dataset
Date Thu, 01 Dec 2011 22:30:10 GMT
 > Our largest dataset has 1200 billion rows.

Radim, out of curiosity, how many nodes is that running across?


On 28/11/11 13:44, Radim Kolar wrote:
>> I understand that my computer may not be as powerful as those used in
>> the other benchmarks,
>> but it shouldn't be that far off (1:30), right?
> Cassandra has very fast writes; read:write throughput ratios like 1:1000 are possible.
> Pure read workload on 1 billion rows, without key/row cache, on a 2-node cluster:
> Running workload in 10 threads 1000 ops each.
> Workload took 88.59 seconds, thruput 112.88 ops/sec
> Each node can do about 240 IOPS, which works out to an average of 4 I/Os per
> read in Cassandra on a cold system.
> After the OS cache warms up enough to hold the indirect seek blocks, it gets
> faster, approaching the ideal:
> Workload took 79.76 seconds, thruput 200.59 ops/sec
> Ideal Cassandra read performance (without caches) is 2 I/Os per read:
> one I/O to read the index, a second to read the data.
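The arithmetic behind those figures can be sanity-checked; a minimal sketch, using only the numbers quoted above (2 nodes, ~240 IOPS each, 4 I/Os per cold read, 2 I/Os ideally):

```python
# Estimate achievable read throughput from raw disk IOPS.
# All figures come from the benchmark runs quoted above.

def reads_per_sec(nodes: int, iops_per_node: int, ios_per_read: float) -> float:
    """Aggregate random-read throughput the cluster's disks allow."""
    return nodes * iops_per_node / ios_per_read

cold = reads_per_sec(nodes=2, iops_per_node=240, ios_per_read=4)  # cold caches
warm = reads_per_sec(nodes=2, iops_per_node=240, ios_per_read=2)  # ideal: index + data

print(cold)  # 120.0 -- close to the measured 112.88 ops/sec
print(warm)  # 240.0 -- the warmed-up run measured 200.59 ops/sec
```

The measured cold throughput (112.88 ops/sec) sits just under the 120 ops/sec the disks allow, which supports the 4-I/Os-per-read estimate.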
> Pure write workload:
> Running workload in 40 threads 100000 ops each.
> Workload took 302.51 seconds, thruput 13222.62 ops/sec
> Writes are slow here because the nodes are running out of memory, most likely
> due to memory leaks in the 1.0 branch. Also, writes in this test are not batched.
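Batching, which the write test above skips, amortizes per-request overhead by sending many rows per request. A minimal sketch of grouping rows into fixed-size batches client-side (the names are hypothetical, not a Cassandra driver API; a real client would submit each batch as one request):

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[str, str]  # (key, value) -- placeholder row shape

def batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield rows in fixed-size batches; the last batch may be short."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

rows = [(f"k{i}", f"v{i}") for i in range(10)]
print([len(b) for b in batches(rows, 4)])  # [4, 4, 2]
```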
> Cassandra is really awesome for its price tag. Getting similar numbers out of
> Oracle will cost you far too much: for the price of one 2-core Oracle licence
> suitable for processing large data you can get about 8 Cassandra nodes, and
> don't forget that Oracle needs some hardware too. Transactions are not always
> needed for data warehousing: if you are importing chunks of data, you do not
> need rollbacks; just schedule failed chunks for later processing. If you can
> code your app to work without transactions, Cassandra is the way to go.
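The schedule-failed-chunks-instead-of-rollback idea can be sketched as follows (all names are hypothetical; `load_chunk` stands in for whatever actually writes a chunk to the store):

```python
from collections import deque
from typing import Callable, Iterable, List

def import_chunks(chunks: Iterable[List[dict]],
                  load_chunk: Callable[[List[dict]], None],
                  max_rounds: int = 3) -> List[List[dict]]:
    """Import chunks without transactions: instead of rolling back,
    failed chunks are queued and retried in later rounds."""
    pending = deque(chunks)
    for _ in range(max_rounds):
        retry = deque()
        while pending:
            chunk = pending.popleft()
            try:
                load_chunk(chunk)
            except Exception:
                retry.append(chunk)  # schedule for a later round
        if not retry:
            return []
        pending = retry
    return list(pending)  # chunks that never succeeded

# Toy loader that fails on its first attempt for one chunk.
attempts = {}
def flaky_loader(chunk):
    key = id(chunk)
    attempts[key] = attempts.get(key, 0) + 1
    if chunk and chunk[0].get("bad") and attempts[key] == 1:
        raise IOError("transient failure")

failed = import_chunks([[{"id": 1}], [{"bad": True}]], flaky_loader)
print(failed)  # [] -- the flaky chunk succeeded on retry
```

Note that this only works if loading a chunk is idempotent, so that a chunk which partially succeeded can safely be replayed in full.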
> Hadoop and Cassandra are very good products for working with large data,
> basically for just the price of learning a new technology. Usually Cassandra
> is deployed first; it is easy to get running, and day-to-day operations are
> simple. Hadoop follows later, once you discover that Cassandra is not really
> suitable for large batch jobs because it relies on random access for reading
> data.
> We finished migrating from a commercial SQL database to Hadoop/Cassandra in
> 3 months; not only does it cost 10x less, we can process datasets about 100
> times larger. Our largest dataset has 1200 billion rows.
> Problems with this setup are:
> - bloom filters use too much memory; they should be configurable for
>   applications where read performance is unimportant
> - node startup is really slow
> - data loaded into Cassandra is about 2 times bigger than the CSV export
>   (not really a problem, disk space is cheap, but the per-row overhead is
>   fairly high)
> - writing applications is harder than coding for an SQL backend, and Hadoop
>   is much harder to use than Cassandra
> - lack of good import/export tools for Cassandra, and especially lack of
>   monitoring
> - you must know workarounds for Hadoop bugs; Hadoop is not easy to use
>   efficiently
> - index overhead is too big (about 100% slower) compared with index overhead
>   in SQL databases (about 20% slower)
> - no delete over an index
> - repair is slow
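On the bloom filter point: later Cassandra releases did make the false-positive rate tunable per table, trading read performance for memory. A sketch in CQL (keyspace and table names are hypothetical):

```sql
-- Raise the allowed false-positive rate to shrink the bloom filters;
-- setting it to 1.0 effectively disables them, at the cost that every
-- read may have to touch the SSTables on disk.
ALTER TABLE mykeyspace.events WITH bloom_filter_fp_chance = 0.1;
```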
