cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Takayuki Tsunakawa" <tsunakawa.ta...@jp.fujitsu.com>
Subject Re: [Q] MapReduce behavior and Cassandra's scalability for petabytes of data
Date Tue, 26 Oct 2010 01:09:24 GMT
Hello, Edward,

Thank you for giving me insight about large disk nodes.

From: "Edward Capriolo" <edlinuxguru@gmail.com>
> Index sampling on start up. If you have very small rows your indexes
> become large. These have to be sampled on start up and sampling our
> indexes for 300Gb of data can take 5 minutes. This is going to be
> optimized soon.

5 minutes for 300 GB data ... it's not cheap, is it? Simply, 3 TB of
data will leat to 50 minutes just for computing input splits. This is
too expensive when I want only part of the 3 TB data.


> (Just wanted to note some of this as I am in the middle of a process
> of joining a node now :)

Good luck. I'd appreciate if you could some performance numbers of
joining nodes (amount of data, time to distribute data, load impact on
applications, etc) if you can. The cluster our customer is thinking of
is likely to become very large, so I'm interested in the elasticity.
Yahoo!'s YCSB report makes me worry about adding nodes.

Regards,
Takayuki Tsunakawa


From: "Edward Capriolo" <edlinuxguru@gmail.com>
[Q3]
There are some challenges with very large disk nodes.
Caveats:
I will use words like "long", "slow", and "large" relatively. If you
have great equipment IE. 10G Ethernet between nodes it will not take
"long" to transfer data. If you have an insane disk pack it may not
take "long" to compact 200GB of data. I am basing these statements on
server class hardware. ~32 GB ram ~2x processor, ~6 disk SAS RAID.

Index sampling on start up. If you have very small rows your indexes
become large. These have to be sampled on start up and sampling our
indexes for 300Gb of data can take 5 minutes. This is going to be
optimized soon.

Joining nodes: When you go with larger systems joining a new node
involves a lot of transfer, and can take a "long" time.  Node join
process is going to be optimized in 0.7 and 0.8 (quite drastic changes
in 0.7)

Major compaction and very large normal compaction can take a "long"
time. For example while doing a 200 GB compaction that takes 30
minutes, other sstables build up, more sstables mean "slower" reads.

Achieving a high RAM/DISK ratio may be easier with smaller nodes vs
one big node with 128 GB RAM $$$.

As Jonathan pointed out nothing technically is stopping larger disk
nodes.

(Just wanted to note some of this as I am in the middle of a process
of joining a node now :)



Mime
View raw message