cassandra-user mailing list archives

From Ian Soboroff
Subject Scaling problems
Date Tue, 18 May 2010 13:24:30 GMT
I hope this isn't too much of a newbie question.  I am using Cassandra 0.6.1
on a small cluster of Linux boxes - 14 nodes, each with 8GB RAM and 5 data
drives.  The nodes are running HDFS to serve files within the cluster, but
at the moment the rest of Hadoop is shut down.  I'm trying to load a large
set of web pages (the ClueWeb collection, but more is coming) and my
Cassandra daemons keep dying.

I'm loading the pages into a simple column family that lets me fetch out
pages by an internal ID or by URL.  The biggest thing in the row is the page
content, maybe 15-20k per page of raw HTML.  There aren't a lot of columns.
I tried Thrift, Hector, and the BMT interface, and at the moment I'm doing
batch mutations over Thrift, about 2500 pages per batch, because that was
fastest for me in testing.
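
For a sense of scale, the figures above imply a fairly large payload per batch. The sketch below only does that arithmetic; the class and method names are made up for illustration, and it assumes 1k = 1024 bytes and ignores column names, keys, and Thrift serialization overhead. With 2500 pages at 15-20k each, a single batch comes out at roughly 37 to 49 MiB of page content.

```java
// Hypothetical helper, not part of the importer described in this message.
final class BatchSize {
    /** Raw payload of one batch in bytes: pages per batch times bytes per page. */
    static long batchBytes(int pagesPerBatch, int bytesPerPage) {
        return (long) pagesPerBatch * bytesPerPage;
    }

    /** Converts bytes to MiB (1 MiB = 1024 * 1024 bytes). */
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        int pagesPerBatch = 2500;      // from the message
        int lowPage = 15 * 1024;       // "15-20k per page", assuming k = 1024 bytes
        int highPage = 20 * 1024;
        System.out.printf("low:  %.1f MiB per batch%n", toMiB(batchBytes(pagesPerBatch, lowPage)));
        System.out.printf("high: %.1f MiB per batch%n", toMiB(batchBytes(pagesPerBatch, highPage)));
    }
}
```

Each batch is therefore a sizeable fraction of the 256MB memtable threshold mentioned further down, which may be worth keeping in mind when reading the heap symptoms.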

At this point, each Cassandra node has between 500GB and 1.5TB according to
nodetool ring.  Let's say I start the daemons up, and they all go live after
a couple minutes of scanning the tables.  I then start my importer, which is
a single Java process reading ClueWeb bundles over HDFS, cutting them up,
and sending the mutations to Cassandra.  I only talk to one node at a time,
switching to a new node when I get an exception.  As the job runs over a few
hours, the Cassandra daemons eventually fall over, either with no error in
the log or reporting that they are out of heap.
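
The "one node at a time, switch on exception" behaviour can be sketched roughly as follows. This is a minimal illustration and not the actual importer: `NodeClient` is a hypothetical stand-in for whatever Thrift client wrapper sends a batch to a given host, and the choice to try each node at most once per batch is an assumption.

```java
import java.util.List;

// Hypothetical sketch of the failover loop; names are invented for illustration.
final class NodeRotation {
    /** Stand-in for a per-node client; this is not a Cassandra or Thrift API. */
    interface NodeClient {
        void send(String node, byte[] batch) throws Exception;
    }

    private final List<String> nodes;
    private final NodeClient client;
    private int current = 0; // index of the node currently in use

    NodeRotation(List<String> nodes, NodeClient client) {
        if (nodes.isEmpty()) {
            throw new IllegalArgumentException("need at least one node");
        }
        this.nodes = nodes;
        this.client = client;
    }

    /**
     * Sends the batch to the current node. On any exception, moves to the next
     * node and retries, trying each node at most once. Returns the node that
     * accepted the batch; later calls keep using that node.
     */
    String sendWithFailover(byte[] batch) {
        Exception last = null;
        for (int attempt = 0; attempt < nodes.size(); attempt++) {
            String node = nodes.get(current);
            try {
                client.send(node, batch);
                return node;
            } catch (Exception e) {
                last = e;
                current = (current + 1) % nodes.size();
            }
        }
        throw new IllegalStateException("all nodes failed", last);
    }
}
```

One consequence of this pattern is that all writes go through a single coordinator node until it fails, so that node handles every incoming batch for as long as it stays up.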

Each daemon is getting 6GB of RAM and has scads of disk space to play with.
I've set storage-conf.xml to let a memtable reach 256MB before flushing
(like the BMT case), to do batch commit log flushes, and to not have any
caching in the CFs.  I'm sure I must be tuning something wrong.  I would
eventually like this Cassandra setup to serve a light request load, but over,
say, 50-100 TB of data.  I'd appreciate any help or advice you can offer.

