cassandra-user mailing list archives

From Thomas Richter <...@tricnet.de>
Subject Node OOM, Slice query - missing data?
Date Wed, 02 Nov 2011 23:13:00 GMT
Hi there,

We run a 3-node cluster on Cassandra 0.7.8 with replication factor 3 for 
all keyspaces.

We store external->internal key mappings in a column family with one row 
for each customer. The largest row contains about 200k columns.
When we import external data, we load the whole row and map external to 
internal keys. Loading is done like this:

SliceQuery<String, Key, Mapping> q =
createSliceQuery(
		keyspace,
		getNewStringSerializer(),
		KeySerializer.get(),
		MappingSerializer.get());
q.setColumnFamily(CF_MAPPING);
q.setKey(key);
final int chunkSize = 1000;
Key start = null;
do {
	q.setRange(start, null, false, chunkSize);
	QueryResult<ColumnSlice<Key, Mapping>> r = q.execute();
	final List<HColumn<Key, Mapping>> columns = r.get().getColumns();
	for (final HColumn<Key, Mapping> c : columns) {
		... (add to list)
	}
	if (columns.size() == chunkSize) {
		start = columns.get(columns.size() - 1).getName();
	} else {
		start = null;
	}
} while (start != null);
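
The paging pattern above can be reproduced without a cluster. The following is a minimal sketch, not the production code: `PagingSketch`, `slice` and `readAll` are hypothetical names, a `TreeMap` stands in for the row, and the range start is assumed to be inclusive (as in Cassandra's slice ranges), so the last column of one page comes back as the first column of the next.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class PagingSketch {

    /** Hypothetical stand-in for one slice call: up to `count` column names >= start (inclusive). */
    static List<Integer> slice(NavigableMap<Integer, String> row, Integer start, int count) {
        NavigableMap<Integer, String> tail = (start == null) ? row : row.tailMap(start, true);
        List<Integer> names = new ArrayList<>();
        for (Integer name : tail.keySet()) {
            if (names.size() == count) {
                break;
            }
            names.add(name);
        }
        return names;
    }

    /** Pages through the row like the loop above, skipping the repeated boundary column. */
    static List<Integer> readAll(NavigableMap<Integer, String> row, int chunkSize) {
        if (chunkSize < 2) {
            // With an inclusive start, a page size of 1 would return the same column forever.
            throw new IllegalArgumentException("chunkSize must be at least 2");
        }
        List<Integer> all = new ArrayList<>();
        Integer start = null;
        do {
            List<Integer> page = slice(row, start, chunkSize);
            // On every page after the first, the first column repeats the previous page's last column.
            int from = (start != null && !page.isEmpty() && page.get(0).equals(start)) ? 1 : 0;
            all.addAll(page.subList(from, page.size()));
            // A page shorter than chunkSize ends the loop, as in the original code.
            start = (page.size() == chunkSize) ? page.get(page.size() - 1) : null;
        } while (start != null);
        return all;
    }

    public static void main(String[] args) {
        NavigableMap<Integer, String> row = new TreeMap<>();
        for (int i = 0; i < 2500; i++) {
            row.put(i, "v" + i);
        }
        System.out.println(readAll(row, 1000).size()); // prints 2500
    }
}
```

Note that any page shorter than `chunkSize` ends the loop, so a single incomplete slice result would truncate the whole read.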

The code ran fine for several months. Some days ago the code above 
returned far fewer columns than expected (e.g. 1010 instead of 198k or 
14k instead of 44k).
Is there something wrong with the code?
As a result we created and stored new mappings and now everything is 
fine again.

We realized that we had trouble with one node right before that 
behaviour so we think that's the cause.

The node went down because of OOM, and during restart another OOM killed 
the node again. One or two OOMs later the node started without any 
trouble and all seemed fine. Some hours later the next import process 
ran and then we could not read all the expected data.

Since this happened two days ago, at least one minor compaction has taken 
place, so all SSTables written after the node crash have been merged.

Is this a known issue, or can somebody imagine what the cause might be? 
If we are lucky, we have a backup taken after the crash and before the 
"repair"; if not, I have no ideas left for figuring out what happened.

So any idea about how to dig deeper into this is very welcome.

Best,

Thomas
