incubator-cassandra-user mailing list archives

From Scott Fines
Subject sstable2json weirdness
Date Fri, 30 Sep 2011 15:29:35 GMT
Hi all,

I've been messing with sstable2json as a means of mass-exporting some data (mainly for backups,
but also for some convenience trickery on an individual node's data). However, I've run into
a situation where sstable2json appears to be dumping out TONS of duplicate columns for a single

For example, for a single key, I did

$CASSANDRA_HOME/bin/sstable2json <sstable> -k <key> > output.file

which ran until I killed it manually. Then I executed

cat output.file | sed 's/]/\n/g' | wc -l

which gave me 40 million and some change. On the other hand,

cat output.file | sed 's/]/\n/g' | sort -n | uniq | wc -l

gave me around 10K (much closer to reality).
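
For reference, both counts can be taken in a single pass without sorting the whole file. This is only a minimal sketch under stated assumptions: the `sample` string is made-up stand-in data (not real sstable2json output), the leading-comma stripping is an added convenience so repeated entries compare equal, and `tr` is used for the split because `\n` in a sed replacement is a GNU extension.

```shell
# Hypothetical sample standing in for output.file; real sstable2json output will differ.
sample='["a","1",10],["b","2",11],["a","1",10],["a","1",10],["c","3",12]'

# Split on "]" so each column fragment lands on its own line, then let awk
# count every non-empty fragment (total) and every first-seen fragment (distinct).
counts=$(printf '%s' "$sample" \
  | tr ']' '\n' \
  | awk '{ sub(/^,/, "") }                      # drop the separator comma left by the split
         NF { total++; if (!seen[$0]++) distinct++ }
         END { print total, distinct }')

echo "$counts"   # prints "5 3" for the sample above
```

The awk array holds one entry per distinct fragment, so memory grows with the number of unique columns rather than with the number of lines, which matters when the input has tens of millions of mostly repeated lines.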

For my particular data set, the total size of any given row cannot exceed 80K columns. So
I'm wondering: Is this normal behavior for sstable2json? Assuming that it is, is there any
way in which I can massage sstable2json into not emitting duplicates? These duplicates eat
a great deal of disk space and processing power to manipulate, which I'd like to avoid.

Thanks for your help,

