Hi,

I am new to Cassandra, and am hoping someone can help me understand the large number of small data files that Cassandra generates on disk.

We are experimenting with Cassandra because we are dealing with thousands to millions of small text files on disk. Our hope was that by putting the file contents into Cassandra we would get more efficient disk usage, since Cassandra would aggregate them into bigger files (one file per column family, according to the wiki).

But after we pushed a subset of the files into a single-node Cassandra v0.7.0 instance, we noticed that the Cassandra data directory for the keyspace contains 8.5 million very small files, most of them named

    <SuperColumnFamilyName>-e-<nnnnn>.Filter.db
    <SuperColumnFamilyName>-e-<nnnnn>.Compacted.db
    <SuperColumnFamilyName>-e-<nnnnn>.Index.db
    <SuperColumnFamilyName>-e-<nnnnn>.Statistics.db

Among these files, the Compacted.db files are always empty, the Filter.db and Index.db files are under 100 bytes each, and the Statistics.db files are around 4 KB each.
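In case anyone wants to reproduce this kind of breakdown, a tally by file type could be produced with a small script along these lines. This is only a rough sketch: the function name `summarize_components` and the regular expression are my own, the pattern simply follows the file names as listed above (accepting either `.` or `-` before the component name), and the data directory path is passed on the command line because it depends on the install.

```python
import os
import re
import sys
from collections import defaultdict

# Matches names such as "<ColumnFamily>-e-<nnnnn>.Filter.db" as listed above.
# Either "." or "-" is accepted before the component name (an assumption,
# to be tolerant of naming differences).
COMPONENT_RE = re.compile(r"^.+-e-\d+[-.](?P<component>[A-Za-z]+)\.db$")


def summarize_components(data_dir):
    """Return {component: (file_count, total_bytes)} for one directory.

    Files whose names do not match the pattern are grouped under "other".
    """
    counts = defaultdict(int)
    sizes = defaultdict(int)
    # os.scandir yields entries lazily, which matters with millions of files.
    with os.scandir(data_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            match = COMPONENT_RE.match(entry.name)
            component = match.group("component") if match else "other"
            counts[component] += 1
            sizes[component] += entry.stat().st_size
    return {name: (counts[name], sizes[name]) for name in counts}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python tally.py /path/to/keyspace/data/dir
    summary = summarize_components(sys.argv[1])
    for component, (count, total) in sorted(summary.items()):
        print(f"{component:12s} {count:10d} files {total:14d} bytes")
```

It only reads directory entries and file sizes, so it should be safe to run against a live data directory.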

What are these files, and why are there so many of them? We originally hoped that Cassandra would solve our small-file problem, but so far it doesn't seem to help: we still end up with tons of small files. Is there any way to reduce or combine these small files?

Thanks.

-- Y.