cassandra-user mailing list archives

From Yiming Sun <yiming....@gmail.com>
Subject 8 million Cassandra data files on disk
Date Tue, 02 Aug 2011 20:09:54 GMT
Hi,

I am new to Cassandra and am hoping someone can help me understand the
large number of small data files that Cassandra generates on disk.

We are dealing with thousands to millions of small text files on disk, so
we are experimenting with Cassandra in the hope that loading the file
contents into it will give more efficient disk usage, since Cassandra is
supposed to aggregate them into bigger files (one file per column family,
according to the wiki).

But after we pushed a subset of the files into a single-node Cassandra
v0.7.0 instance, we noticed that the data directory for the keyspace
contains 8.5 million very small files, most of them named

    <SuperColumnFamilyName>-e-<nnnnn>.Filter.db
    <SuperColumnFamilyName>-e-<nnnnn>.Compacted.db
    <SuperColumnFamilyName>-e-<nnnnn>.Index.db
    <SuperColumnFamilyName>-e-<nnnnn>.Statistics.db

Among these files, the Compacted.db files are always empty, the Filter and
Index files are under 100 bytes, and the Statistics files are around 4 KB.
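
A breakdown like the one above can be reproduced with a short script. This
is only a minimal Python sketch, not a Cassandra tool: the function name
tally_components is made up for illustration, and it assumes the component
name (Filter, Index, ...) is the last alphabetic word of each file name,
optionally followed by ".db".

```python
import os
import re
from collections import defaultdict

# Assumption: the component is the last run of letters in the file name,
# optionally followed by ".db" (e.g. "...-e-12.Filter.db" -> "Filter").
COMPONENT_RE = re.compile(r"([A-Za-z]+)(\.db)?$")


def tally_components(data_dir):
    """Return {component: (file_count, total_bytes)} for one directory."""
    counts = defaultdict(int)
    sizes = defaultdict(int)
    # os.scandir avoids building a full list of names in memory, which
    # matters when the directory holds millions of entries.
    with os.scandir(data_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            match = COMPONENT_RE.search(entry.name)
            component = match.group(1) if match else "(other)"
            counts[component] += 1
            sizes[component] += entry.stat().st_size
    return {name: (counts[name], sizes[name]) for name in counts}
```

It would be called on the keyspace's data directory, for example
tally_components("/var/lib/cassandra/data/MyKeyspace"), where the path and
keyspace name are placeholders for whatever the node is configured to use.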

What are these files, and why are there so many of them? We originally
hoped that Cassandra would solve our small-files problem, but now it
doesn't seem to help: we still end up with tons of small files. Is there
any way to reduce or combine these small files?

Thanks.

-- Y.
