I believe one of the reasons is all the metadata. As far as I
understand what you said,
you have 500 million rows, each having only one column. The
problem is that
a row has a bunch of metadata: a bloom filter, a column index, plus a
few other bytes
to store the number of columns, whether the row is marked for deletion, and such.
In your case, the index will have one entry, but this entry includes
the name of
the column twice plus two other longs.
As for the column itself, you said it is 110 bytes, but maybe you
haven't counted the timestamp,
and each column has a flag saying whether it is a tombstone or not.
In the end, I don't know how your column size splits between the
column key and the column value,
but I wouldn't be surprised if the math adds up in the end.
Note that if you had, say, 5 million rows each having 100 columns, you
would have much less
metadata, and I bet you would end up with much less disk used.
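To make the point concrete, here is a rough back-of-envelope sketch. The per-row and per-column overhead numbers below are illustrative assumptions I made up for the example (they are not actual Cassandra constants), but they show how per-row metadata dominates when every row has a single small column:

```python
# Back-of-envelope estimate of on-disk size for two row layouts.
# NOTE: per_row_overhead and per_col_overhead are ASSUMED values for
# illustration only, not Cassandra's real internal sizes.

def estimated_size(rows, cols_per_row, key_bytes=20, col_bytes=110,
                   per_row_overhead=60,   # assumed: bloom filter, column index, counts, delete flag
                   per_col_overhead=15,   # assumed: timestamp (8 bytes) + flags + length fields
                   replication=3):
    per_row = key_bytes + per_row_overhead + cols_per_row * (col_bytes + per_col_overhead)
    return rows * per_row * replication

gb = 1024 ** 3
# 500 million rows with one column each
wide = estimated_size(500_000_000, 1)
# 5 million rows with 100 columns each (same total number of columns)
narrow = estimated_size(5_000_000, 100)
print(f"500M rows x 1 column : {wide / gb:.0f} GB")
print(f"5M rows x 100 columns: {narrow / gb:.0f} GB")
```

With these made-up overheads the one-column layout already comes out well above the naive 195 GB figure, and the 100-column layout comes out noticeably smaller, simply because the fixed per-row metadata is paid 100 times less often.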
On Fri, Apr 30, 2010 at 9:24 AM, Bingbing Liu <email@example.com> wrote:
> i insert 500,000,000 rows, each of which has a key of 20 bytes and a column of 110 bytes.
> and the replication factor is set to 3, so i expect the load of the cluster should be 0.5 billion * 130 * 3 = 195 G bytes.
> but in fact the load i get through "nodetool -h localhost ring" is about 443G.
> i think there is some other additional data such as indexes, checksums, and the column names being stored.
> but am i right? is that all? why is the difference so big?
> hope i have explained my problem clearly
> Bingbing Liu