cassandra-commits mailing list archives

From "Arto Bendiken (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CASSANDRA-744) [multi_]get_count should take a SlicePredicate
Date Tue, 11 May 2010 09:54:43 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12866113#action_12866113 ]

Arto Bendiken commented on CASSANDRA-744:
-----------------------------------------

I guess it depends on your use case. We have one where each Cassandra row represents a very
large set, each column name being a 20-byte SHA-1 binary hash identifying an object in that
set and each such column's value being simply the empty string. As I mentioned, we've stored
up to a hundred million columns per row in this manner. As each SHA-1 column takes 35.5 bytes
of space in the SSTables, that's a total of less than 4 gigs of disk storage for a row with
100 million columns. On the big iron we've run this on, these are not _inherently_ infeasible
numbers. The limiting factor is Cassandra's implementation, not the hardware.
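
For concreteness, a minimal sketch of the data model, assuming the 0.7-era Thrift
interface ('SetStore' and 'Sets' are placeholder keyspace/column family names, and
the client boilerplate will differ by version):

    import hashlib
    import time

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra
    from cassandra.ttypes import Column, ColumnParent, ConsistencyLevel

    # Plain Thrift connection to a single node (host/port are placeholders).
    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9160))
    client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()
    client.set_keyspace('SetStore')  # placeholder keyspace

    def add_member(set_key, obj_bytes):
        # One member = one column: the 20-byte SHA-1 digest is the column
        # name, the column value is the empty string.
        digest = hashlib.sha1(obj_bytes).digest()
        col = Column(name=digest, value='', timestamp=int(time.time() * 1e6))
        client.insert(set_key, ColumnParent(column_family='Sets'),
                      col, ConsistencyLevel.QUORUM)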

Counting the number of objects in a given set (i.e. the number of columns in a given row)
is an important operation for us. It's fine for the count to take a while: it is still vastly
faster, by many orders of magnitude, than the infeasible alternative of directly counting the
source data (also stored in Cassandra, but separately) that the set data is derived from.
That would involve (prior to your multiget_count patch, which does alleviate it a little)
performing an individual get_count operation for each of hundreds of millions (soon to be
billions) of distinct source rows.
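
With the SlicePredicate from this ticket, a per-row count would look roughly like the
following (continuing the sketch above; using an unbounded SliceRange with a 2^31 - 1
count as "count everything" is my assumption, not something the patch dictates):

    from cassandra.ttypes import (ColumnParent, ConsistencyLevel,
                                  SlicePredicate, SliceRange)

    def count_members(row_key):
        # Unbounded slice over the whole row; SliceRange.count caps how
        # many columns the server will count, so 2**31 - 1 means "all".
        predicate = SlicePredicate(slice_range=SliceRange(
            start='', finish='', reversed=False, count=2147483647))
        return client.get_count(row_key, ColumnParent(column_family='Sets'),
                                predicate, ConsistencyLevel.ONE)

(Which is, of course, exactly why the issue description warns that counting "everything"
is as bad as slicing it.)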

Now, given the GC and SSTable compaction issues that we've run into with Cassandra 0.6, in
practice we're now manually sharding the larger sets into multiple rows of a size that
Cassandra has fewer issues dealing with (on our hardware, up to 15-20 million columns per row
performs very well).
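
The sharding itself is nothing fancy: route each member to one of N rows by a prefix of
its hash, and sum the per-shard counts. A sketch, where NUM_SHARDS and the 'set:shard'
key format are placeholder conventions (add_member above would then write to the shard
row key instead of the set key):

    import struct

    NUM_SHARDS = 8  # sized so each shard row stays under ~15-20M columns

    def shard_row_key(set_key, digest):
        # Route a member to one of NUM_SHARDS rows by a 4-byte hash prefix.
        shard = struct.unpack('>I', digest[:4])[0] % NUM_SHARDS
        return '%s:%d' % (set_key, shard)

    def count_set(set_key):
        # Total set size = sum of the per-shard counts (count_members above).
        return sum(count_members('%s:%d' % (set_key, s))
                   for s in range(NUM_SHARDS))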

I expect that as Cassandra evolves and issues are fixed, we can keep upping this, and I don't
see anything inherently ridiculous about rows of the size I've mentioned. It seems a little
shortsighted to place incidental limits on the protocol, but then again I suppose the protocol
will have broken backwards compatibility a couple of times by the time I get around to testing
2 billion columns with some future Cassandra 1.x version - so perhaps we can revisit this
in a year or two ;-)


> [multi_]get_count should take a SlicePredicate
> ----------------------------------------------
>
>                 Key: CASSANDRA-744
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-744
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Sylvain Lebresne
>            Priority: Minor
>             Fix For: 0.7
>
>         Attachments: 0001-Add-SlicePredicate-to-get_count.patch, 0002-Add-mutliget_count.patch
>
>
> both to make it more flexible, and to emphasize that counting "everything" is as bad as
> slicing it

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

