cassandra-commits mailing list archives

From "Stu Hood (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-3003) Trunk single-pass streaming doesn't handle large row correctly
Date Thu, 11 Aug 2011 00:17:27 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-3003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082800#comment-13082800 ]

Stu Hood commented on CASSANDRA-3003:
-------------------------------------

bq. I really think it is not very hard to do 'inline'. We really just want to deserialize,
cleanup, reserialize. It should be super easy to add some "CounterCleanedRow" that does that.
I'm probably missing something, but isn't the problem that this can't be done without two
passes for rows that are too large to fit in memory? And you can't perform two passes without
buffering data somewhere? I suggested removing the cleanup step from streaming because then
the row could be echoed to disk without modification.

bq. It would also be less efficient, because until we have compacted the streamed sstable,
each read will have to call the cleanup over and over
This is true, but compaction is fairly likely to trigger soon after a big batch of streamed
files arrives, since they will trigger compaction thresholds.

> Trunk single-pass streaming doesn't handle large row correctly
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-3003
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-3003
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Yuki Morishita
>            Priority: Critical
>              Labels: streaming
>
> For normal column families, trunk streaming always buffers the whole row into memory. It uses
> {noformat}
>   ColumnFamily.serializer().deserializeColumns(in, cf, true, true);
> {noformat}
> on the input bytes.
> We must avoid this for rows that don't fit in the inMemoryLimit.
> Note that for regular column families, for a given row, there is actually no need to
> even recreate the bloom filter or column index, nor to deserialize the columns. It is enough
> to read the key and row size to feed the index writer, and then simply dump the rest to
> disk directly. This would make streaming more efficient, avoid a lot of object creation and
> avoid the pitfall of big rows.
> Counter column families are unfortunately trickier, because each column needs to be deserialized
> (to mark it as 'fromRemote'). However, we don't need to do the double pass of LazilyCompactedRow
> for that. We can simply use an SSTableIdentityIterator and deserialize/reserialize the input as
> it comes.
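
As an illustration of the pass-through idea described above for regular column families, here is a minimal sketch in plain Java. It is not Cassandra code: the class `RowPassThrough`, the `IndexSink` callback, and the wire layout (a 2-byte key length, the key, an 8-byte row size, then the row body) are all assumptions made for this example. It only shows that, once the key and row size have been read, the row body can be copied with a fixed-size buffer, so memory use does not depend on the row size.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;

public class RowPassThrough {

    /** Hypothetical stand-in for an index writer: it only sees the key and the row size. */
    public interface IndexSink {
        void append(byte[] key, long rowSize) throws IOException;
    }

    /**
     * Copies one row from in to out without deserializing its columns.
     * Assumed layout (illustrative only): [short keyLength][key][long rowSize][row body].
     * Returns the number of row-body bytes copied.
     */
    public static long copyRow(DataInputStream in, DataOutputStream out,
                               IndexSink index, int bufferSize) throws IOException {
        // Read only the header fields needed to feed the index.
        int keyLength = in.readUnsignedShort();
        byte[] key = new byte[keyLength];
        in.readFully(key);
        long rowSize = in.readLong();
        index.append(key, rowSize);

        // Echo the header unchanged.
        out.writeShort(keyLength);
        out.write(key);
        out.writeLong(rowSize);

        // Copy the body in bounded chunks; the whole row is never held in memory.
        byte[] buffer = new byte[bufferSize];
        long remaining = rowSize;
        while (remaining > 0) {
            int n = in.read(buffer, 0, (int) Math.min(buffer.length, remaining));
            if (n < 0) {
                throw new EOFException("stream ended with " + remaining + " row bytes missing");
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
        return rowSize;
    }
}
```

This works only because the bytes are written back unmodified, so the row size read from the header stays valid. A transformation that changes the serialized size of a column would not fit this shape without further work.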

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
