cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (CASSANDRA-4258) Are we sorting the bloom filters in memory to increase the probability of getting proper result instead of just avoiding the false positive?
Date Fri, 18 May 2012 15:25:10 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-4258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis resolved CASSANDRA-4258.
---------------------------------------

    Resolution: Not A Problem
      Assignee:     (was: Jonathan Ellis)

BF is just a way to eliminate sstables from consideration that *don't* have the row we're
interested in.  We already order sstable checks by relevance (see CASSANDRA-2498).  Re-sorting
just the BFs would be at best worthless and at worst cause unnecessary work.
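
As an illustration of the read path described above, here is a minimal sketch, not Cassandra's actual code. The `SSTable` class, `might_contain`, and `read_columns` are hypothetical names, the filter is modelled as an exact set lookup, and "relevance" is assumed to mean newest-first by maximum timestamp.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    """Hypothetical stand-in for an on-disk sstable (not Cassandra's class)."""
    name: str
    max_timestamp: int
    rows: dict = field(default_factory=dict)  # row key -> {column: value}

    def might_contain(self, key):
        # Stand-in for a bloom filter check. A real filter can also return
        # True for an absent key (false positive) but never False for a
        # present one; this exact lookup models the ideal case.
        return key in self.rows


def read_columns(sstables, key, wanted):
    """Return (found columns, number of sstables actually read)."""
    # Step 1: the filter only removes sstables that cannot hold the key.
    candidates = [s for s in sstables if s.might_contain(key)]
    # Step 2: check the survivors newest-first (assumed relevance order).
    candidates.sort(key=lambda s: s.max_timestamp, reverse=True)

    found, reads = {}, 0
    for sstable in candidates:
        if wanted.issubset(found):
            break  # every requested column is already resolved
        reads += 1
        for column, value in sstable.rows[key].items():
            if column in wanted and column not in found:
                found[column] = value
    return found, reads
```

In this sketch the filter step decides only which sstables are skipped entirely; the order in which the remaining sstables are read is a separate step, so reordering the filters themselves would not change the number of reads.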
                
> Are we sorting the bloom filters in memory to increase the probability of getting proper result instead of just avoiding the false positive?
> --------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-4258
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-4258
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.1.1
>            Reporter: Samarth Gahire
>            Priority: Minor
>              Labels: bloom-filter, read
>             Fix For: 1.1.1
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> I was wondering whether there is any logic for deciding which bloom filter should be checked first, so as to increase the probability of finding the result rather than just minimizing the probability of a false positive.
> (*Note:* I have looked at the code, and I am not talking about *"Getting BloomFilter with the lowest practical false positive probability"* or *"Getting smallest BloomFilter that can provide the given false positive probability rate for the given number of elements."*)
> *Consider the following scenario:*
> 1) In our Cassandra cluster we insert 130 million rows daily into a single column family, and in practice we cannot keep this data compacted at all times. (Loading takes a long time, and compaction could take long enough to affect the schedule for loading the next day's data.)
> 2) We insert the same row keys every day (the row key values of all 130 million rows are the same), each with a different supercolumn.
> {code}
> For date 20120101 we have
> super_CF= {row_1:{_super_column_20120101:{ col1 : val1, col2 : val2 }}
>            row_2:{_super_column_20120101:{ col1 : val3, col2 : val4 }}
>            row_3:{_super_column_20120101:{ col1 : val5, col2 : val6 }}
> } 
> and for date 20120102 it will be like
> super_CF= {row_1:{_super_column_20120102:{ col1 : val7, col2 : val8 }}
>            row_2:{_super_column_20120102:{ col1 : val9, col2 : val10 }}
>            row_3:{_super_column_20120102:{ col1 : val11, col2 : val12 }}
> } 
> Note that the set of row keys is the same for all days; only the supercolumn changes.
> {code}
> 3) So if we do not compact the data for, say, 30 days, each row key is present in 30 different sstables.
> 4) So in the worst case, even with zero probability of a false positive, there could be 30 unnecessary disk accesses.
> 5) Because of this scenario we are experiencing extremely degraded read performance.
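
To make the scenario above concrete, here is a minimal sketch under stated assumptions. The `BloomFilter` class is a toy written for illustration (SHA-256-derived positions, one byte per bit), not Cassandra's implementation, and `sstables_to_read` is a hypothetical helper. It shows that when the same row key has been written to each of 30 uncompacted sstables, none of the 30 filters can rule that key out.

```python
import hashlib


class BloomFilter:
    """Minimal illustrative Bloom filter (not Cassandra's implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def sstables_to_read(daily_filters, key):
    """Count the sstables whose filter cannot rule the key out."""
    return sum(1 for bf in daily_filters if bf.might_contain(key))


# 30 uncompacted daily sstables, each holding the same row keys.
row_keys = [f"row_{i}" for i in range(1, 4)]
daily_filters = []
for _day in range(30):
    bf = BloomFilter()
    for key in row_keys:
        bf.add(key)
    daily_filters.append(bf)

print(sstables_to_read(daily_filters, "row_1"))  # 30: no filter can exclude the key
```

A Bloom filter never reports a stored key as absent, so the count is 30 regardless of the false positive rate chosen for the filters.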

> *Proposed solution:*
> 1) We could sort the bloom filters so that the bloom filter of an sstable that successfully served a read request gets higher priority than the other bloom filters.
> That is, we would first check the bloom filter of the sstable that was most recently accessed and that successfully returned the requested columns (an MRU approach, since the probability of getting the result from the MRU sstable is greater). This way we can reduce disk accesses.
> 2) The point is that we should have some logic for sorting bloom filters to boost read performance when sstables are not yet compacted.
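
The MRU ordering proposed above could be sketched as follows. This illustrates only the reporter's idea and is not something Cassandra implements; `MruSSTableOrder`, its methods, and the sstable names are all hypothetical.

```python
from collections import OrderedDict


class MruSSTableOrder:
    """Hypothetical tracker for the proposed MRU ordering (not Cassandra code)."""

    def __init__(self, sstable_names):
        # Keys are kept in check order: the first key is checked first.
        self._order = OrderedDict((name, None) for name in sstable_names)

    def check_order(self):
        return list(self._order)

    def record_hit(self, name):
        # Promote the sstable that just served a read to the front.
        self._order.move_to_end(name, last=False)


order = MruSSTableOrder(["sstable_20120101", "sstable_20120102", "sstable_20120103"])
order.record_hit("sstable_20120103")
print(order.check_order())
# ['sstable_20120103', 'sstable_20120101', 'sstable_20120102']
```

An ordering like this changes only which sstable is tried first. It saves disk accesses only if the read can stop before visiting the remaining sstables that also contain the key.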
