cassandra-user mailing list archives

From Anishek Agarwal <anis...@gmail.com>
Subject Re: High Bloom filter false ratio
Date Tue, 23 Feb 2016 08:37:29 GMT
Looks like sstablemetadata is available in 2.2; we are on 2.0.x. Do you
know of anything that will work on 2.0.x?

On Tue, Feb 23, 2016 at 1:48 PM, Anishek Agarwal <anishek@gmail.com> wrote:

> Thanks Jeff, awesome. Will look at the tools and the JMX endpoint.
>
> Our settings are below, using the JIRA you posted above as the base. We
> are running on 48-core machines with 2 SSD disks of 800 GB each.
>
> MAX_HEAP_SIZE="6G"
>
> HEAP_NEWSIZE="4G"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6"
>
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=4"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=70"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseTLAB"
>
> JVM_OPTS="$JVM_OPTS -XX:MaxPermSize=256m"
>
> JVM_OPTS="$JVM_OPTS -XX:+AggressiveOpts"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>
> JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
>
> JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=48"
>
> JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=48"
>
> JVM_OPTS="$JVM_OPTS -XX:-ExplicitGCInvokesConcurrent"
>
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
>
> JVM_OPTS="$JVM_OPTS -XX:+UseGCTaskAffinity"
>
> JVM_OPTS="$JVM_OPTS -XX:+BindGCTaskThreadsToCPUs"
>
> # earlier value 131072
>
> JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=32678"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSScheduleRemarkEdenSizeThreshold=104857600"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSRescanMultiple=32678"
>
> JVM_OPTS="$JVM_OPTS -XX:CMSConcMarkMultiple=32678"
>
>
> On Tue, Feb 23, 2016 at 1:06 PM, Jeff Jirsa <jeff.jirsa@crowdstrike.com>
> wrote:
>
>> There exists a JMX endpoint called forceUserDefinedCompaction that takes
>> a comma separated list of sstables to compact together.
>>
>> There also exists a tool called sstablemetadata (may be in a
>> ‘cassandra-tools’ package separate from whatever package you used to
>> install cassandra, or in the tools/ directory of your binary package).
>> Using sstablemetadata, you can look at the maxTimestamp for each sstable, and
>> the ‘Estimated droppable tombstones’. Using those two fields, you could,
>> very easily, write a script that gives you a list of sstables that you
>> could feed to forceUserDefinedCompaction to join together to eliminate
>> leftover waste.
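>>
>> A minimal sketch of such a script (all assumptions: that sstablemetadata
>> is on the PATH, that the data directory matches the placeholder glob
>> below, and that the field labels match your version's output -- verify
>> before trusting it):
>>
>>     #!/usr/bin/env python
>>     # Sketch: print sstables whose newest data is past the 30-day TTL and
>>     # whose droppable-tombstone estimate is high, as a candidate list for
>>     # forceUserDefinedCompaction (invoked separately via a JMX client).
>>     import glob, re, subprocess, time
>>
>>     TTL_SECONDS = 30 * 24 * 3600
>>     DATA_GLOB = '/var/lib/cassandra/data/ks/cf/*-Data.db'  # placeholder
>>
>>     candidates = []
>>     for path in glob.glob(DATA_GLOB):
>>         out = subprocess.check_output(['sstablemetadata', path]).decode()
>>         max_ts = re.search(r'Maximum timestamp:\s*(\d+)', out)
>>         drop = re.search(r'Estimated droppable tombstones:\s*([\d.]+)', out)
>>         if not (max_ts and drop):
>>             continue
>>         # write timestamps are microseconds since epoch by convention
>>         age = time.time() - int(max_ts.group(1)) / 1e6
>>         if age > TTL_SECONDS and float(drop.group(1)) > 0.9:
>>             candidates.append(path)
>>
>>     # forceUserDefinedCompaction takes a comma-separated sstable list
>>     print(','.join(candidates))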
>>
>> Your long ParNew times may be fixable by increasing the new gen size of
>> your heap – the general guidance in cassandra-env.sh is out of date, you
>> may want to reference CASSANDRA-8150 for “newer” advice (
>> http://issues.apache.org/jira/browse/CASSANDRA-8150 )
>>
>> - Jeff
>>
>> From: Anishek Agarwal
>> Reply-To: "user@cassandra.apache.org"
>> Date: Monday, February 22, 2016 at 8:33 PM
>>
>> To: "user@cassandra.apache.org"
>> Subject: Re: High Bloom filter false ratio
>>
>> Hey Jeff,
>>
>> Thanks for the clarification; I did not explain myself clearly. max_sstable_age_days
>> is set to 30 days, and the ttl on every insert is set to 30 days too,
>> by default. gc_grace_seconds is 0, so I would think the sstable as a whole
>> would be deleted.
>>
>> Because of the problem mentioned in 1) above, it looks like there
>> might be cases where an sstable just lies around: since no compaction is
>> happening on it, even though everything in it has expired, it would still
>> not be deleted?
>>
>> For 3), the average read is pretty good, though the throughput doesn't
>> seem to be that great. When no repair is running we get GCInspector pauses
>> over 200 ms once every couple of hours; otherwise it's every 10-20 minutes.
>>
>> INFO [ScheduledTasks:1] 2016-02-23 05:15:03,070 GCInspector.java (line
>> 116) GC for ParNew: 205 ms for 1 collections, 1712439128 used; max is
>> 7784628224
>>
>>  INFO [ScheduledTasks:1] 2016-02-23 08:30:47,709 GCInspector.java (line
>> 116) GC for ParNew: 242 ms for 1 collections, 1819126928 used; max is
>> 7784628224
>>
>>  INFO [ScheduledTasks:1] 2016-02-23 09:09:55,085 GCInspector.java (line
>> 116) GC for ParNew: 374 ms for 1 collections, 1829660304 used; max is
>> 7784628224
>>
>>  INFO [ScheduledTasks:1] 2016-02-23 09:11:21,245 GCInspector.java (line
>> 116) GC for ParNew: 419 ms for 1 collections, 2309875224 used; max is
>> 7784628224
>>
>>  INFO [ScheduledTasks:1] 2016-02-23 09:35:50,717 GCInspector.java (line
>> 116) GC for ParNew: 231 ms for 1 collections, 2515325328 used; max is
>> 7784628224
>>
>>  INFO [ScheduledTasks:1] 2016-02-23 09:38:47,194 GCInspector.java (line
>> 116) GC for ParNew: 252 ms for 1 collections, 1724241952 used; max is
>> 7784628224
>>
>>
>> Our read patterns depend on the bloom filter working efficiently, as we do a
>> lot of reads for keys that may not exist: the data is time series, and
>> we segregate it on hourly boundaries from epoch.
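>>
>> For concreteness, that bucketing amounts to something like the sketch
>> below (names and values are illustrative only, not our actual schema):
>>
>>     # hours since epoch, usable as one component of a composite partition key
>>     epoch_seconds = 1449725602           # example write time
>>     hour_bucket = epoch_seconds // 3600  # -> 402701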
>>
>>
>> Hey Christopher,
>>
>> Yes, every row in the sstable that should have been deleted has "d" in that
>> column. Also, the key for one of the rows is:
>>
>> "key": "0008000000000cdd5edd000008000000000006251000"
>>
>>
>>
>> How do I get it back into a readable format to recover the (long,long)
>> composite partition key?
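>>
>> (For what it's worth: a CompositeType key is serialized as repeated groups
>> of a 2-byte big-endian length, the component bytes, and one end-of-component
>> byte, so a few lines of Python can unpack it -- a sketch, not official
>> tooling:)
>>
>>     import struct
>>
>>     hexkey = '0008000000000cdd5edd000008000000000006251000'
>>     buf = bytes.fromhex(hexkey)
>>
>>     parts, i = [], 0
>>     while i < len(buf):
>>         (length,) = struct.unpack_from('>H', buf, i)     # component length (8)
>>         (value,) = struct.unpack_from('>q', buf, i + 2)  # big-endian long
>>         parts.append(value)
>>         i += 2 + length + 1                              # skip EOC byte
>>     print(parts)  # -> [215834333, 402704]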
>>
>> Looks like I have to force a major compaction to delete a lot of data?
>> Are there any other solutions?
>>
>> thanks
>> anishek
>>
>>
>>
>> On Mon, Feb 22, 2016 at 11:21 PM, Jeff Jirsa <jeff.jirsa@crowdstrike.com>
>> wrote:
>>
>>> 1) getFullyExpiredSSTables in 2.0 isn’t as thorough as many expect, so
>>> it’s very likely that some sstables stick around longer than you expect.
>>>
>>> 2) max_sstable_age_days tells cassandra when to stop compacting that
>>> file, not when to delete it.
>>>
>>> 3) You can change the window size using both the base_time_seconds
>>> parameter and max_sstable_age_days parameter (use the former to set the
>>> size of the first window, and the latter to determine how long before you
>>> stop compacting that window). It’s somewhat non-intuitive.
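>>>
>>> For example, something along these lines (a sketch via the Python driver;
>>> the option values and keyspace/table names are placeholders, and the DTCS
>>> option names should be checked against your exact version):
>>>
>>>     from cassandra.cluster import Cluster
>>>
>>>     session = Cluster(['127.0.0.1']).connect('ks')
>>>     # First window one hour wide; stop compacting files older than 30 days.
>>>     session.execute("""
>>>         ALTER TABLE user_stay_points WITH compaction = {
>>>             'class': 'DateTieredCompactionStrategy',
>>>             'base_time_seconds': '3600',
>>>             'max_sstable_age_days': '30'
>>>         }
>>>     """)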
>>>
>>> Your read latencies actually look pretty reasonable, are you sure you’re
>>> not simply hitting GC pauses that cause your queries to run longer than you
>>> expect? Do you have graphs of GC time (first derivative of total gc time is
>>> common for tools like graphite), or do you see ‘gcinspector’ in your logs
>>> indicating pauses > 200ms?
>>>
>>> From: Anishek Agarwal
>>> Reply-To: "user@cassandra.apache.org"
>>> Date: Sunday, February 21, 2016 at 11:13 PM
>>> To: "user@cassandra.apache.org"
>>> Subject: Re: High Bloom filter false ratio
>>>
>>> Hey guys,
>>>
>>> Just did some more digging ... looks like DTCS is not removing old data
>>> completely. I used sstable2json for one such table and saw old data there.
>>> We have a value of 30 for max_sstable_age_days on the table.
>>>
>>> One of the columns showed data as ["2015-12-10 11\\:03+0530:",
>>> "56690ea2", 1449725602552000, "d"]. What is the meaning of "d" in the last
>>> IS_MARKED_FOR_DELETE column?
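>>>
>>> (A quick tally of live vs. deleted cells in an sstable2json dump can be
>>> sketched as below, assuming 2.0's JSON layout where a cell is
>>> [name, value, timestamp, flag, ...] and the flag "d" marks a deletion:)
>>>
>>>     import json, sys
>>>
>>>     rows = json.load(open(sys.argv[1]))  # file produced by sstable2json
>>>     live = dead = 0
>>>     for row in rows:
>>>         for cell in row.get('columns', []):
>>>             if len(cell) > 3 and cell[3] == 'd':
>>>                 dead += 1
>>>             else:
>>>                 live += 1
>>>     print('live=%d deleted=%d' % (live, dead))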
>>>
>>> I see data from 10 Dec 2015 still there. Looks like there are a few
>>> issues with DTCS. Operationally, what choices do I have to rectify this? We
>>> are on version 2.0.15.
>>>
>>> thanks
>>> anishek
>>>
>>>
>>>
>>>
>>> On Mon, Feb 22, 2016 at 10:23 AM, Anishek Agarwal <anishek@gmail.com>
>>> wrote:
>>>
>>>> We are using DTCS with a 30-day window before sstables are cleaned
>>>> up. I don't think with DTCS we can do anything about sstable sizing. Please
>>>> do let me know if there are other ideas.
>>>>
>>>> On Sat, Feb 20, 2016 at 12:51 AM, Jaydeep Chovatia <
>>>> chovatia.jaydeep@gmail.com> wrote:
>>>>
>>>>> To me the following three look on the higher side:
>>>>> SSTable count: 1289
>>>>>
>>>>> To reduce the SSTable count, see whether you are compacting or not (if
>>>>> using STCS). Is it possible to change this to LCS?
>>>>>
>>>>>
>>>>> Number of keys (estimate): 345137664 (345M partition keys)
>>>>>
>>>>> I don't have any suggestion about reducing this unless you partition
>>>>> your data.
>>>>>
>>>>>
>>>>> Bloom filter space used, bytes: 493777336 (~470 MB is huge)
>>>>>
>>>>> If the number of keys is reduced, then this will automatically reduce the
>>>>> bloom filter size, I believe.
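>>>>>
>>>>> As a back-of-the-envelope check (ignoring Cassandra's bucket rounding),
>>>>> the standard sizing formula m = -n*ln(p)/(ln 2)^2 bits ties the key
>>>>> count to the filter size:
>>>>>
>>>>>     import math
>>>>>
>>>>>     n = 345137664  # "Number of keys (estimate)" from cfstats
>>>>>     p = 0.01       # bloom_filter_fp_chance
>>>>>     bits = -n * math.log(p) / math.log(2) ** 2
>>>>>     print(bits / 8 / 1024 ** 2)  # ~394 MB, same ballpark as the ~471 MB used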
>>>>>
>>>>>
>>>>>
>>>>> Jaydeep
>>>>>
>>>>> On Thu, Feb 18, 2016 at 7:52 PM, Anishek Agarwal <anishek@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> @Jaydeep here is the cfstats output from one node.
>>>>>>
>>>>>> Read Count: 1721134722
>>>>>>
>>>>>> Read Latency: 0.04268825050756254 ms.
>>>>>>
>>>>>> Write Count: 56743880
>>>>>>
>>>>>> Write Latency: 0.014650376727851532 ms.
>>>>>>
>>>>>> Pending Tasks: 0
>>>>>>
>>>>>> Table: user_stay_points
>>>>>>
>>>>>> SSTable count: 1289
>>>>>>
>>>>>> Space used (live), bytes: 122141272262
>>>>>>
>>>>>> Space used (total), bytes: 224227850870
>>>>>>
>>>>>> Off heap memory used (total), bytes: 653827528
>>>>>>
>>>>>> SSTable Compression Ratio: 0.4959736121441446
>>>>>>
>>>>>> Number of keys (estimate): 345137664
>>>>>>
>>>>>> Memtable cell count: 339034
>>>>>>
>>>>>> Memtable data size, bytes: 106558314
>>>>>>
>>>>>> Memtable switch count: 3266
>>>>>>
>>>>>> Local read count: 1721134803
>>>>>>
>>>>>> Local read latency: 0.048 ms
>>>>>>
>>>>>> Local write count: 56743898
>>>>>>
>>>>>> Local write latency: 0.018 ms
>>>>>>
>>>>>> Pending tasks: 0
>>>>>>
>>>>>> Bloom filter false positives: 40664437
>>>>>>
>>>>>> Bloom filter false ratio: 0.69058
>>>>>>
>>>>>> Bloom filter space used, bytes: 493777336
>>>>>>
>>>>>> Bloom filter off heap memory used, bytes: 493767024
>>>>>>
>>>>>> Index summary off heap memory used, bytes: 91677192
>>>>>>
>>>>>> Compression metadata off heap memory used, bytes: 68383312
>>>>>>
>>>>>> Compacted partition minimum bytes: 104
>>>>>>
>>>>>> Compacted partition maximum bytes: 1629722
>>>>>>
>>>>>> Compacted partition mean bytes: 1773
>>>>>>
>>>>>> Average live cells per slice (last five minutes): 0.0
>>>>>>
>>>>>> Average tombstones per slice (last five minutes): 0.0
>>>>>>
>>>>>>
>>>>>> @Tyler Hobbs
>>>>>>
>>>>>> we are using cassandra 2.0.15, so
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-8525 shouldn't occur.
>>>>>> The other problems look like they will be fixed in 3.0; we will mostly try
>>>>>> to slot in an upgrade to a 3.x version towards the second quarter of this
>>>>>> year.
>>>>>>
>>>>>>
>>>>>> @Daemeon
>>>>>>
>>>>>> Latencies seem to have higher ratios, attached is the graph.
>>>>>>
>>>>>>
>>>>>> I am mostly trying to look at bloom filters because of the way we do
>>>>>> reads: we read data with non-existent partition keys, and it seems to be
>>>>>> taking long to respond. For example, 720 queries take 2 seconds, with all
>>>>>> 720 queries returning nothing. The 720 queries are done as four sequential
>>>>>> batches of 180, with the 180 queries in each batch running in parallel
>>>>>> (so roughly 500 ms per parallel batch of 180).
>>>>>>
>>>>>>
>>>>>> thanks
>>>>>>
>>>>>> anishek
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 19, 2016 at 3:09 AM, Jaydeep Chovatia <
>>>>>> chovatia.jaydeep@gmail.com> wrote:
>>>>>>
>>>>>>> How many partition keys exist for the table that shows this
>>>>>>> problem (or provide nodetool cfstats for that table)?
>>>>>>>
>>>>>>> On Thu, Feb 18, 2016 at 11:38 AM, daemeon reiydelle <
>>>>>>> daemeonr@gmail.com> wrote:
>>>>>>>
>>>>>>>> The bloom filter buckets the values into a small number of buckets. I
>>>>>>>> have been surprised by how many cases I see with large cardinality where a
>>>>>>>> few values populate a given bloom leaf, resulting in high false positives
>>>>>>>> and a surprising impact on latencies!
>>>>>>>>
>>>>>>>> Are you seeing 2:1 ranges between mean and worst-case latencies
>>>>>>>> (allowing for gc times)?
>>>>>>>>
>>>>>>>> Daemeon Reiydelle
>>>>>>>> On Feb 18, 2016 8:57 AM, "Tyler Hobbs" <tyler@datastax.com> wrote:
>>>>>>>>
>>>>>>>>> You can try slightly lowering the bloom_filter_fp_chance on your
>>>>>>>>> table.
>>>>>>>>>
>>>>>>>>> Otherwise, it's possible that you're repeatedly querying one or
>>>>>>>>> two partitions that always trigger a bloom filter false positive.  You
>>>>>>>>> could try manually tracing a few queries on this table (for non-existent
>>>>>>>>> partitions) to see if the bloom filter rejects them.
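>>>>>>>>>
>>>>>>>>> With the Python driver, tracing could look roughly like this (the
>>>>>>>>> schema and address are placeholders):
>>>>>>>>>
>>>>>>>>>     from cassandra.cluster import Cluster
>>>>>>>>>
>>>>>>>>>     session = Cluster(['127.0.0.1']).connect('ks')
>>>>>>>>>     rs = session.execute(
>>>>>>>>>         "SELECT * FROM user_stay_points WHERE key1=123 AND key2=456",
>>>>>>>>>         trace=True)
>>>>>>>>>     # Look for events noting sstables skipped via the bloom filter.
>>>>>>>>>     for event in rs.get_query_trace().events:
>>>>>>>>>         print(event.description)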
>>>>>>>>>
>>>>>>>>> Depending on your Cassandra version, your false positive ratio
>>>>>>>>> could be inaccurate:
>>>>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-8525
>>>>>>>>>
>>>>>>>>> There are also a couple of recent improvements to bloom filters:
>>>>>>>>> * https://issues.apache.org/jira/browse/CASSANDRA-8413
>>>>>>>>> * https://issues.apache.org/jira/browse/CASSANDRA-9167
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Feb 18, 2016 at 1:35 AM, Anishek Agarwal <
>>>>>>>>> anishek@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> We have a table with a composite partition key of humongous
>>>>>>>>>> cardinality; it's a combination of (long,long). On the table we have
>>>>>>>>>> bloom_filter_fp_chance=0.010000.
>>>>>>>>>>
>>>>>>>>>> On doing "nodetool cfstats" on the 5 nodes we have in the cluster,
>>>>>>>>>> we are seeing "Bloom filter false ratio:" in the range of 0.7-0.9.
>>>>>>>>>>
>>>>>>>>>> I thought over time the bloom filter would adjust to the key
>>>>>>>>>> space cardinality. We have been running the cluster for a long time now,
>>>>>>>>>> but have added significant traffic from Jan this year, which does not add
>>>>>>>>>> writes to the db but does add a lot of reads checking whether any values
>>>>>>>>>> exist.
>>>>>>>>>>
>>>>>>>>>> Are there any settings that can be changed to get a better ratio?
>>>>>>>>>>
>>>>>>>>>> Thanks
>>>>>>>>>> Anishek
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Tyler Hobbs
>>>>>>>>> DataStax <http://datastax.com/>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
