cassandra-commits mailing list archives

From "Jack Krupansky (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11383) SASI index build leads to massive OOM
Date Sat, 19 Mar 2016 17:36:33 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202876#comment-15202876 ]

Jack Krupansky commented on CASSANDRA-11383:
--------------------------------------------

The int field could easily be made a text field if that would make SASI work better (you could
then even run prefix queries by year).
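
As a rough illustration of the prefix idea, here is a minimal Python sketch. It assumes the
end-of-month values would be stored as text in YYYYMMDD form, which is only a guess at a
suitable format; the sample values and the prefix_match helper are hypothetical and merely
mimic what a LIKE '2016%' style prefix match would select.

```python
# Hypothetical end-of-month values stored as text in YYYYMMDD form (an assumed format).
end_of_month = ["20151231", "20160131", "20160229", "20160331", "20170131"]


def prefix_match(values, prefix):
    """Return the values that start with the given prefix (like LIKE 'prefix%')."""
    return [v for v in values if v.startswith(prefix)]


# Selecting a whole year becomes a plain prefix match on the text value.
print(prefix_match(end_of_month, "2016"))
```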

Point 1 is precisely what SASI SPARSE mode is designed for. It is also what Materialized Views
(formerly Global Indexes) are for, and an MV is an even better fit because it eliminates the need to
scan multiple nodes: the rows are collected under a new partition key that can include
the indexed data value.

You're using cardinality backwards: it is supposed to be a measure of the number of distinct
values in a column, not the number of rows containing each value. See: https://en.wikipedia.org/wiki/Cardinality_%28SQL_statements%29
Granted, in an ERD, cardinality is the count of rows in a second table for each column value in
a given table (one to n, n to one, etc.), but in the context of an index there is only one
table involved. You could consider the index itself to be a table, but that would be a little
odd. In any case, it is best to stick with the standard SQL meaning: the cardinality of data values
in a column. So, to be clear, an email address is high cardinality and gender is low cardinality.
And the end-of-month int field is low cardinality, or not dense in the original SASI doc terminology.
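
To make the two measures concrete, here is a small Python sketch. The sample columns are made
up for illustration and are not taken from the reporter's data set; the function names are
hypothetical. It contrasts cardinality in the SQL sense (distinct values in a column) with the
number of rows sharing each value, which is the quantity that was being called cardinality.

```python
from collections import Counter


def cardinality(values):
    """Number of distinct values in a column (the SQL sense of cardinality)."""
    return len(set(values))


def rows_per_value(values):
    """Average number of rows that share each distinct value."""
    counts = Counter(values)
    return len(values) / len(counts)


# Hypothetical sample columns, 1000 rows each.
emails = [f"user{i}@example.com" for i in range(1000)]
genders = ["F" if i % 2 else "M" for i in range(1000)]

# High cardinality: many distinct values, few rows per value.
print(cardinality(emails), rows_per_value(emails))
# Low cardinality: few distinct values, many rows per value.
print(cardinality(genders), rows_per_value(genders))
```

The two numbers move in opposite directions for a fixed row count, which is why swapping the
terms leads to confusion.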

> SASI index build leads to massive OOM
> -------------------------------------
>
>                 Key: CASSANDRA-11383
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11383
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: C* 3.4
>            Reporter: DOAN DuyHai
>         Attachments: CASSANDRA-11383.patch, new_system_log_CMS_8GB_OOM.log, system.log_sasi_build_oom
>
>
> 13 bare metal machines
> - 6 cores CPU (12 HT)
> - 64Gb RAM
> - 4 SSD in RAID0
>  JVM settings:
> - G1 GC
> - Xms32G, Xmx32G
> Data set:
>  - ≈ 100Gb/per node
>  - 1.3 Tb cluster-wide
>  - ≈ 20Gb for all SASI indices
> C* settings:
> - concurrent_compactors: 1
> - compaction_throughput_mb_per_sec: 256
> - memtable_heap_space_in_mb: 2048
> - memtable_offheap_space_in_mb: 2048
> I created 9 SASI indices
>  - 8 indices with text field, NonTokenizingAnalyser,  PREFIX mode, case-insensitive
>  - 1 index with numeric field, SPARSE mode
>  After a while, the nodes just gone OOM.
>  I attach log files. You can see a lot of GC happening while index segments are flush
> to disk. At some point the node OOM ...
> /cc [~xedin]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
