cassandra-commits mailing list archives

From "Robert Stupp (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-11206) Support large partitions on the 3.0 sstable format
Date Sun, 27 Mar 2016 10:40:25 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Stupp updated CASSANDRA-11206:
-------------------------------------
    Status: Patch Available  (was: In Progress)

Pushed the latest version to the [git branch|https://github.com/apache/cassandra/compare/trunk...snazy:11206-large-part-trunk?expand=1].
CI results ([testall|http://cassci.datastax.com/job/snazy-11206-large-part-trunk-testall/lastCompletedBuild/testReport/],
[dtest|http://cassci.datastax.com/job/snazy-11206-large-part-trunk-dtest/lastCompletedBuild/testReport/])
and cstar results (see below) look good.

The initial approach was to “ban” all {{IndexInfo}} instances from the key cache. Although that is a great option for big partitions, “moderately” sized partitions suffer from it (see the “0kB” cstar run below). As a compromise, a new {{cassandra.yaml}} option {{column_index_cache_size_in_kb}} has been introduced that defines when {{IndexInfo}} objects are no longer kept on heap. The new option defaults to 2 kB and can be tuned to lower values (down to 0) or higher ones. Some thoughts on both directions (a rough sketch of the threshold decision follows the list):
* Setting the value to 0 or some other very low value will obviously reduce GC pressure at the cost of higher I/O
* The cost of accessing index samples on disk is twofold: first, there is the obvious I/O cost via a {{RandomAccessReader}}; second, each {{RandomAccessReader}} instance has its own buffer (which can be off- or on-heap, but seems to default to off-heap), so there appears to be some (quite small) overhead to borrow/release that buffer.
* The higher the value of {{column_index_cache_size_in_kb}}, the more objects stay on the heap - and therefore the higher the GC pressure
* Note that the parameter refers to the _serialized_ size, not the _number_, of {{IndexInfo}} objects. This was chosen to give a more obvious relation between the size of the {{IndexInfo}} objects and the amount of consumed heap - the size of an {{IndexInfo}} object is mostly determined by the size of the clustering keys.
* Also note that some internal system/schema tables - especially those for LWTs - use clustering
keys and therefore index samples.
* For workloads with a huge number of random reads against a large data set, small values for {{column_index_cache_size_in_kb}} (like the default) are beneficial if the key cache is always full (i.e. it is evicting a lot).
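To make the knob concrete, here is a minimal sketch in Java (hypothetical class and method names, not code from the patch) of the decision the threshold drives: the _serialized_ size of a partition’s {{IndexInfo}} payload is compared against {{column_index_cache_size_in_kb}}, and only payloads at or below the threshold stay on heap and in the key cache.

{code:java}
// Minimal sketch, not the actual patch: hypothetical names illustrating how
// column_index_cache_size_in_kb acts as a threshold on the *serialized* size
// of a partition's IndexInfo payload.
final class IndexCachePolicySketch
{
    private final long thresholdBytes;

    IndexCachePolicySketch(int columnIndexCacheSizeInKb)
    {
        // the yaml option is expressed in kB, serialized sizes in bytes
        this.thresholdBytes = columnIndexCacheSizeInKb * 1024L;
    }

    /**
     * @param serializedIndexInfoSize serialized size of all IndexInfo samples of one
     *                                partition (not their count)
     * @return true if the samples are small enough to keep on heap / in the key cache
     */
    boolean keepOnHeap(long serializedIndexInfoSize)
    {
        return serializedIndexInfoSize <= thresholdBytes;
    }
}
{code}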

Some local tests with the new {{LargePartitionsTest}} on my MacBook (Time Machine and indexing turned off) indicate that caching seems to work for shallow indexed entries.

I’ve scheduled some cstar runs against the _trades_ workload. Only the result with {{column_index_cache_size_in_kb: 0}} (which means that no {{IndexInfo}} is kept on heap or in the key cache) shows a performance regression. The default value of 2 kB for {{column_index_cache_size_in_kb}} was chosen as a result of this experiment.
* {{column_index_cache_size_in_kb: 0}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=e96c871e-f275-11e5-83a4-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2794.77&ymin=0&ymax=141912.1]
* {{column_index_cache_size_in_kb: 2}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=410592e2-f288-11e5-95fb-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2044.46&ymin=0&ymax=142732.7]
* {{column_index_cache_size_in_kb: 4}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=f8e36ec4-f275-11e5-a3d3-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2101.44&ymin=0&ymax=141696.5]
* {{column_index_cache_size_in_kb: 8}} - [cstar result|http://cstar.datastax.com/graph?command=one_job&stats=be3516a6-f275-11e5-95fb-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=2057.88&ymin=0&ymax=142156.3]

Other cstar runs ([here|http://cstar.datastax.com/graph?command=one_job&stats=ce9de45a-f275-11e5-83a4-0256e416528f&metric=op_rate&operation=write&smoothing=1&show_aggregates=true],
[here|http://cstar.datastax.com/graph?command=one_job&stats=c97118bc-f275-11e5-95fb-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=259.82&ymin=0&ymax=73909]
and [here|http://cstar.datastax.com/graph?command=one_job&stats=c4ece8d4-f275-11e5-83a4-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=496.32&ymin=0&ymax=89063.7])
have shown that there’s no change for some plain workloads.

Daily regression tests show similar performance: [compaction|http://cstar.datastax.com/graph?command=one_job&stats=86d7cda8-f346-11e5-8ef0-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=54.56&ymin=0&ymax=275053.9],
[repair|http://cstar.datastax.com/graph?command=one_job&stats=9c78fd1c-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=55.88&ymin=0&ymax=279059],
[STCS|http://cstar.datastax.com/graph?command=one_job&stats=ac43b886-f346-11e5-8ef0-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=170.39&ymin=0&ymax=98341.1],
[DTCS|http://cstar.datastax.com/graph?command=one_job&stats=b8e0a11c-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=172.15&ymin=0&ymax=96739.5],
[LCS|http://cstar.datastax.com/graph?command=one_job&stats=f2b4530c-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_write&smoothing=1&show_aggregates=true&xmin=0&xmax=171.6&ymin=0&ymax=94480.1],
[1 MV|http://cstar.datastax.com/graph?command=one_job&stats=c9fcd4e8-f346-11e5-82b8-0256e416528f&metric=op_rate&operation=1_user&smoothing=1&show_aggregates=true&xmin=0&xmax=1401.73&ymin=0&ymax=70372.5],
[3 MV|http://cstar.datastax.com/tests/id/d80c88a8-f346-11e5-82b8-0256e416528f], [rolling upgrade|http://cstar.datastax.com/graph?command=one_job&stats=44be9a90-f347-11e5-b06a-0256e416528f&metric=op_rate&operation=5_read&smoothing=1&show_aggregates=true&xmin=0&xmax=201.41&ymin=0&ymax=171992.7]

Summary of the changes:
* Ungenerified {{RowIndexEntry}}
* {{RowIndexEntry}} now has a method to create an accessor to the {{IndexInfo}} objects on
disk - that accessor requires an instance of {{FileDataInput}}
* {{RowIndexEntry}} now has three subclasses: {{ShallowIndexedEntry}}, which is basically the old {{IndexedEntry}} with the {{IndexInfo}} array-list removed and is only responsible for index files with an offsets table; {{LegacyShallowIndexedEntry}}, which is responsible for index files without an offsets table (i.e. pre-3.0); and {{IndexedEntry}}, which keeps the {{IndexInfo}} objects in an array and is used if the serialized size of the RIE’s payload is less than the new cassandra.yaml parameter {{column_index_cache_size_in_kb}}.
* {{RowIndexEntry.IndexInfoRetriever}} is the interface to access {{IndexInfo}} on disk using a {{FileDataInput}}. It has two concrete implementations: one for sstable versions with an offsets table and one for legacy sstable versions. The interface is only used from {{AbstractSSTableIterator.IndexState}} (a rough sketch of the accessor shape follows after this list).

* Added a “cache” of already deserialized {{IndexInfo}} instances in the base class of {{IndexInfoRetriever}} for “shallow” indexed entries. This is not necessary for the binary search, but it helps many other access patterns, which sometimes appear to “jump around” in the {{IndexInfo}} objects (see the search sketch after this list). Since {{IndexState}} is a short-lived object, these cached {{IndexInfo}} instances get garbage collected early.
* Writing of index files has also changed: it switches to serializing into a byte buffer instead of collecting an array-list of {{IndexInfo}} objects as soon as {{column_index_cache_size_in_kb}} is hit (see the writer sketch after this list).
* Bumped the version of the serialized key cache from {{d}} to {{e}}. The key cache and its serialized form no longer contain {{IndexInfo}} objects for indexed entries that exceed {{column_index_cache_size_in_kb}}; those entries instead need the position in the index file. Therefore, the serialized format of the key cache has changed.
* {{Serializers}} (one instance per {{CFMetaData}}) keeps a “singleton” {{IndexInfo.Serializer}} instance for {{BigFormat.latestVersion}} and constructs and keeps instances for other versions. For “shallow” RIEs we need an instance of {{IndexInfo.Serializer}} to read {{IndexInfo}} from disk - a “singleton” further reduces the number of objects on heap. To be clear, we create(d) a lot of these instances (roughly one per {{IndexInfo}} instance/operation). We could also reduce the number of {{IndexSerializer}} instances in the future, but that did not feel necessary for this ticket.
* Merged RIE’s {{IndexSerializer}} interface and {{Serializer}} class (that interface had
only a single implementation)
* Added methods to {{IndexSerializer}} to handle the special serialization for saved key caches
* Added specialized {{deserializePosition}} method to {{IndexSerializer}} as some invocations
just need the position in the data file.
* Moved the {{IndexInfo}} binary search into the {{AbstractSSTableIterator.IndexState}} class (the only place where it’s used)
* Added some more {{skip}} methods in various places. These are required to calculate the
offsets array for legacy sstable versions.
* Classes {{ColumnIndex}} and {{IndexHelper}} have been removed (their functionality moved elsewhere); {{IndexInfo}} is now a top-level class.
* Added some {{Pre_11206_*}} classes (copies of the previous implementations) to {{RowIndexEntryTest}}
* Added new {{PagingQueryTest}} to test paged queries
* Added new {{LargePartitionsTest}} to test/compare various partition sizes (to be run explicitly,
otherwise ignored)
* Added test methods in {{KeyCacheTest}} and {{KeyCacheCqlTest}} for shallow/non-shallow indexed
entries.
* Also re-added the behavior of CASSANDRA-8180 for {{IndexedEntry}} (but not for {{ShallowIndexedEntry}})
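To illustrate the shape of the on-disk accessor ({{IndexInfoRetriever}}) described above, here is a rough sketch with simplified, hypothetical names - the real interface lives in {{RowIndexEntry}} and reads through a {{FileDataInput}}:

{code:java}
import java.io.IOException;

// Rough sketch only, not the actual patch: the shape of an accessor that
// deserializes IndexInfo samples lazily from the index file instead of
// holding them all on heap. Names are simplified stand-ins.
interface IndexInfoAccessorSketch extends AutoCloseable
{
    /** Deserializes (or returns an already deserialized copy of) the idx-th index sample. */
    IndexInfoSketch indexInfo(int idx) throws IOException;

    @Override
    void close() throws IOException;
}

/**
 * Simplified stand-in for Cassandra's IndexInfo: one sample covering one block of rows
 * (the real IndexInfo also carries the first/last clustering prefixes of the block).
 */
final class IndexInfoSketch
{
    final long offset;   // offset of the covered block within the partition
    final long width;    // serialized width of the covered block

    IndexInfoSketch(long offset, long width)
    {
        this.offset = offset;
        this.width = width;
    }
}
{code}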

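Next, the search sketch referenced in the list above (hypothetical names, reusing the stand-in types from the previous sketch; the real code in {{AbstractSSTableIterator.IndexState}} compares clustering prefixes rather than raw offsets): a binary search that deserializes only the {{IndexInfo}} samples it actually touches - log(N) deserializations - and keeps them in a short-lived cache.

{code:java}
import java.io.IOException;

// Rough sketch, not the actual IndexState code: binary search over index samples
// that are deserialized on demand and cached in a short-lived array, so access
// patterns that "jump around" do not read the same sample from disk twice.
final class ShallowIndexSearchSketch
{
    private final IndexInfoAccessorSketch accessor;  // reads samples from the index file
    private final IndexInfoSketch[] cache;           // lives only as long as the read itself

    ShallowIndexSearchSketch(IndexInfoAccessorSketch accessor, int indexCount)
    {
        this.accessor = accessor;
        this.cache = new IndexInfoSketch[indexCount];
    }

    private IndexInfoSketch sample(int idx) throws IOException
    {
        IndexInfoSketch info = cache[idx];
        if (info == null)
            cache[idx] = info = accessor.indexInfo(idx);
        return info;
    }

    /** Returns the index of the last sample whose offset is <= targetOffset. */
    int indexFor(long targetOffset) throws IOException
    {
        int low = 0, high = cache.length - 1, result = 0;
        while (low <= high)
        {
            int mid = (low + high) >>> 1;
            if (sample(mid).offset <= targetOffset)
            {
                result = mid;
                low = mid + 1;
            }
            else
            {
                high = mid - 1;
            }
        }
        return result;
    }
}
{code}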

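And finally the writer sketch referenced in the list above (hypothetical names, again reusing the stand-in {{IndexInfoSketch}} type; the real code writes through Cassandra's own output abstractions): {{IndexInfo}} objects are collected on heap only until their serialized size crosses the threshold, after which everything is flushed into a buffer and further samples are serialized directly.

{code:java}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Rough sketch, not the actual patch: collect IndexInfo samples on heap only until
// column_index_cache_size_in_kb is hit, then switch to serializing into a buffer so
// huge partitions never hold the full sample list in memory.
final class ColumnIndexWriterSketch
{
    private final long thresholdBytes;                        // column_index_cache_size_in_kb * 1024
    private final List<IndexInfoSketch> onHeap = new ArrayList<>();
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(bytes);
    private long serializedSize;

    ColumnIndexWriterSketch(int columnIndexCacheSizeInKb)
    {
        this.thresholdBytes = columnIndexCacheSizeInKb * 1024L;
    }

    void add(IndexInfoSketch info) throws IOException
    {
        serializedSize += 2 * Long.BYTES;                      // toy serialized size: offset + width
        if (serializedSize <= thresholdBytes)
        {
            onHeap.add(info);                                  // still below the threshold: keep the objects
            return;
        }
        for (IndexInfoSketch buffered : onHeap)                // threshold crossed: flush what was collected ...
            write(buffered);
        onHeap.clear();
        write(info);                                           // ... and serialize everything else directly
    }

    private void write(IndexInfoSketch info) throws IOException
    {
        out.writeLong(info.offset);
        out.writeLong(info.width);
    }
}
{code}
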
> Support large partitions on the 3.0 sstable format
> --------------------------------------------------
>
>                 Key: CASSANDRA-11206
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11206
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Jonathan Ellis
>            Assignee: Robert Stupp
>             Fix For: 3.x
>
>         Attachments: 11206-gc.png, trunk-gc.png
>
>
> Cassandra saves a sample of IndexInfo objects that store the offset within each partition
of every 64KB (by default) range of rows.  To find a row, we binary search this sample, then
scan the partition of the appropriate range.
> The problem is that this scales poorly as partitions grow: on a cache miss, we deserialize the entire set of IndexInfo, which both creates a lot of GC overhead (as noted in CASSANDRA-9754) and causes non-negligible i/o activity (relative to reading a single 64KB row range) as partitions get truly large.
> We introduced an "offset map" in CASSANDRA-10314 that allows us to perform the IndexInfo
bsearch while only deserializing IndexInfo that we need to compare against, i.e. log(N) deserializations.



