couchdb-dev mailing list archives

From: Paul Davis <paul.joseph.da...@gmail.com>
Subject: Re: # [DISCUSS] : things we need to solve/decide : storage of edit conflicts
Date: Fri, 08 Feb 2019 13:55:13 GMT
On Thu, Feb 7, 2019 at 9:35 PM Adam Kocoloski <kocolosk@apache.org> wrote:
>
> Bob, Garren, Jan - heard you loud and clear, K.I.S.S. I do think it’s a bit “simplistic”
> to exclusively choose simplicity over performance and storage density. We’re (re)building
> a database here, one that has some users with pretty demanding performance and scalability
> requirements. And yes, we should certainly be testing and measuring. Kyle and team are setting
> up infrastructure in IBM land to help with that now, but I also believe we can design the
> data model and architecture with a basic performance model of FoundationDB in mind:
>
> - reads cost 1ms
> - short range reads are the same cost as a single lookup
> - reads of independent parts of the keyspace can be parallelized for cheap
> - writes are zero-cost until commit time
>
> We ought to be able to use these assumptions to drive some decisions about data models
> ahead of any end-to-end performance test.
>
> If there are specific elements of the edit conflicts management where you think greater
> simplicity is warranted, let’s get those called out. Ilya noted (correctly, in my opinion)
> that the term sharing stuff is one of those items. It’s relatively complex, potentially
> a performance hit, and only saves on storage density in the corner case of lots of edit conflicts.
> That’s a good one to drop.
>
> I’m relatively happy with the revision history data model at this point. Hopefully
> folks find it easy to grok, and it’s efficient for both reads and writes. It costs some
> extra storage for conflict revisions compared to the current tree representation (up to 16K
> per edit branch, with default _revs_limit) but knowing what we know about the performance
> death spiral for wide revision trees today I’ll happily make a storage vs. performance tradeoff
> here :)
>
> Setting the shared term approach aside, I’ve still been mulling over the key structure
> for the actual document data:
>
> - I thought about trying to construct a special _conflicts subspace, but I don’t like
> that approach because the choice of a “winning” revision can flip back and forth very quickly
> with concurrent writers to different edit branches. I think we really want to have a way for
> revisions to naturally sort themselves so the winner is the first or last revision in a list.
>
> - Assuming we’re using key paths of the form (docid, revision-ish, path, to, field),
> the goal here is to find an efficient way to get the last key with prefix “docid” (assuming
> winner sorts last), and then all the keys that share the same (docid, revision-ish) prefix
> as that one. I see two possible approaches so far, neither perfect:
>
> Option 1: Execute a get_key() operation with a key selector that asks for the last key
> less than “docid\xFF” (again assuming winner sorts last), and then do a get_range_startswith()
> request setting the streaming mode to “want_all” and the prefix to the docid plus whatever
> revision-ish we found from the get_key() request. This is two roundtrips instead of one, but
> it always retrieves exactly the right set of keys, and the second step is executed as fast
> as possible.
>
> Option 2: Jump straight to a get_range_startswith() request using only “docid” as the
> prefix, then cancel the iteration once we reach a revision not equal to the first one we see.
> We might transfer too much data, or we might end up doing multiple roundtrips if the default
> “iterator” streaming mode sends too little data to start (I haven’t checked what the
> default iteration block is there), but in the typical case of zero edit conflicts we have
> a good chance of retrieving the full document in one roundtrip.
>
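
A rough sketch of what Option 1 might look like against the Python
bindings. The bare-tuple key layout of (docid, revision-ish, path...)
and every name below are illustrative guesses, not a settled design:

    import fdb

    fdb.api_version(600)
    db = fdb.open()

    @fdb.transactional
    def read_winning_rev(tr, docid):
        doc_prefix = fdb.tuple.pack((docid,))
        # Roundtrip 1: the last key sorting under the docid prefix; the
        # winning revision is assumed to sort last.
        winner_key = tr.get_key(
            fdb.KeySelector.last_less_than(doc_prefix + b'\xff'))
        if not winner_key.startswith(doc_prefix):
            return None  # nothing stored for this docid
        # Keys are assumed to be packed (docid, revision-ish, path...)
        # tuples, so the second element is the revision-ish.
        _docid, revision_ish = fdb.tuple.unpack(winner_key)[:2]
        # Roundtrip 2: everything under (docid, revision-ish), streamed
        # as fast as the cluster will send it.
        rev_prefix = fdb.tuple.pack((docid, revision_ish))
        return list(tr.get_range_startswith(
            rev_prefix, streaming_mode=fdb.StreamingMode.want_all))

    # e.g. read_winning_rev(db, b'docid')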

I'm working through getting fdb bindings written and fully tested, so I
have a couple of notes on the range streaming bits.

The first somewhat surprising bit was that the want_all streaming mode
doesn't actually return all rows in a single call. If the range is
large enough, a client has to invoke get_range a number of times,
tweaking parameters to know when to stop. I just ran into that last
night but discovered the logic isn't terribly complicated. The Python
implementation is at [1] for reference.
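
The gist of that loop, heavily paraphrased (get_range_chunk below is a
made-up stand-in for whatever low-level call a binding makes into the C
client's get_range, so don't take the signature literally):

    import fdb

    fdb.api_version(600)

    def read_entire_range(tr, begin, end, mode, get_range_chunk):
        # Each chunk comes back with a 'more' flag. Keep re-issuing the
        # request with the begin selector advanced past the last key
        # seen and the iteration counter bumped, until 'more' is false.
        iteration = 1
        results = []
        while True:
            kvs, more = get_range_chunk(tr, begin, end, mode, iteration)
            results.extend(kvs)
            if not more or not kvs:
                return results
            begin = fdb.KeySelector.first_greater_than(kvs[-1].key)
            iteration += 1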

> I don’t have a good sense of which option wins out here from a performance perspective,
> but they’re both operating on the same data model so easy enough to test the alternatives.
> The important bit is getting the revision-ish things to sort correctly. I think we can do
> that by generating something like
>

There are a number of different streaming modes, and my wager is that
we'll have to play with them a bit to get an idea of which is best, and
likely the answer will be which is best in which scenario. For
instance, when we have a defined range, want_all likely wins, but in
the "iterate until next revision found" situation I'd wager the
`iterator` mode would be best, as it appears to adaptively size row
batches.
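
A rough sketch of that second case, again assuming the winner sorts
last and keys are packed (docid, revision-ish, path...) tuples; reading
in reverse so the winner comes back first is just one way to do it:

    import fdb

    fdb.api_version(600)
    db = fdb.open()

    @fdb.transactional
    def read_winner_single_scan(tr, docid):
        doc_prefix = fdb.tuple.pack((docid,))
        winner_rev = None
        rows = []
        # Reverse scan: the winning (last-sorting) revision comes back
        # first, and we bail out as soon as a different revision-ish
        # shows up. The iterator mode sizes row batches adaptively.
        for kv in tr.get_range_startswith(
                doc_prefix,
                reverse=True,
                streaming_mode=fdb.StreamingMode.iterator):
            rev = fdb.tuple.unpack(kv.key)[1]
            if winner_rev is None:
                winner_rev = rev
            elif rev != winner_rev:
                break
            rows.append(kv)
        return rows

    # e.g. read_winner_single_scan(db, b'docid')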

> revision-ish = NotDeleted/1bit : RevPos : RevHash
>
> with some suitable order-preserving encoding on the RevPos integer.
>

The tuple layer already supports minimally sized order-preserving
encodings for integers [2], so reusing that is enough. Shrinking the
deleted flag into one bit may be a bit of a bother, as the tuple layer
encodes booleans as a single byte (though it's the single type byte,
not a type+value byte). A possibly simple approach to shrinking that
would be to encode the RevPos as a negative number for all deleted
revisions. That seems legit, though it's definitely registering on my
"too cute" meter, so I may be missing a reason that'd break before I'm
properly caffeinated.
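
To make that concrete with the Python tuple layer (the exact shape of
revision-ish is obviously still up in the air, this is just to show the
ordering property):

    import fdb

    fdb.api_version(600)

    def revision_ish(rev_pos, rev_hash, deleted):
        # Deleted revisions get a negative RevPos, so they sort before
        # every live revision (RevPos starts at 1, so there is no -0
        # collision). The tuple layer's integer encoding preserves order
        # and only uses as many bytes as the value needs.
        pos = -rev_pos if deleted else rev_pos
        return fdb.tuple.pack((pos, rev_hash))

    # Deleted revisions sort before live ones, and higher RevPos wins:
    assert (revision_ish(2, b'\xaa' * 16, True)
            < revision_ish(1, b'\xbb' * 16, False))
    assert (revision_ish(1, b'\x00' * 16, False)
            < revision_ish(3, b'\x00' * 16, False))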

Paul

[1] https://github.com/apple/foundationdb/blob/6.0.15/bindings/python/fdb/impl.py#L358-L395
[2] https://github.com/apple/foundationdb/blob/master/design/tuple.md#integer
