couchdb-dev mailing list archives

From Ilya Khlopotov <iil...@apache.org>
Subject Re: # [DISCUSS] : things we need to solve/decide : storage of edit conflicts
Date Wed, 06 Feb 2019 17:00:35 GMT
Hi Adam,

> Did you figure out how to avoid reading the entire merged set of json_paths when you want to write a new revision? I can’t see how to define pointers to existing KVs (which seems to be where you’re looking to gain storage efficiencies) without doing that full read.

No. This is still a problem; the full read is required at the moment.

best regards,
iilyak

On 2019/02/06 16:27:44, Adam Kocoloski <kocolosk@apache.org> wrote: 
> Hi Ilya,
> 
> > While reading Adam's proposal I came to realize that we don't have to calculate the winning revision at read time.
> > Since FDB's transactions are atomic we can calculate it when we write. This means we can just write the latest values into a separate range. This makes lookup of the latest version fast.
> 
> Yep, that can work. It’d be good if we can avoid extra storage overhead in the common case where a document only has a single edit branch, but I do like preserving fast read performance for the “winning” revision in all cases.
> 
> > The `rev -> vsn` indirection is not strictly required. It is added because it saves some space since `rev` is a long string and `vsn` is an FDB versionstamp of fixed size.
> 
> Yep, a document _rev is a number followed by an MD5 hash, so 20 bytes if you use an int32 for the first part. A versionstamp is 10 bytes, so it does save a bit of space for any document with more than a couple of fields.
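> 
> To make that size comparison concrete, here is a rough illustration (Python; `pack_rev` is purely hypothetical, not an API we have) of packing a _rev as an int32 position plus the raw 16-byte MD5 digest, which is where the 20 bytes come from, versus FDB’s 10-byte versionstamp:
> 
> ```
> import struct
> 
> def pack_rev(rev):
>     # "3-deadbeef..." -> 4-byte big-endian position + 16-byte MD5 digest = 20 bytes
>     pos, hex_hash = rev.split('-', 1)
>     return struct.pack('>I', int(pos)) + bytes.fromhex(hex_hash)
> 
> assert len(pack_rev('1-' + '00' * 16)) == 20  # a versionstamp is only 10 bytes
> ```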
> 
> I didn’t quite grok some parts of the pseudocode you have here. Did you figure out how to avoid reading the entire merged set of json_paths when you want to write a new revision? I can’t see how to define pointers to existing KVs (which seems to be where you’re looking to gain storage efficiencies) without doing that full read.
> 
> Adam
> 
> > On Feb 5, 2019, at 9:18 PM, Ilya Khlopotov <iilyak@apache.org> wrote:
> > 
> > While reading Adam's proposal I came to realize that we don't have to calculate the winning revision at read time.
> > Since FDB's transactions are atomic we can calculate it when we write. This means we can just write the latest values into a separate range. This makes lookup of the latest version fast.
> > Another realization is that if we want to share values for some json paths we would have to introduce a level of indirection.
> > Below is the data model inspired by Adam's idea to share json_paths. In this model the json_path is stored in the revision where it was first added (we call that revision the owner of a json_path). The values for a json_path key can be scalar values, parts of scalar values or pointers to the owner location.
> > The below snippets are sketches of transactions. 
> > The transactions will include updates to other keys as needed (`external_size`, `by_seq` and so on). The revision tree management is not covered yet.
> > The `rev -> vsn` indirection is not strictly required. It is added because it saves some space since `rev` is a long string and `vsn` is an FDB versionstamp of fixed size.
> > 
> > - `{NS} / {docid} / _by_rev / {rev} = vsn` 
> > - `{NS} / {docid} / _used_by / {json_path} / {another_vsn} = NIL`
> > - `{NS} / {docid} / _data / {json_path} = latest_value | part`
> > - `{NS} / {docid} / {vsn} / _data / {json_path} = value | part | {another_vsn}`
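> > 
> > For reference, "flattened json" in the sketches below means the document body split into (json_path, scalar) pairs. A minimal illustrative helper (Python; not part of the proposal itself) could look like:
> > 
> > ```
> > def flatten(json, prefix=()):
> >     # Yield (json_path, scalar) pairs for a nested JSON document.
> >     # A json_path is a tuple of keys such as ('owner', 'name'); scalars
> >     # (strings, numbers, booleans, None) terminate a path.
> >     if isinstance(json, dict) and json:
> >         for key, value in json.items():
> >             yield from flatten(value, prefix + (key,))
> >     else:
> >         yield prefix, json
> > 
> > # flatten({'owner': {'name': 'bob'}, 'active': True}) yields
> > # (('owner', 'name'), 'bob') and (('active',), True)
> > ```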
> > 
> > ```
> > write(txn, doc_id, prev_rev, json):
> >   txn.add_write_conflict_key("{NS} / {docid} / _rev")
> >   rev = generate_new_rev()
> >   vsn = versionstamp()  # fixed-size FDB versionstamp, filled in at commit time
> >   txn["{NS} / {docid} / _by_rev / {rev}"] = vsn
> >   for every (json_path, value) in flattened json:
> >     # record that this revision references the owner of json_path
> >     - {NS} / {docid} / _used_by / {json_path} / {vsn} = NIL
> >     if rev is HEAD:
> >       # this range contains values for all json paths for the latest revision (read optimization)
> >       - {NS} / {docid} / _data / {json_path} = latest_value | part
> >     - {NS} / {docid} / {vsn} / _data / {json_path} = value | part | {another_vsn}
> >   txn["{NS} / {docid} / _rev"] = rev
> > 
> > get_current(txn, doc_id):
> >   # there is no sharing of json_paths in this range (read optimization)
> >   return txn.get_range("{NS} / {docid} / _data / 0x00", "{NS} / {docid} / _data / 0xFF")
> > 
> > get_revision(txn, doc_id, rev):
> >   vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
> >   json_paths = txn.get_range("{NS} / {docid} / {vsn} / _data / 0x00", "{NS} / {docid} / {vsn} / _data / 0xFF")
> >   for every (json_path, value) in json_paths:
> >     if value has type vsn:
> >       # follow the pointer to the owner location
> >       another_vsn = value
> >       value = txn["{NS} / {docid} / {another_vsn} / _data / {json_path}"]
> >     result[json_path] = value
> >   return result
> > 
> > delete_revision(txn, doc_id, rev):
> >   vsn = txn["{NS} / {docid} / _by_rev / {rev}"]
> >   json_paths = txn.get_range("{NS} / {docid} / {vsn} / _data / 0x00", "{NS} / {docid} / {vsn} / _data / 0xFF")
> >   for every (json_path, value) in json_paths:
> >     if value has type vsn:
> >       # remove the reference to the deleted revision from the owner
> >       del txn["{NS} / {docid} / _used_by / {json_path} / {vsn}"]
> >     # check that the deleted revision's json_path is not used by anything else
> >     if txn.get_range("{NS} / {docid} / _used_by / {json_path} / 0x00", "{NS} / {docid} / _used_by / {json_path} / 0xFF", limit=1) == []:
> >       del txn["{NS} / {docid} / {vsn} / _data / {json_path}"]
> >   if vsn is HEAD:
> >     copy the range of the new winning revision into "{NS} / {docid} / _data /"
> > ```
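> > 
> > As a rough illustration of the read optimization, the `get_current` fast path could be expressed with the Python fdb bindings roughly as follows (a sketch only; it assumes values are packed with `fdb.tuple` and a `('NS', docid, '_data')` subspace layout):
> > 
> > ```
> > import fdb
> > fdb.api_version(600)
> > 
> > ns = fdb.Subspace(('NS',))
> > 
> > @fdb.transactional
> > def get_current(tr, doc_id):
> >     # Read only the "{NS} / {docid} / _data / ..." range, which holds the
> >     # fully materialized latest revision, so no pointer chasing is needed.
> >     data = ns[doc_id]['_data']
> >     r = data.range()
> >     doc = {}
> >     for k, v in tr.get_range(r.start, r.stop):
> >         json_path = data.unpack(k)
> >         doc[json_path] = fdb.tuple.unpack(v)[0]
> >     return doc
> > ```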
> > 
> > best regards,
> > iilyak
> > 
> > On 2019/02/04 23:22:09, Adam Kocoloski <kocolosk@apache.org> wrote: 
> >> I think it’s fine to start a focused discussion here as it might help inform some of the broader debate over in that thread.
> >> 
> >> As a reminder, today CouchDB writes the entire body of each document revision on disk as a separate blob. Edit conflicts that have common fields between them do not share any storage on disk. The revision tree is encoded into a compact format and a copy of it is stored directly in both the by_id tree and the by_seq tree. Each leaf entry in the revision tree contains a pointer to the position of the associated doc revision on disk.
> >> 
> >> As a further reminder, CouchDB 2.x clusters can generate edit conflict revisions just from multiple clients concurrently updating the same document in a single cluster. This won’t happen when FoundationDB is running under the hood, but users who deploy multiple CouchDB or PouchDB servers and replicate between them can of course still produce conflicts just like they could in CouchDB 1.x, so we need a solution.
> >> 
> >> Let’s consider the two sub-topics separately: 1) storage of edit conflict bodies and 2) revision trees
> >> 
> >> ## Edit Conflict Storage
> >> 
> >> The simplest possible solution would be to store each document revision separately, like we do today. We could store document bodies with (“docid”, “revid”) as the key prefix, and each transaction could clear the key range associated with the base revision against which the edit is being attempted. This would work, but I think we can try to be a bit more clever and save on storage space given that we’re splitting JSON documents into multiple KV pairs.
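> >> 
> >> A minimal sketch of that zero-sharing baseline with the Python fdb bindings (the subspace name and the `flattened` input are illustrative, not part of the proposal):
> >> 
> >> ```
> >> import fdb
> >> fdb.api_version(600)
> >> 
> >> docs = fdb.Subspace(('docs',))
> >> 
> >> @fdb.transactional
> >> def replace_revision(tr, doc_id, base_rev, new_rev, flattened):
> >>     # Drop every KV stored under the base revision being edited against...
> >>     tr.clear_range_startswith(docs[doc_id][base_rev].key())
> >>     # ...and write the new revision's fields as individual KVs under
> >>     # the ("docid", "revid") prefix.
> >>     for json_path, value in flattened:
> >>         tr[docs[doc_id][new_rev].pack(json_path)] = fdb.tuple.pack((value,))
> >> ```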
> >> 
> >> One thought I’d had is to introduce a special enum Value which indicates that the subtree “beneath” the given Key is in conflict. For example, consider the documents
> >> 
> >> {
> >>    “_id”: “foo”,
> >>    “_rev”: “1-abc”,
> >>    “owner”: “alice”,
> >>    “active”: true
> >> }
> >> 
> >> and 
> >> 
> >> {
> >>    “_id”: “foo”,
> >>    “_rev”: “1-def”,
> >>    “owner”: “bob”,
> >>    “active”: true
> >> }
> >> 
> >> We could represent these using the following set of KVs:
> >> 
> >> (“foo”, “active”) = true
> >> (“foo”, “owner”) = kCONFLICT
> >> (“foo”, “owner”, “1-abc”) = “alice”
> >> (“foo”, “owner”, “1-def”) = “bob”
> >> 
> >> This approach also extends to conflicts where the two versions have different data types. Consider a more complicated example where bob dropped the “active” field and changed the “owner” field to an object:
> >> 
> >> {
> >>  “_id”: “foo”,
> >>  “_rev”: “1-def”,
> >>  “owner”: {
> >>    “name”: “bob”,
> >>    “email”: “bob@example.com”
> >>  }
> >> }
> >> 
> >> Now the set of KVs for “foo” looks like this (note that a missing field needs to be handled explicitly):
> >> 
> >> (“foo”, “active”) = kCONFLICT
> >> (“foo”, “active”, “1-abc”) = true
> >> (“foo”, “active”, “1-def”) = kMISSING
> >> (“foo”, “owner”) = kCONFLICT
> >> (“foo”, “owner”, “1-abc”) = “alice”
> >> (“foo”, “owner”, “1-def”, “name”) = “bob”
> >> (“foo”, “owner”, “1-def”, “email”) = “bob@example.com”
> >> 
> >> I like this approach for the common case where documents share most of their data but have a conflict in a very specific field or set of fields.
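> >> 
> >> A toy sketch of how those KVs could be generated (Python; kCONFLICT and kMISSING are stand-in sentinels here, and for simplicity it only compares revisions at the granularity of top-level fields). It reproduces the layouts in both examples above:
> >> 
> >> ```
> >> kCONFLICT, kMISSING = object(), object()  # stand-ins for the special enum values
> >> 
> >> def flatten(value, prefix=()):
> >>     # Split a JSON value into (path, scalar) pairs.
> >>     if isinstance(value, dict) and value:
> >>         for k, v in value.items():
> >>             yield from flatten(v, prefix + (k,))
> >>     else:
> >>         yield prefix, value
> >> 
> >> def conflict_kvs(doc_id, revisions):
> >>     # revisions: dict mapping rev id -> document body (without _id/_rev).
> >>     # Fields on which every revision agrees are stored once; the rest are
> >>     # marked with kCONFLICT and stored per revision.
> >>     fields = set()
> >>     for body in revisions.values():
> >>         fields.update(body)
> >>     for field in sorted(fields):
> >>         versions = {rev: body.get(field, kMISSING) for rev, body in revisions.items()}
> >>         values = list(versions.values())
> >>         if all(v == values[0] for v in values[1:]):
> >>             # every revision agrees: store the field once, with no rev in the key
> >>             for path, scalar in flatten(values[0], (doc_id, field)):
> >>                 yield path, scalar
> >>         else:
> >>             # revisions disagree: mark the subtree and store each rev's copy
> >>             yield (doc_id, field), kCONFLICT
> >>             for rev, v in sorted(versions.items()):
> >>                 if v is kMISSING:
> >>                     yield (doc_id, field, rev), kMISSING
> >>                 else:
> >>                     for path, scalar in flatten(v, (doc_id, field, rev)):
> >>                         yield path, scalar
> >> 
> >> # conflict_kvs('foo', {'1-abc': {'owner': 'alice', 'active': True},
> >> #                      '1-def': {'owner': 'bob',   'active': True}})
> >> # yields ('foo', 'active') = True, ('foo', 'owner') = kCONFLICT, and one
> >> # ('foo', 'owner', rev) KV per revision, as in the first example.
> >> ```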
> >> 
> >> I’ve encountered one important downside, though: an edit that replicates in and conflicts with the entire document can cause a bit of a data explosion. Consider a case where I have 10 conflicting versions of a 100KB document, but the conflicts are all related to a single scalar value. Now I replicate in an empty document, and suddenly I have a kCONFLICT at the root. In this model I now need to list out every path of every one of the 10 existing revisions and I end up with a 1MB update. Yuck. That’s technically no worse in the end state than the “zero sharing” case above, but one could easily imagine overrunning the transaction size limit this way.
> >> 
> >> I suspect there’s a smart path out of this. Maybe the system detects a “default” value for each field and uses that instead of writing out the value for every revision in a conflicted subtree. Worth some discussion.
> >> 
> >> ## Revision Trees
> >> 
> >> In CouchDB we currently represent revisions as a hash history tree; each revision identifier is derived from the content of the revision including the revision identifier of its parent. Individual edit branches are bounded in *length* (I believe the default is 1000 entries), but the number of edit branches is technically unbounded.
> >> 
> >> The size limits in FoundationDB preclude us from storing the entire key tree as a single value; in pathological situations the tree could exceed 100KB. Rather, I think it would make sense to store each edit *branch* as a separate KV. We stem the branch long before it hits the value size limit, and in the happy case of no edit conflicts this means we store the edit history metadata in a single KV. It also means that we can apply an interactive edit without retrieving the entire conflicted revision tree; we need only retrieve and modify the single branch against which the edit is being applied. The downside is that we duplicate historical revision identifiers shared by multiple edit branches, but I think this is a worthwhile tradeoff.
> >> 
> >> I would furthermore try to structure the keys so that it is possible to retrieve the “winning” revision in a single limit=1 range query. Ideally I’d like to provide the following properties:
> >> 
> >> 1) a document read does not need to retrieve the revision tree at all, just the winning revision identifier (which would be stored with the rest of the doc)
> >> 2) a document update only needs to read the edit branch of the revision tree against which the update is being applied, and it can read that branch immediately knowing only the content of the edit that is being attempted (i.e., it does not need to read the current version of the document itself).
> >> 
> >> So, I’d propose a separate subspace (maybe “_meta”?) for the revision trees, with keys and values that look like
> >> 
> >> (“_meta”, DocID, IsDeleted, RevPosition, RevHash) = [ParentRev, GrandparentRev, …]
> >> 
> >> The inclusion of IsDeleted, RevPosition and RevHash in the key should be sufficient (with the right encoding) to create a range query that automatically selects the “winner” according to CouchDB’s arcane rules, which are something like
> >> 
> >> 1) deleted=false beats deleted=true
> >> 2) longer paths (i.e. higher RevPosition) beat shorter ones
> >> 3) RevHashes with larger binary values beat ones with smaller values
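> >> 
> >> One possible version of “the right encoding” mentioned above, sketched with the Python fdb bindings (all names here are illustrative): store int(not deleted) in place of IsDeleted so that the winning branch sorts last within the document’s subspace, and a single reverse get_range with limit=1 returns it:
> >> 
> >> ```
> >> import fdb
> >> fdb.api_version(600)
> >> 
> >> meta = fdb.Subspace(('_meta',))
> >> 
> >> @fdb.transactional
> >> def store_branch(tr, doc_id, deleted, rev_pos, rev_hash, ancestors):
> >>     # One KV per edit branch: the leaf revision in the key, its (stemmed)
> >>     # ancestry in the value.
> >>     key = meta.pack((doc_id, int(not deleted), rev_pos, rev_hash))
> >>     tr[key] = fdb.tuple.pack(tuple(ancestors))
> >> 
> >> @fdb.transactional
> >> def winning_branch(tr, doc_id):
> >>     # deleted=false sorts after deleted=true, higher RevPosition sorts
> >>     # later, larger RevHash bytes sort later -- so the winner is the last
> >>     # key, retrieved with a single reverse, limit=1 range read.
> >>     r = meta[doc_id].range()
> >>     for k, v in tr.get_range(r.start, r.stop, limit=1, reverse=True):
> >>         not_deleted, rev_pos, rev_hash = meta[doc_id].unpack(k)
> >>         return {'deleted': not not_deleted,
> >>                 'rev': (rev_pos, rev_hash),
> >>                 'ancestors': list(fdb.tuple.unpack(v))}
> >>     return None
> >> ```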
> >> 
> >> ===========
> >> 
> >> OK, that’s all on this topic from me for now. I think this is a particularly exciting area where we start to see the dividends of splitting up data into multiple KV pairs in FoundationDB :) Cheers,
> >> 
> >> Adam
> >> 
> >> 
> >>> On Feb 4, 2019, at 2:41 PM, Robert Newson <rnewson@apache.org> wrote:
> >>> 
> >>> This one is quite tightly coupled to the other thread on data model; should we start much conversation here before that one gets closer to a solution?
> >>> 
> >>> -- 
> >>> Robert Samuel Newson
> >>> rnewson@apache.org
> >>> 
> >>> On Mon, 4 Feb 2019, at 19:25, Ilya Khlopotov wrote:
> >>>> This is the beginning of a discussion thread about storage of edit conflicts and everything that relates to revisions.
> >>>> 
> >>>> 
> >> 
> >> 
> 
> 
