couchdb-dev mailing list archives

From Paul Davis <paul.joseph.da...@gmail.com>
Subject Re: [DISCUSS] : things we need to solve/decide : storing JSON documents
Date Tue, 19 Feb 2019 22:41:23 GMT
A simple doc storage version number would likely be enough for future us to
do fancier things.
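
For illustration, a rough sketch (Python, with made-up names; not a settled
design) of how a stored version number lets readers dispatch on the layout
when decoding a body:

    import json

    # Hypothetical format tags; the point is only that every stored body
    # carries a small version number alongside it.
    DOC_FORMAT_JSON_BLOB = 1   # body stored as an opaque JSON blob
    DOC_FORMAT_EXPLODED = 2    # body exploded into path/value KVs

    def decode_body(version, payload):
        if version == DOC_FORMAT_JSON_BLOB:
            return json.loads(payload)
        if version == DOC_FORMAT_EXPLODED:
            # placeholder for a future, fancier decoder
            raise NotImplementedError("exploded decoder not written yet")
        raise ValueError("unknown doc storage version: %r" % version)

    print(decode_body(DOC_FORMAT_JSON_BLOB, '{"owner": "bob"}'))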

On Tue, Feb 19, 2019 at 4:16 PM Benjamin Anderson <banjiewen@apache.org>
wrote:

> > I don’t think adding a layer of abstraction is the right move just yet;
> I think we should continue to find consensus on one answer to this question
>
> Agree that the theorycrafting stage is not optimal for making
> abstraction decisions, but I suspect it would be worthwhile somewhere
> between prototyping and releasing. Adam's proposal does seem to me the
> most appealing approach on the surface, and I don't see anyone signing
> up to do the work to deliver an alternative concurrently.
>
> --
> ba
>
> On Tue, Feb 19, 2019 at 1:43 PM Robert Samuel Newson <rnewson@apache.org>
> wrote:
> >
> > Addendum: By “directory aliasing” I meant within a document (either the
> actual Directory thing or something equivalent of our own making). The
> directory aliasing for each database is a good way to reduce key size
> without a significant cost. Though if Redwood lands in time, even this
> would become an inutile obfuscation.
> >
> > > On 19 Feb 2019, at 21:39, Robert Samuel Newson <rnewson@apache.org>
> wrote:
> > >
> > > Interesting suggestion, obviously the details might get the wrong kind
> of fun.
> > >
> > > Somewhere above I suggested this would be something we could change
> over time and even use different approaches for different documents within
> the same database. This is the long way of saying there are multiple ways
> to do this each with advantages and none without disadvantages.
> > >
> > > I don’t think adding a layer of abstraction is the right move just
> yet; I think we should continue to find consensus on one answer to this
> question (and the related ones in other threads) for the first release.
> It’s easy to say “we can change it later”, of course. We can, though it
> would be a chunk of work in the context of something that already works,
> and I’ve rarely seen anyone sign up for that.
> > >
> > > I’m fine with the first proposal from Adam, where the keys are tuples
> of key parts pointing at terminal values. To make it easier for the first
> version, I would exclude optimisations like deduplication or the Directory
> aliasing or the schema thing that I suggested and that Ilya incorporated a
> variant of in a follow-up post. We’d accept that there are limits on the
> sizes of documents, including the awkward-to-express one about property
> depth.
> > >
> > > Stepping back, I’m not seeing any essential improvement over Adam’s
> original proposal besides the few corrections and clarifications made by
> various authors. Could we start an RFC based on Adam’s original proposal on
> document body, revision tree and index storage? We could then have PR’s
> against that for each additional optimisation (one person’s optimisation is
> another person’s needless complication)?
> > >
> > > If I’ve missed some genuine advance on the original proposal in this
> long thread, please call it out for me.
> > >
> > > B.
> > >
> > >> On 19 Feb 2019, at 21:15, Benjamin Anderson <banjiewen@apache.org>
> wrote:
> > >>
> > >> As is evident by the length of this thread, there's a pretty big
> > >> design space to cover here, and it seems unlikely we'll have arrived
> > >> at a "correct" solution even by the time this thing ships. Perhaps it
> > >> would be worthwhile to treat the in-FDB representation of data as a
> > >> first-class abstraction and support multiple representations
> > >> simultaneously?
> > >>
> > >> Obviously there's no such thing as a zero-cost abstraction - and I've
> > >> not thought very hard about how far up the stack the document
> > >> representation would need to leak - but supporting different layouts
> > >> (primarily, as Adam points out, on the document body itself) might
> > >> prove interesting and useful. I'm sure there are folks interested in a
> > >> column-shaped CouchDB, for example.
> > >>
> > >> --
> > >> b
> > >>
> > >> On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnewson@apache.org>
> wrote:
> > >>>
> > >>> Good points on revtree, I agree with you we should store that
> intelligently to gain the benefits you mentioned.
> > >>>
> > >>> --
> > >>> Robert Samuel Newson
> > >>> rnewson@apache.org
> > >>>
> > >>> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote:
> > >>>> I do not think we should store the revtree as a blob. The design
> where
> > >>>> each edit branch is its own KV should save on network IO and CPU
> cycles
> > >>>> for normal updates. We’ve performed too many heroics to keep
> > >>>> couch_key_tree from stalling entire databases when trying to update
> a
> > >>>> single document with a wide revision tree; I would much prefer to
> ignore
> > >>>> other edit branches entirely when all we’re doing is extending one
> of
> > >>>> them.
> > >>>>
> > >>>> I also do not think we should store JSON documents as blobs, but
> it’s a
> > >>>> closer call. Some of my reasoning for preferring the exploded path
> > >>>> design:
> > >>>>
> > >>>> - it lends itself nicely to sub-document operations, for which Jan
> > >>>> crafted an RFC last year:
> https://github.com/apache/couchdb/issues/1559
> > >>>> - it optimizes the creation of Mango indexes on existing databases
> since
> > >>>> we only need to retrieve the value(s) we want to index
> > >>>> - it optimizes Mango queries that use field selectors
> > >>>> - anyone who wants to try their hand at GraphQL will find it very
> > >>>> handy: https://github.com/apache/couchdb/issues/1499
> > >>>> - looking further ahead, it lets us play with smarter leaf value
> types
> > >>>> like Counters (yes I’m still on the CRDT bandwagon, sorry)
> > >>>>
> > >>>> A few comments on the thread:
> > >>>>
> > >>>>>>> * Most document bodies are probably going to be smaller than
> 100k. So in
> > >>>>>>> the majority of cases it would be one write / one read to update
> and fetch
> > >>>>>>> the document body.
> > >>>>
> > >>>> We should test, but I expect reading 50KB of data in a range query
> is
> > >>>> almost as efficient as reading a single 50 KB value. Similarly,
> writes
> > >>>> to a contiguous set of keys should be quite efficient.
> > >>>>
> > >>>> I am concerned about the overhead of the repeated field paths in the
> > >>>> keys with the exploded path option in the absence of key prefix
> > >>>> compression. That would be my main reason to acquiesce and throw
> away
> > >>>> all the document structure.
> > >>>>
> > >>>> Adam
> > >>>>
> > >>>>> On Feb 19, 2019, at 12:04 PM, Robert Newson <rnewson@apache.org>
> wrote:
> > >>>>>
> > >>>>> I like the idea that we'd reuse the same pattern (but perhaps not
> the same _code_) for doc bodies, revtree and attachments.
> > >>>>>
> > >>>>> I hope we still get to delete couch_key_tree.erl, though.
> > >>>>>
> > >>>>> --
> > >>>>> Robert Samuel Newson
> > >>>>> rnewson@apache.org
> > >>>>>
> > >>>>> On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
> > >>>>>> I like the idea from a “trying a simple thing first” perspective,
> but
> > >>>>>> Nick’s points below are especially convincing to go with this for
> now.
> > >>>>>>
> > >>>>>> Best
> > >>>>>> Jan
> > >>>>>> —
> > >>>>>>
> > >>>>>>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatamane@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>> Hi,
> > >>>>>>>
> > >>>>>>> Sorry for jumping in so late, I was following from the sidelines
> mostly. A
> > >>>>>>> lot of good discussion is happening, and I am excited about the
> possibilities
> > >>>>>>> here.
> > >>>>>>>
> > >>>>>>> I do like the simpler "chunking" approach for a few reasons:
> > >>>>>>>
> > >>>>>>> * Most document bodies are probably going to be smaller than
> 100k. So in
> > >>>>>>> the majority of cases it would be one write / one read to update
> and fetch
> > >>>>>>> the document body.
> > >>>>>>>
> > >>>>>>> * We could reuse the chunking code for attachment handling and
> possibly
> > >>>>>>> revision key trees. So it's the general pattern of upload chunks
> to some
> > >>>>>>> prefix, and when finished flip an atomic toggle to make it
> current.
> > >>>>>>>
> > >>>>>>> * Do the same thing with revision trees and we could re-use the
> revision
> > >>>>>>> tree manipulation logic. That is, the key tree in most cases
> would be small
> > >>>>>>> enough to fit in 100k, but if it gets huge, it would get chunked.
> This would
> > >>>>>>> allow us to reuse all the battle-tested couch_key_tree code
> mostly as is.
> > >>>>>>> We even have property tests for it
> > >>>>>>>
> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
> > >>>>>>>
> > >>>>>>> * It removes the need to explain the max exploded path length
> limitation to
> > >>>>>>> customers.
> > >>>>>>>
> > >>>>>>> Cheers,
> > >>>>>>> -Nick
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <
> rnewson@apache.org> wrote:
> > >>>>>>>
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> An alternative storage model that we should seriously consider
> is to
> > >>>>>>>> follow our current approach in couch_file et al. Specifically,
> that the
> > >>>>>>>> document _body_ is stored as an uninterpreted binary value.
> This would be
> > >>>>>>>> much like the obvious plan for attachment storage; a key prefix
> that
> > >>>>>>>> identifies the database and document, where the final item of
> that key tuple
> > >>>>>>>> is an incrementing integer. Each of those keys has a binary
> value of up to
> > >>>>>>>> 100k. Fetching all values with that key prefix, in fdb's
> natural ordering,
> > >>>>>>>> will yield the full document body, which can be JSON decoded
> for further
> > >>>>>>>> processing.
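> > >>>>>>>>
> > >>>>>>>> For concreteness, a toy sketch of that layout (plain Python standing
> > >>>>>>>> in for FDB, made-up key names; not the actual implementation):
> > >>>>>>>>
> > >>>>>>>>     import json
> > >>>>>>>>
> > >>>>>>>>     kv = {}                  # a dict stands in for FDB
> > >>>>>>>>     MAX_VALUE = 100_000      # FDB's per-value size limit
> > >>>>>>>>
> > >>>>>>>>     def write_body(db, doc_id, doc):
> > >>>>>>>>         blob = json.dumps(doc).encode("utf-8")
> > >>>>>>>>         for n, off in enumerate(range(0, len(blob), MAX_VALUE)):
> > >>>>>>>>             kv[(db, "by_id", doc_id, n)] = blob[off:off + MAX_VALUE]
> > >>>>>>>>
> > >>>>>>>>     def read_body(db, doc_id):
> > >>>>>>>>         # emulates a range read over the doc's key prefix, in
> > >>>>>>>>         # FDB's natural key order
> > >>>>>>>>         prefix = (db, "by_id", doc_id)
> > >>>>>>>>         chunks = [kv[k] for k in sorted(kv) if k[:3] == prefix]
> > >>>>>>>>         return json.loads(b"".join(chunks))
> > >>>>>>>>
> > >>>>>>>>     write_body("mydb", "foo", {"owner": "bob", "mylist": [1, 3, 5]})
> > >>>>>>>>     assert read_body("mydb", "foo")["owner"] == "bob"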
> > >>>>>>>>
> > >>>>>>>> I like this idea, and I like Adam's original proposal to
> explode documents
> > >>>>>>>> into property paths. I have a slight preference for the
> simplicity of the
> > >>>>>>>> idea in the previous paragraph, not least because it's close to
> what we do
> > >>>>>>>> today. I also think it will be possible to migrate to
> alternative storage
> > >>>>>>>> models in the future, and foundationdb's transaction support means
> we can do
> > >>>>>>>> this migration seamlessly should we come to it.
> > >>>>>>>>
> > >>>>>>>> I'm very interested in knowing if anyone else is interested in
> going this
> > >>>>>>>> simple, or considers it a wasted opportunity relative to the
> 'exploded'
> > >>>>>>>> path.
> > >>>>>>>>
> > >>>>>>>> B.
> > >>>>>>>>
> > >>>>>>>> --
> > >>>>>>>> Robert Samuel Newson
> > >>>>>>>> rnewson@apache.org
> > >>>>>>>>
> > >>>>>>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
> > >>>>>>>>> I've been remiss here in not posting the data model ideas that
> IBM
> > >>>>>>>>> worked up while we were thinking about using FoundationDB so
> I'm posting
> > >>>>>>>>> it now. This is Adam Kocoloski's original work, I am just
> transcribing
> > >>>>>>>>> it, and this is the context that the folks from the IBM side
> came in
> > >>>>>>>>> with, for full disclosure.
> > >>>>>>>>>
> > >>>>>>>>> Basics
> > >>>>>>>>>
> > >>>>>>>>> 1. All CouchDB databases are inside a Directory
> > >>>>>>>>> 2. Each CouchDB database is a Directory within that Directory
> > >>>>>>>>> 3. It's possible to list all subdirectories of a Directory, so
> > >>>>>>>>> `_all_dbs` is the list of directories from 1.
> > >>>>>>>>> 4. Each Directory representing a CouchDB database has several
> Subspaces;
> > >>>>>>>>> 4a. by_id/ doc subspace: actual document contents
> > >>>>>>>>> 4b. by_seq/versionstamp subspace: for the _changes feed
> > >>>>>>>>> 4c. index_definitions, indexes, ...
> > >>>>>>>>>
> > >>>>>>>>> JSON Mapping
> > >>>>>>>>>
> > >>>>>>>>> A hierarchical JSON object naturally maps to multiple KV pairs
> in FDB:
> > >>>>>>>>>
> > >>>>>>>>> {
> > >>>>>>>>> “_id”: “foo”,
> > >>>>>>>>> “owner”: “bob”,
> > >>>>>>>>> “mylist”: [1,3,5],
> > >>>>>>>>> “mymap”: {
> > >>>>>>>>>     “blue”: “#0000FF”,
> > >>>>>>>>>     “red”: “#FF0000”
> > >>>>>>>>> }
> > >>>>>>>>> }
> > >>>>>>>>>
> > >>>>>>>>> maps to
> > >>>>>>>>>
> > >>>>>>>>> (“foo”, “owner”) = “bob”
> > >>>>>>>>> (“foo”, “mylist”, 0) = 1
> > >>>>>>>>> (“foo”, “mylist”, 1) = 3
> > >>>>>>>>> (“foo”, “mylist”, 2) = 5
> > >>>>>>>>> (“foo”, “mymap”, “blue”) = “#0000FF”
> > >>>>>>>>> (“foo”, “mymap”, “red”) = “#FF0000”
> > >>>>>>>>>
> > >>>>>>>>> NB: this means that the 100KB limit applies to individual
> leaves in the
> > >>>>>>>>> JSON object, not the entire doc
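> > >>>>>>>>>
> > >>>>>>>>> As an illustration of that mapping (a sketch only, not the proposed
> > >>>>>>>>> implementation), flattening a JSON object into key tuples could look
> > >>>>>>>>> like:
> > >>>>>>>>>
> > >>>>>>>>>     def explode(doc_id, node, path=()):
> > >>>>>>>>>         """Flatten a JSON value into ((doc_id, *path), scalar)."""
> > >>>>>>>>>         if isinstance(node, dict):
> > >>>>>>>>>             for k, v in node.items():
> > >>>>>>>>>                 yield from explode(doc_id, v, path + (k,))
> > >>>>>>>>>         elif isinstance(node, list):
> > >>>>>>>>>             for i, v in enumerate(node):
> > >>>>>>>>>                 yield from explode(doc_id, v, path + (i,))
> > >>>>>>>>>         else:
> > >>>>>>>>>             yield (doc_id,) + path, node
> > >>>>>>>>>
> > >>>>>>>>>     doc = {"owner": "bob", "mylist": [1, 3, 5],
> > >>>>>>>>>            "mymap": {"blue": "#0000FF", "red": "#FF0000"}}
> > >>>>>>>>>     for key, value in explode("foo", doc):
> > >>>>>>>>>         print(key, "=", value)
> > >>>>>>>>>     # ('foo', 'owner') = bob
> > >>>>>>>>>     # ('foo', 'mylist', 0) = 1   ... and so on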
> > >>>>>>>>>
> > >>>>>>>>> Edit Conflicts
> > >>>>>>>>>
> > >>>>>>>>> We need to account for the presence of conflicts in various
> levels of
> > >>>>>>>>> the doc due to replication.
> > >>>>>>>>>
> > >>>>>>>>> Proposal is to create a special value indicating that the
> subtree below
> > >>>>>>>>> our current cursor position is in an unresolvable conflict.
> Then add
> > >>>>>>>>> additional KV pairs below to describe the conflicting entries.
> > >>>>>>>>>
> > >>>>>>>>> KV data model allows us to store these efficiently and minimize
> > >>>>>>>>> duplication of data:
> > >>>>>>>>>
> > >>>>>>>>> A document with these two conflicts:
> > >>>>>>>>>
> > >>>>>>>>> {
> > >>>>>>>>> “_id”: “foo”,
> > >>>>>>>>> “_rev”: “1-abc”,
> > >>>>>>>>> “owner”: “alice”,
> > >>>>>>>>> “active”: true
> > >>>>>>>>> }
> > >>>>>>>>> {
> > >>>>>>>>> “_id”: “foo”,
> > >>>>>>>>> “_rev”: “1-def”,
> > >>>>>>>>> “owner”: “bob”,
> > >>>>>>>>> “active”: true
> > >>>>>>>>> }
> > >>>>>>>>>
> > >>>>>>>>> could be stored thus:
> > >>>>>>>>>
> > >>>>>>>>> (“foo”, “active”) = true
> > >>>>>>>>> (“foo”, “owner”) = kCONFLICT
> > >>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
> > >>>>>>>>> (“foo”, “owner”, “1-def”) = “bob”
> > >>>>>>>>>
> > >>>>>>>>> So long as `kCONFLICT` is set at the top of the conflicting
> subtree this
> > >>>>>>>>> representation can handle conflicts of different data types as
> well.
> > >>>>>>>>>
> > >>>>>>>>> Missing fields need to be handled explicitly:
> > >>>>>>>>>
> > >>>>>>>>> {
> > >>>>>>>>> “_id”: “foo”,
> > >>>>>>>>> “_rev”: “1-abc”,
> > >>>>>>>>> “owner”: “alice”,
> > >>>>>>>>> “active”: true
> > >>>>>>>>> }
> > >>>>>>>>>
> > >>>>>>>>> {
> > >>>>>>>>> “_id”: “foo”,
> > >>>>>>>>> “_rev”: “1-def”,
> > >>>>>>>>> “owner”: {
> > >>>>>>>>> “name”: “bob”,
> > >>>>>>>>> “email”: “bob@example.com”
> > >>>>>>>>> }
> > >>>>>>>>> }
> > >>>>>>>>>
> > >>>>>>>>> could be stored thus:
> > >>>>>>>>>
> > >>>>>>>>> (“foo”, “active”) = kCONFLICT
> > >>>>>>>>> (“foo”, “active”, “1-abc”) = true
> > >>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING
> > >>>>>>>>> (“foo”, “owner”) = kCONFLICT
> > >>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
> > >>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
> > >>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = ...
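> > >>>>>>>>>
> > >>>>>>>>> A toy sketch of reading one field back out of such a layout for a
> > >>>>>>>>> given revision (illustrative only; kCONFLICT/kMISSING are sentinel
> > >>>>>>>>> strings here, and this toy handles just one level of nesting):
> > >>>>>>>>>
> > >>>>>>>>>     kCONFLICT, kMISSING = "kCONFLICT", "kMISSING"
> > >>>>>>>>>
> > >>>>>>>>>     kvs = {
> > >>>>>>>>>         ("foo", "active"): kCONFLICT,
> > >>>>>>>>>         ("foo", "active", "1-abc"): True,
> > >>>>>>>>>         ("foo", "active", "1-def"): kMISSING,
> > >>>>>>>>>         ("foo", "owner"): kCONFLICT,
> > >>>>>>>>>         ("foo", "owner", "1-abc"): "alice",
> > >>>>>>>>>         ("foo", "owner", "1-def", "name"): "bob",
> > >>>>>>>>>         ("foo", "owner", "1-def", "email"): "bob@example.com",
> > >>>>>>>>>     }
> > >>>>>>>>>
> > >>>>>>>>>     def read_field(doc_id, field, rev):
> > >>>>>>>>>         top = kvs.get((doc_id, field))
> > >>>>>>>>>         if top != kCONFLICT:
> > >>>>>>>>>             return top               # field is not in conflict
> > >>>>>>>>>         # descend into the per-revision subtree
> > >>>>>>>>>         sub = {k[3:]: v for k, v in kvs.items()
> > >>>>>>>>>                if k[:3] == (doc_id, field, rev)}
> > >>>>>>>>>         if sub.get(()) == kMISSING:
> > >>>>>>>>>             return None              # absent in this revision
> > >>>>>>>>>         if () in sub:
> > >>>>>>>>>             return sub[()]           # scalar value
> > >>>>>>>>>         return {p[0]: v for p, v in sub.items()}
> > >>>>>>>>>
> > >>>>>>>>>     print(read_field("foo", "owner", "1-abc"))   # alice
> > >>>>>>>>>     print(read_field("foo", "owner", "1-def"))   # name/email dict
> > >>>>>>>>>     print(read_field("foo", "active", "1-def"))  # None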
> > >>>>>>>>>
> > >>>>>>>>> Revision Metadata
> > >>>>>>>>>
> > >>>>>>>>> * CouchDB uses a hash history for revisions
> > >>>>>>>>> ** Each edit is identified by the hash of the content of the
> edit
> > >>>>>>>>> including the base revision against which it was applied
> > >>>>>>>>> ** Individual edit branches are bounded in length but the
> number of
> > >>>>>>>>> branches is potentially unbounded
> > >>>>>>>>>
> > >>>>>>>>> * Size limits preclude us from storing the entire key tree as
> a single
> > >>>>>>>>> value; in pathological situations
> > >>>>>>>>> the tree could exceed 100KB (each entry is > 16 bytes)
> > >>>>>>>>>
> > >>>>>>>>> * Store each edit branch as a separate KV including deleted
> status in a
> > >>>>>>>>> special subspace
> > >>>>>>>>>
> > >>>>>>>>> * Structure key representation so that “winning” revision can
> be
> > >>>>>>>>> automatically retrieved in a limit=1
> > >>>>>>>>> key range operation
> > >>>>>>>>>
> > >>>>>>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
> > >>>>>>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”, ”2-bar”, ”1-foo”]   <-- winner
> > >>>>>>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
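> > >>>>>>>>>
> > >>>>>>>>> A tiny sketch of why that key shape makes the winner cheap to find
> > >>>>>>>>> (a sorted Python list stands in for the _meta subspace; in FDB this
> > >>>>>>>>> would be a reverse range read with limit=1):
> > >>>>>>>>>
> > >>>>>>>>>     # key layout: (doc, "_meta", deleted_flag, branch_len, rev)
> > >>>>>>>>>     meta = sorted([
> > >>>>>>>>>         ("foo", "_meta", "deleted=false", 1, "def"),
> > >>>>>>>>>         ("foo", "_meta", "deleted=false", 4, "bif"),
> > >>>>>>>>>         ("foo", "_meta", "deleted=true", 3, "abc"),
> > >>>>>>>>>     ])
> > >>>>>>>>>
> > >>>>>>>>>     def winner(doc_id):
> > >>>>>>>>>         # the largest not-deleted key sorts last, so it wins
> > >>>>>>>>>         live = [k for k in meta
> > >>>>>>>>>                 if k[:3] == (doc_id, "_meta", "deleted=false")]
> > >>>>>>>>>         return live[-1] if live else None
> > >>>>>>>>>
> > >>>>>>>>>     print(winner("foo"))
> > >>>>>>>>>     # ('foo', '_meta', 'deleted=false', 4, 'bif')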
> > >>>>>>>>>
> > >>>>>>>>> Changes Feed
> > >>>>>>>>>
> > >>>>>>>>> * FDB supports a concept called a versionstamp — a 10 byte,
> unique,
> > >>>>>>>>> monotonically (but not sequentially) increasing value for each
> committed
> > >>>>>>>>> transaction. The first 8 bytes are the committed version of the
> > >>>>>>>>> database. The last 2 bytes are monotonic in the serialization
> order for
> > >>>>>>>>> transactions.
> > >>>>>>>>>
> > >>>>>>>>> * A transaction can specify a particular index into a key
> where the
> > >>>>>>>>> following 10 bytes will be overwritten by the versionstamp at
> commit
> > >>>>>>>>> time
> > >>>>>>>>>
> > >>>>>>>>> * A subspace keyed on versionstamp naturally yields a _changes
> feed
> > >>>>>>>>>
> > >>>>>>>>> by_seq subspace
> > >>>>>>>>> (“versionstamp1”) = (“foo”, “1-abc”)
> > >>>>>>>>> (“versionstamp4”) = (“bar”, “4-def”)
> > >>>>>>>>>
> > >>>>>>>>> by_id subspace
> > >>>>>>>>> (“bar”, “_vsn”) = “versionstamp4”
> > >>>>>>>>> ...
> > >>>>>>>>> (“foo”, “_vsn”) = “versionstamp1”
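> > >>>>>>>>>
> > >>>>>>>>> A toy sketch of the bookkeeping (a plain counter stands in for the
> > >>>>>>>>> versionstamp, which FDB would really assign at commit time, and
> > >>>>>>>>> dicts stand in for the two subspaces):
> > >>>>>>>>>
> > >>>>>>>>>     by_seq, by_id, _vsn = {}, {}, [0]
> > >>>>>>>>>
> > >>>>>>>>>     def commit_update(doc_id, rev):
> > >>>>>>>>>         _vsn[0] += 1                    # fake versionstamp
> > >>>>>>>>>         old = by_id.get((doc_id, "_vsn"))
> > >>>>>>>>>         if old is not None:
> > >>>>>>>>>             del by_seq[old]             # doc appears once in _changes
> > >>>>>>>>>         by_seq[_vsn[0]] = (doc_id, rev)
> > >>>>>>>>>         by_id[(doc_id, "_vsn")] = _vsn[0]
> > >>>>>>>>>
> > >>>>>>>>>     commit_update("foo", "1-abc")
> > >>>>>>>>>     commit_update("bar", "4-def")
> > >>>>>>>>>     commit_update("foo", "2-xyz")
> > >>>>>>>>>     # the _changes feed is just a scan of by_seq in key order
> > >>>>>>>>>     print([by_seq[v] for v in sorted(by_seq)])
> > >>>>>>>>>     # [('bar', '4-def'), ('foo', '2-xyz')]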
> > >>>>>>>>>
> > >>>>>>>>> JSON Indexes
> > >>>>>>>>>
> > >>>>>>>>> * “Mango” JSON indexes are defined by
> > >>>>>>>>> ** a list of field names, each of which may be nested,
> > >>>>>>>>> ** an optional partial_filter_selector which constrains the
> set of docs
> > >>>>>>>>> that contribute
> > >>>>>>>>> ** an optional name defined by the ddoc field (the name is
> auto-
> > >>>>>>>>> generated if not supplied)
> > >>>>>>>>>
> > >>>>>>>>> * Store index definitions in a single subspace to aid query
> planning
> > >>>>>>>>> ** ((person,name), title, email) = (“name-title-email”,
> “{“student”:
> > >>>>>>>>> true}”)
> > >>>>>>>>> ** Store the values for each index in a dedicated subspace,
> adding the
> > >>>>>>>>> document ID as the last element in the tuple
> > >>>>>>>>> *** (“rosie revere”, “engineer”, “rosie@example.com”, “foo”)
> = null
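> > >>>>>>>>>
> > >>>>>>>>> A small sketch of building such an index entry from a doc (made-up
> > >>>>>>>>> helper, following the example key above; not the actual Mango code):
> > >>>>>>>>>
> > >>>>>>>>>     def index_entry(doc_id, doc, fields):
> > >>>>>>>>>         """Index key = indexed field values + doc id, value = null."""
> > >>>>>>>>>         values = tuple(doc.get(f) for f in fields)
> > >>>>>>>>>         if any(v is None for v in values):
> > >>>>>>>>>             return None          # doc does not contribute
> > >>>>>>>>>         return (values + (doc_id,), None)
> > >>>>>>>>>
> > >>>>>>>>>     doc = {"name": "rosie revere", "title": "engineer",
> > >>>>>>>>>            "email": "rosie@example.com", "student": True}
> > >>>>>>>>>     print(index_entry("foo", doc, ["name", "title", "email"]))
> > >>>>>>>>>     # (('rosie revere', 'engineer', 'rosie@example.com', 'foo'), None)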
> > >>>>>>>>>
> > >>>>>>>>> B.
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Robert Samuel Newson
> > >>>>>>>>> rnewson@apache.org
> > >>>>>>>>>
> > >>>>>>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> I want to fix previous mistakes. I made two mistakes in
> previous
> > >>>>>>>>>> calculations:
> > >>>>>>>>>> - I used 1Kb as base size for calculating expansion factor
> (although
> > >>>>>>>> we
> > >>>>>>>>>> don't know exact size of original document)
> > >>>>>>>>>> - The expansion factor calculation included number of
> revisions (it
> > >>>>>>>>>> shouldn't)
> > >>>>>>>>>>
> > >>>>>>>>>> I'll focus on flattened JSON docs model
> > >>>>>>>>>>
> > >>>>>>>>>> The following formula was used in the previous calculation:
> > >>>>>>>>>>
> > >>>>>>>>>> storage_size_per_document = mapping_table_size*number_of_revisions +
> > >>>>>>>>>>   depth*number_of_paths*number_of_revisions +
> > >>>>>>>>>>   number_of_paths*value_size*number_of_revisions
> > >>>>>>>>>>
> > >>>>>>>>>> To clarify things a little bit, I want to calculate the space
> > >>>>>>>>>> requirement for a single revision this time.
> > >>>>>>>>>>
> > >>>>>>>>>> mapping_table_size = number_of_field_names*(field_name_length + 4 (integer size))
> > >>>>>>>>>>   = 100 * (20 + 4) = 2400 bytes
> > >>>>>>>>>>
> > >>>>>>>>>> storage_size_per_document_per_revision_per_replica = mapping_table_size +
> > >>>>>>>>>>   depth*number_of_paths + value_size*number_of_paths
> > >>>>>>>>>>   = 2400 bytes + 10*1000 + 1000*100 = 112400 bytes ~= 110 Kb
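> > >>>>>>>>>>
> > >>>>>>>>>> The same arithmetic as a quick Python sanity check (numbers as above):
> > >>>>>>>>>>
> > >>>>>>>>>>     mapping_table_size = 100 * (20 + 4)       # 2400 bytes
> > >>>>>>>>>>     per_revision = (mapping_table_size
> > >>>>>>>>>>                     + 10 * 1000               # depth * paths
> > >>>>>>>>>>                     + 100 * 1000)             # value_size * paths
> > >>>>>>>>>>     print(per_revision)                       # 112400 (~110 Kb)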
> > >>>>>>>>>>
> > >>>>>>>>>> We definitely can reduce requirement for mapping table by
> adopting
> > >>>>>>>>>> rnewson's idea of a schema.
> > >>>>>>>>>>
> > >>>>>>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iilyak@apache.org>
> wrote:
> > >>>>>>>>>>> Hi Michael,
> > >>>>>>>>>>>
> > >>>>>>>>>>>> For example, here’s a crazy thought:
> > >>>>>>>>>>>> Map every distinct occurrence of a key/value instance
> through a
> > >>>>>>>> crypto hash
> > >>>>>>>>>>>> function to get a set of hashes.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> These can be precomputed by Couch without any lookups in
> FDB.
> > >>>>>>>> These
> > >>>>>>>>>>>> will be spread all over kingdom come in FDB and not lend
> > >>>>>>>> themselves to
> > >>>>>>>>>>>> range search well.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So what you do is index them for frequency of occurring in
> the
> > >>>>>>>> same set.
> > >>>>>>>>>>>> In essence, you 'bucket them' statistically, and that
> bucket id
> > >>>>>>>> becomes a
> > >>>>>>>>>>>> key prefix. A crypto hash value can be copied into more
> than one
> > >>>>>>>> bucket.
> > >>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
> > >>>>>>>>>>>
> > >>>>>>>>>>>> When writing a document, Couch submits the list/array of
> > >>>>>>>> cryptohash values
> > >>>>>>>>>>>> it computed to FDB and gets back the corresponding
> {val_id} (the
> > >>>>>>>> id with
> > >>>>>>>>>>>> the bucket prefixed).  This can get somewhat expensive if
> there's
> > >>>>>>>> always a
> > >>>>>>>>>>>> lot of app local cache misses.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A document's value is then a series of {val_id} arrays up
> to 100k
> > >>>>>>>> per
> > >>>>>>>>>>>> segment.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> When retrieving a document, you get the val_ids, find the
> distinct
> > >>>>>>>> buckets
> > >>>>>>>>>>>> and min/max entries for this doc, and then parallel query
> each
> > >>>>>>>> bucket while
> > >>>>>>>>>>>> reconstructing the document.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Interesting idea. Let's try to think it through to see if we
> can
> > >>>>>>>> make it viable.
> > >>>>>>>>>>> Let's go through hypothetical example. Input data for the
> example:
> > >>>>>>>>>>> - 1M of documents
> > >>>>>>>>>>> - each document is around 10Kb
> > >>>>>>>>>>> - each document consists of 1K of unique JSON paths
> > >>>>>>>>>>> - each document has 100 unique JSON field names
> > >>>>>>>>>>> - every scalar value is 100 bytes
> > >>>>>>>>>>> - 10% of unique JSON paths for every document already stored
> in
> > >>>>>>>> database under different doc or different revision of the
> current one
> > >>>>>>>>>>> - we assume 3 independent copies for every key-value pair in
> FDB
> > >>>>>>>>>>> - our hash key size is 32 bytes
> > >>>>>>>>>>> - let's assume we can determine if key is already on the
> storage
> > >>>>>>>> without doing query
> > >>>>>>>>>>> - 1% of paths is in cache (unrealistic value, in real life
> the
> > >>>>>>>> percentage is lower)
> > >>>>>>>>>>> - every JSON field name is 20 bytes
> > >>>>>>>>>>> - every JSON path is 10 levels deep
> > >>>>>>>>>>> - document key prefix length is 50
> > >>>>>>>>>>> - every document has 10 revisions
> > >>>>>>>>>>> Let's estimate the storage requirements and size of data we
> need to
> > >>>>>>>> transmit. The calculations are not exact.
> > >>>>>>>>>>> 1. storage_size_per_document (we cannot estimate exact
> numbers since
> > >>>>>>>> we don't know how FDB stores it)
> > >>>>>>>>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32
> bytes) =
> > >>>>>>>> 38Kb * 10 * 3 = 1140 Kb (11x)
> > >>>>>>>>>>> 2. number of independent keys to retrieve on document read
> > >>>>>>>> (non-range queries) per document
> > >>>>>>>>>>> - 1K - (1K * 1%) = 990
> > >>>>>>>>>>> 3. number of range queries: 0
> > >>>>>>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes +
> 32
> > >>>>>>>> bytes) = 102 Kb (10x)
> > >>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from
> > >>>>>>>> https://apple.github.io/foundationdb/performance.html)
> > >>>>>>>>>>> - sequential: 990*2ms = 1980ms
> > >>>>>>>>>>> - range: 0
> > >>>>>>>>>>> Let's compare these numbers with initial proposal (flattened
> JSON
> > >>>>>>>> docs without global schema and without cache)
> > >>>>>>>>>>> 1. storage_size_per_document
> > >>>>>>>>>>> - mapping table size: 100 * (20 + 4(integer size)) = 2400
> bytes
> > >>>>>>>>>>> - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
> > >>>>>>>>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10
> =
> > >>>>>>>> 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
> > >>>>>>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on
> index
> > >>>>>>>> structure)
> > >>>>>>>>>>> 3. number of range queries: 1 (1001 of keys in result)
> > >>>>>>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 =
> 23.6 Kb
> > >>>>>>>> (2.4x)
> > >>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from
> > >>>>>>>> https://apple.github.io/foundationdb/performance.html and
> estimate range
> > >>>>>>>> read performance based on numbers from
> > >>>>>>>>
> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test
> > >>>>>>>> )
> > >>>>>>>>>>> - range read performance: Given read performance is about
> 305,000
> > >>>>>>>> reads/second and range performance 3,600,000 keys/second we
> estimate range
> > >>>>>>>> performance to be 11.8x compared to read performance. If read
> performance
> > >>>>>>>> is 2ms then range performance is 0.169ms (which is hard to
> believe).
> > >>>>>>>>>>> - sequential: 2 * 2 = 4ms
> > >>>>>>>>>>> - range: 0.169
> > >>>>>>>>>>>
> > >>>>>>>>>>> It looks like we are dealing with a tradeoff:
> > >>>>>>>>>>> - Map every distinct occurrence of a key/value instance
> through a
> > >>>>>>>> crypto hash:
> > >>>>>>>>>>> - 5.39x more disk space efficient
> > >>>>>>>>>>> - 474x slower
> > >>>>>>>>>>> - flattened JSON model
> > >>>>>>>>>>> - 5.39x less efficient in disk space
> > >>>>>>>>>>> - 474x faster
> > >>>>>>>>>>>
> > >>>>>>>>>>> In any case this unscientific exercise was very helpful,
> since it
> > >>>>>>>> uncovered the high cost in terms of disk space. 59.3x of
> original disk size
> > >>>>>>>> is too much IMO.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Are there any ways we can make Michael's model more performant?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Also I don't quite understand a few aspects of the global hash
> table
> > >>>>>>>> proposal:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. > - Map every distinct occurrence of a key/value instance
> through
> > >>>>>>>> a crypto hash function to get a set of hashes.
> > >>>>>>>>>>> I think we are talking only about scalar values here? I.e.
> > >>>>>>>> `"#/foo.bar.baz": 123`
> > >>>>>>>>>>> Since I don't know how we can make it work for all possible
> JSON
> > >>>>>>>> paths `{"foo": {"bar": {"size": 12, "baz": 123}}}":
> > >>>>>>>>>>> - foo
> > >>>>>>>>>>> - foo.bar
> > >>>>>>>>>>> - foo.bar.baz
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2. how to delete documents
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best regards,
> > >>>>>>>>>>> ILYA
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 2019/01/30 23:33:22, Michael Fair <
> michael@daclubhouse.net>
> > >>>>>>>> wrote:
> > >>>>>>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <
> kocolosk@apache.org
> > >>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hi Michael,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Yes that’s definitely one way to address storage of edit
> > >>>>>>>> conflicts. I
> > >>>>>>>>>>>>> think there are other, more compact representations that
> we can
> > >>>>>>>> explore if
> > >>>>>>>>>>>>> we have this “exploded” data model where each scalar value
> maps
> > >>>>>>>> to an
> > >>>>>>>>>>>>> individual KV pair.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I agree, as I mentioned on the original thread, I see a
> scheme,
> > >>>>>>>> that
> > >>>>>>>>>>>> handles both conflicts and revisions, where you only have
> to store
> > >>>>>>>> the most
> > >>>>>>>>>>>> recent change to a field.  Like you suggested, multiple
> revisions
> > >>>>>>>> can share
> > >>>>>>>>>>>> a key.  Which in my mind's eye further begs the
> conflicts/revisions
> > >>>>>>>>>>>> discussion along with the working within the limits
> discussion
> > >>>>>>>> because it
> > >>>>>>>>>>>> seems to me they are all intrinsically related as a
> "feature".
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Saying 'We'll break documents up into roughly 80k
> segments', then
> > >>>>>>>> trying to
> > >>>>>>>>>>>> overlay some kind of field sharing scheme for
> revisions/conflicts
> > >>>>>>>> doesn't
> > >>>>>>>>>>>> seem like it will work.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I probably should have left out the trivial fix proposal as
> I
> > >>>>>>>> don't think
> > >>>>>>>>>>>> it's a feasible solution to actually use.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The comment is more regarding that I do not see how this
> thread
> > >>>>>>>> can escape
> > >>>>>>>>>>>> including how to store/retrieve conflicts/revisions.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For instance, the 'doc as individual fields' proposal lends
> itself
> > >>>>>>>> to value
> > >>>>>>>>>>>> sharing across multiple documents (and I don't just mean
> revisions
> > >>>>>>>> of the
> > >>>>>>>>>>>> same doc, I mean the same key/value instance could be
> shared for
> > >>>>>>>> every
> > >>>>>>>>>>>> document).
> > >>>>>>>>>>>> However that's not really relevant if we're not considering
> the
> > >>>>>>>> amount of
> > >>>>>>>>>>>> shared information across documents in the storage scheme.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Simply storing documents in <100k segments (perhaps in some
> kind of
> > >>>>>>>>>>>> compressed binary representation) to deal with that FDB
> limit
> > >>>>>>>> seems fine.
> > >>>>>>>>>>>> The only reason to consider doing something else is because
> of its
> > >>>>>>>> impact
> > >>>>>>>>>>>> to indexing, searches, reduce functions, revisions, on-disk
> size
> > >>>>>>>> impact,
> > >>>>>>>>>>>> etc.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>> I'm assuming the process will flatten the key paths of the
> > >>>>>>>> document into
> > >>>>>>>>>>>>> an array and then request the value of each key as multiple
> > >>>>>>>> parallel
> > >>>>>>>>>>>>> queries against FDB at once
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s
> trying
> > >>>>>>>> to design a
> > >>>>>>>>>>>>> model which allows the retrieval of a document with a
> single
> > >>>>>>>> range read,
> > >>>>>>>>>>>>> which is a good goal in my opinion.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I am not sure I agree.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Think of bitTorrent, a single range read should pull back
> the
> > >>>>>>>> structure of
> > >>>>>>>>>>>> the document (the pieces to fetch), but not necessarily the
> whole
> > >>>>>>>> document.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> What if you already have a bunch of pieces in common with
> other
> > >>>>>>>> documents
> > >>>>>>>>>>>> locally (a repeated header/footer/ or type for example);
> and you
> > >>>>>>>> only need
> > >>>>>>>>>>>> to get a few pieces of data you don't already have?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The real goal I see for Couch is to treat your document set
> like the
> > >>>>>>>>>>>> collection of structured information that it is.  In some
> respects
> > >>>>>>>> like an
> > >>>>>>>>>>>> extension of your application's heap space for structured
> objects
> > >>>>>>>> and
> > >>>>>>>>>>>> efficiently querying that collection to get back subsets of
> the
> > >>>>>>>> data.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Otherwise it seems more like a slightly upgraded file
> system plus
> > >>>>>>>> a fancy
> > >>>>>>>>>>>> grep/find like feature...
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The best way I see to unlock more features/power is to a
> move
> > >>>>>>>> towards a
> > >>>>>>>>>>>> more granular and efficient way to store and retrieve the
> scalar
> > >>>>>>>> values...
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> For example, here’s a crazy thought:
> > >>>>>>>>>>>> Map every distinct occurrence of a key/value instance
> through a
> > >>>>>>>> crypto hash
> > >>>>>>>>>>>> function to get a set of hashes.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> These can be precomputed by Couch without any lookups in
> FDB.
> > >>>>>>>> These
> > >>>>>>>>>>>> will be spread all over kingdom come in FDB and not lend
> > >>>>>>>> themselves to
> > >>>>>>>>>>>> range search well.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So what you do is index them for frequency of occurring in
> the
> > >>>>>>>> same set.
> > >>>>>>>>>>>> In essence, you 'bucket them' statistically, and that
> bucket id
> > >>>>>>>> becomes a
> > >>>>>>>>>>>> key prefix. A crypto hash value can be copied into more
> than one
> > >>>>>>>> bucket.
> > >>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> When writing a document, Couch submits the list/array of
> > >>>>>>>> cryptohash values
> > >>>>>>>>>>>> it computed to FDB and gets back the corresponding
> {val_id} (the
> > >>>>>>>> id with
> > >>>>>>>>>>>> the bucket prefixed).  This can get somewhat expensive if
> there's
> > >>>>>>>> always a
> > >>>>>>>>>>>> lot of app local cache misses.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A document's value is then a series of {val_id} arrays up
> to 100k
> > >>>>>>>> per
> > >>>>>>>>>>>> segment.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> When retrieving a document, you get the val_ids, find the
> distinct
> > >>>>>>>> buckets
> > >>>>>>>>>>>> and min/max entries for this doc, and then parallel query
> each
> > >>>>>>>> bucket while
> > >>>>>>>>>>>> reconstructing the document.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The values returned from the buckets query are the key/value
> > >>>>>>>> strings
> > >>>>>>>>>>>> required to reassemble this document.
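> > >>>>>>>>>>>>
> > >>>>>>>>>>>> A very rough sketch of the write/read side of that idea (plain
> > >>>>>>>>>>>> Python; the two-character "bucket" here is a stand-in for the
> > >>>>>>>>>>>> statistical bucketing described above, not a real scheme):
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>     import hashlib, json
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>     values = {}   # {val_id: (path, scalar)} shared across docs
> > >>>>>>>>>>>>     docs = {}     # {doc_id: [val_id, ...]}
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>     def val_id(path, scalar):
> > >>>>>>>>>>>>         blob = json.dumps([path, scalar]).encode("utf-8")
> > >>>>>>>>>>>>         h = hashlib.sha256(blob).hexdigest()
> > >>>>>>>>>>>>         return h[:2] + "/" + h    # {bucket_id}/{cryptohash}
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>     def write_doc(doc_id, flat_pairs):
> > >>>>>>>>>>>>         ids = []
> > >>>>>>>>>>>>         for path, scalar in flat_pairs:
> > >>>>>>>>>>>>             vid = val_id(path, scalar)
> > >>>>>>>>>>>>             values.setdefault(vid, (path, scalar))  # dedup
> > >>>>>>>>>>>>             ids.append(vid)
> > >>>>>>>>>>>>         docs[doc_id] = ids
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>     def read_doc(doc_id):
> > >>>>>>>>>>>>         return [values[v] for v in docs[doc_id]]
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>     write_doc("doc1", [("owner", "bob"), ("active", True)])
> > >>>>>>>>>>>>     write_doc("doc2", [("owner", "bob")])   # shares one val_id
> > >>>>>>>>>>>>     print(read_doc("doc2"))                 # [('owner', 'bob')]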
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> ----------
> > >>>>>>>>>>>> I put this forward primarily to highlight the idea that trying
> to
> > >>>>>>>> match the
> > >>>>>>>>>>>> storage representation of documents in a straightforward
> way to
> > >>>>>>>> FDB keys
> > >>>>>>>>>>>> to reduce query count might not be the most
> performance-oriented
> > >>>>>>>> approach.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'd much prefer a storage approach that reduced data
> duplication
> > >>>>>>>> and
> > >>>>>>>>>>>> enabled fast sub-document queries.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This clearly falls in the realm of what people want the
> 'use case'
> > >>>>>>>> of Couch
> > >>>>>>>>>>>> to be/become.  By giving Couch more access to sub-document
> > >>>>>>>> queries, I could
> > >>>>>>>>>>>> eventually see queries as complicated as GraphQL submitted
> to
> > >>>>>>>> Couch and
> > >>>>>>>>>>>> pulling back ad-hoc aggregated data across multiple
> documents in a
> > >>>>>>>> single
> > >>>>>>>>>>>> application layer request.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hehe - one way to look at the database of Couch documents
> is that
> > >>>>>>>> they are
> > >>>>>>>>>>>> all conflict revisions of the single root empty document.
>  What I
> > >>>>>>>> mean by
> > >>>>>>>>>>>> this is consider thinking of the entire document store as
> one
> > >>>>>>>> giant DAG of
> > >>>>>>>>>>>> key/value pairs. How even separate documents are still
> typically
> > >>>>>>>> related to
> > >>>>>>>>>>>> each other.  For most applications there is a tremendous
> amount of
> > >>>>>>>> data
> > >>>>>>>>>>>> redundancy between docs and especially between revisions of
> those
> > >>>>>>>> docs...
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> And all this is a long way of saying "I think there could
> be a lot
> > >>>>>>>> of value
> > >>>>>>>>>>>> in assuming documents are 'assembled' from multiple queries
> to
> > >>>>>>>> FDB, with
> > >>>>>>>>>>>> local caching, instead of simply retrieved"
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks, I hope I'm not the only outlier here thinking this
> way!?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Mike :-)
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>
> > >
> >
>
