couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jan Lehnardt <...@apache.org>
Subject Re: [DISCUSS] : things we need to solve/decide : storing JSON documents
Date Tue, 19 Feb 2019 17:03:30 GMT
I like the idea from a “trying a simple thing first” perspective, but Nick’s points below
are especially convincing to with this for now.

Best
Jan
—

> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatamane@gmail.com> wrote:
> 
> Hi,
> 
> Sorry for jumping in so late, I was following from the sidelines mostly. A
> lot of good discussion happening and am excited about the possibilities
> here.
> 
> I do like the simpler "chunking" approach for a few reasons:
> 
> * Most documents bodies are probably going to be smaller than 100k. So in
> the majority of case it would be one write / one read to update and fetch
> the document body.
> 
> * We could reuse the chunking code for attachment handling and possibly
> revision key trees. So it's the general pattern of upload chunks to some
> prefix, and when finished flip an atomic toggle to make it current.
> 
> * Do the same thing with revision trees and we could re-use the revision
> tree manipulation logic. That is, the key tree in most cases would be small
> enough to fit in 100k but if they get huge, they'd get chunked. This would
> allow us to reuse all the battle tested couch_key_tree code mostly as is.
> We even have property tests for it
> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
> 
> * It removes the need to explain the max exploded path length limitation to
> customers.
> 
> Cheers,
> -Nick
> 
> 
> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnewson@apache.org> wrote:
> 
>> Hi,
>> 
>> An alternative storage model that we should seriously consider is to
>> follow our current approach in couch_file et al. Specifically, that the
>> document _body_ is stored as an uninterpreted binary value. This would be
>> much like the obvious plan for attachment storage; a key prefix that
>> identifies the database and document, with the final item of that key tuple
>> is an incrementing integer. Each of those keys has a binary value of up to
>> 100k. Fetching all values with that key prefix, in fdb's natural ordering,
>> will yield the full document body, which can be JSON decoded for further
>> processing.
>> 
>> I like this idea, and I like Adam's original proposal to explode documents
>> into property paths. I have a slight preference for the simplicity of the
>> idea in the previous paragraph, not least because it's close to what we do
>> today. I also think it will be possible to migrate to alternative storage
>> models in future, and foundationdb's transaction supports means we can do
>> this migration seamlessly should we come to it.
>> 
>> I'm very interested in knowing if anyone else is interested in going this
>> simple, or considers it a wasted opportunity relative to the 'exploded'
>> path.
>> 
>> B.
>> 
>> --
>>  Robert Samuel Newson
>>  rnewson@apache.org
>> 
>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
>>> I've been remiss here in not posting the data model ideas that IBM
>>> worked up while we were thinking about using FoundationDB so I'm posting
>>> it now. This is Adam' Kocoloski's original work, I am just transcribing
>>> it, and this is the context that the folks from the IBM side came in
>>> with, for full disclosure.
>>> 
>>> Basics
>>> 
>>> 1. All CouchDB databases are inside a Directory
>>> 2. Each CouchDB database is a Directory within that Directory
>>> 3. It's possible to list all subdirectories of a Directory, so
>>> `_all_dbs` is the list of directories from 1.
>>> 4. Each Directory representing a CouchdB database has several Subspaces;
>>> 4a. by_id/ doc subspace: actual document contents
>>> 4b. by_seq/versionstamp subspace: for the _changes feed
>>> 4c. index_definitions, indexes, ...
>>> 
>>> JSON Mapping
>>> 
>>> A hierarchical JSON object naturally maps to multiple KV pairs in FDB:
>>> 
>>> {
>>>    “_id”: “foo”,
>>>    “owner”: “bob”,
>>>    “mylist”: [1,3,5],
>>>    “mymap”: {
>>>        “blue”: “#0000FF”,
>>>        “red”: “#FF0000”
>>>    }
>>> }
>>> 
>>> maps to
>>> 
>>> (“foo”, “owner”) = “bob”
>>> (“foo”, “mylist”, 0) = 1
>>> (“foo”, “mylist”, 1) = 3
>>> (“foo”, “mylist”, 2) = 5
>>> (“foo”, “mymap”, “blue”) = “#0000FF”
>>> (“foo”, “mymap”, “red”) = “#FF0000”
>>> 
>>> NB: this means that the 100KB limit applies to individual leafs in the
>>> JSON object, not the entire doc
>>> 
>>> Edit Conflicts
>>> 
>>> We need to account for the presence of conflicts in various levels of
>>> the doc due to replication.
>>> 
>>> Proposal is to create a special value indicating that the subtree below
>>> our current cursor position is in an unresolvable conflict. Then add
>>> additional KV pairs below to describe the conflicting entries.
>>> 
>>> KV data model allows us to store these efficiently and minimize
>>> duplication of data:
>>> 
>>> A document with these two conflicts:
>>> 
>>> {
>>>    “_id”: “foo”,
>>>    “_rev”: “1-abc”,
>>>    “owner”: “alice”,
>>>    “active”: true
>>> }
>>> {
>>>    “_id”: “foo”,
>>>    “_rev”: “1-def”,
>>>    “owner”: “bob”,
>>>    “active”: true
>>> }
>>> 
>>> could be stored thus:
>>> 
>>> (“foo”, “active”) = true
>>> (“foo”, “owner”) = kCONFLICT
>>> (“foo”, “owner”, “1-abc”) = “alice”
>>> (“foo”, “owner”, “1-def”) = “bob”
>>> 
>>> So long as `kCONFLICT` is set at the top of the conflicting subtree this
>>> representation can handle conflicts of different data types as well.
>>> 
>>> Missing fields need to be handled explicitly:
>>> 
>>> {
>>>  “_id”: “foo”,
>>>  “_rev”: “1-abc”,
>>>  “owner”: “alice”,
>>>  “active”: true
>>> }
>>> 
>>> {
>>>  “_id”: “foo”,
>>>  “_rev”: “1-def”,
>>>  “owner”: {
>>>    “name”: “bob”,
>>>    “email”: “
>>> bob@example.com
>>> "
>>>  }
>>> }
>>> 
>>> could be stored thus:
>>> 
>>> (“foo”, “active”) = kCONFLICT
>>> (“foo”, “active”, “1-abc”) = true
>>> (“foo”, “active”, “1-def”) = kMISSING
>>> (“foo”, “owner”) = kCONFLICT
>>> (“foo”, “owner”, “1-abc”) = “alice”
>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
>>> (“foo”, “owner”, “1-def”, “email”) = ...
>>> 
>>> Revision Metadata
>>> 
>>> * CouchDB uses a hash history for revisions
>>> ** Each edit is identified by the hash of the content of the edit
>>> including the base revision against which it was applied
>>> ** Individual edit branches are bounded in length but the number of
>>> branches is potentially unbounded
>>> 
>>> * Size limits preclude us from storing the entire key tree as a single
>>> value; in pathological situations
>>> the tree could exceed 100KB (each entry is > 16 bytes)
>>> 
>>> * Store each edit branch as a separate KV including deleted status in a
>>> special subspace
>>> 
>>> * Structure key representation so that “winning” revision can be
>>> automatically retrieved in a limit=1
>>> key range operation
>>> 
>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”,”2-bar”,”1-foo”]
>>> <-- winner
>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
>>> 
>>> Changes Feed
>>> 
>>> * FDB supports a concept called a versionstamp — a 10 byte, unique,
>>> monotonically (but not sequentially) increasing value for each committed
>>> transaction. The first 8 bytes are the committed version of the
>>> database. The last 2 bytes are monotonic in the serialization order for
>>> transactions.
>>> 
>>> * A transaction can specify a particular index into a key where the
>>> following 10 bytes will be overwritten by the versionstamp at commit
>>> time
>>> 
>>> * A subspace keyed on versionstamp naturally yields a _changes feed
>>> 
>>> by_seq subspace
>>>  (“versionstamp1”) = (“foo”, “1-abc”)
>>>  (“versionstamp4”) = (“bar”, “4-def”)
>>> 
>>> by_id subspace
>>>  (“bar”, “_vsn”) = “versionstamp4”
>>>  ...
>>>  (“foo”, “_vsn”) = “versionstamp1”
>>> 
>>> JSON Indexes
>>> 
>>> * “Mango” JSON indexes are defined by
>>> ** a list of field names, each of which may be nested,
>>> ** an optional partial_filter_selector which constrains the set of docs
>>> that contribute
>>> ** an optional name defined by the ddoc field (the name is auto-
>>> generated if not supplied)
>>> 
>>> * Store index definitions in a single subspace to aid query planning
>>> ** ((person,name), title, email) = (“name-title-email”, “{“student”:
>>> true}”)
>>> ** Store the values for each index in a dedicated subspace, adding the
>>> document ID as the last element in the tuple
>>> *** (“rosie revere”, “engineer”, “rosie@example.com", “foo”) =
null
>>> 
>>> B.
>>> 
>>> --
>>>  Robert Samuel Newson
>>>  rnewson@apache.org
>>> 
>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
>>>> 
>>>> I want to fix previous mistakes. I did two mistakes in previous
>>>> calculations:
>>>> - I used 1Kb as base size for calculating expansion factor (although
>> we
>>>> don't know exact size of original document)
>>>> - The expansion factor calculation included number of revisions (it
>>>> shouldn't)
>>>> 
>>>> I'll focus on flattened JSON docs model
>>>> 
>>>> The following formula is used in previous calculation.
>>>> storage_size_per_document=mapping_table_size*number_of_revisions +
>>>> depth*number_of_paths*number_of_revisions +
>>>> number_of_paths*value_size*number_of_revisions
>>>> 
>>>> To clarify things a little bit I want to calculate space requirement
>> for
>>>> single revision this time.
>>>> mapping_table_size=field_name_size*(field_name_length+4(integer
>>>> size))=100 * (20 + 4(integer size)) = 2400 bytes
>>>> storage_size_per_document_per_revision_per_replica=mapping_table_size
>> +
>>>> depth*number_of_paths + value_size*number_of_paths =
>>>> 2400bytes + 10*1000+1000*100=112400bytes~=110 Kb
>>>> 
>>>> We definitely can reduce requirement for mapping table by adopting
>>>> rnewson's idea of a schema.
>>>> 
>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iilyak@apache.org> wrote:
>>>>> Hi Michael,
>>>>> 
>>>>>> For example, hears a crazy thought:
>>>>>> Map every distinct occurence of a key/value instance through a
>> crypto hash
>>>>>> function to get a set of hashes.
>>>>>> 
>>>>>> These can be be precomputed by Couch without any lookups in FDB.
>> These
>>>>>> will be spread all over kingdom come in FDB and not lend
>> themselves to
>>>>>> range search well.
>>>>>> 
>>>>>> So what you do is index them for frequency of occurring in the
>> same set.
>>>>>> In essence, you 'bucket them' statistically, and that bucket id
>> becomes a
>>>>>> key prefix. A crypto hash value can be copied into more than one
>> bucket.
>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
>>>>> 
>>>>>> When writing a document, Couch submits the list/array of
>> cryptohash values
>>>>>> it computed to FDB and gets back the corresponding  {val_id} (the
>> id with
>>>>>> the bucket prefixed).  This can get somewhat expensive if there's
>> always a
>>>>>> lot of app local cache misses.
>>>>>> 
>>>>>> A document's value is then a series of {val_id} arrays up to 100k
>> per
>>>>>> segment.
>>>>>> 
>>>>>> When retrieving a document, you get the val_ids, find the distinct
>> buckets
>>>>>> and min/max entries for this doc, and then parallel query each
>> bucket while
>>>>>> reconstructing the document.
>>>>> 
>>>>> Interesting idea. Let's try to think it through to see if we can
>> make it viable.
>>>>> Let's go through hypothetical example. Input data for the example:
>>>>> - 1M of documents
>>>>> - each document is around 10Kb
>>>>> - each document consists of 1K of unique JSON paths
>>>>> - each document has 100 unique JSON field names
>>>>> - every scalar value is 100 bytes
>>>>> - 10% of unique JSON paths for every document already stored in
>> database under different doc or different revision of the current one
>>>>> - we assume 3 independent copies for every key-value pair in FDB
>>>>> - our hash key size is 32 bytes
>>>>> - let's assume we can determine if key is already on the storage
>> without doing query
>>>>> - 1% of paths is in cache (unrealistic value, in real live the
>> percentage is lower)
>>>>> - every JSON field name is 20 bytes
>>>>> - every JSON path is 10 levels deep
>>>>> - document key prefix length is 50
>>>>> - every document has 10 revisions
>>>>> Let's estimate the storage requirements and size of data we need to
>> transmit. The calculations are not exact.
>>>>> 1. storage_size_per_document (we cannot estimate exact numbers since
>> we don't know how FDB stores it)
>>>>>  - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) =
>> 38Kb * 10 * 3 = 1140 Kb (11x)
>>>>> 2. number of independent keys to retrieve on document read
>> (non-range queries) per document
>>>>>  - 1K - (1K * 1%) = 990
>>>>> 3. number of range queries: 0
>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32
>> bytes) = 102 Kb (10x)
>>>>> 5. read latency (we use 2ms per read based on numbers from
>> https://apple.github.io/foundationdb/performance.html)
>>>>>    - sequential: 990*2ms = 1980ms
>>>>>    - range: 0
>>>>> Let's compare these numbers with initial proposal (flattened JSON
>> docs without global schema and without cache)
>>>>> 1. storage_size_per_document
>>>>>  - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
>>>>>  - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
>>>>>  - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 =
>> 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
>>>>> 2. number of independent keys to retrieve: 0-2 (depending on index
>> structure)
>>>>> 3. number of range queries: 1 (1001 of keys in result)
>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb
>> (2.4x)
>>>>> 5. read latency (we use 2ms per read based on numbers from
>> https://apple.github.io/foundationdb/performance.html and estimate range
>> read performance based on numbers from
>> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test
>> )
>>>>>  - range read performance: Given read performance is about 305,000
>> reads/second and range performance 3,600,000 keys/second we estimate range
>> performance to be 11.8x compared to read performance. If read performance
>> is 2ms than range performance is 0.169ms (which is hard to believe).
>>>>>  - sequential: 2 * 2 = 4ms
>>>>>  - range: 0.169
>>>>> 
>>>>> It looks like we are dealing with a tradeoff:
>>>>> - Map every distinct occurrence of a key/value instance through a
>> crypto hash:
>>>>>  - 5.39x more disk space efficient
>>>>>  - 474x slower
>>>>> - flattened JSON model
>>>>>  - 5.39x less efficient in disk space
>>>>>  - 474x faster
>>>>> 
>>>>> In any case this unscientific exercise was very helpful. Since it
>> uncovered the high cost in terms of disk space. 59.3x of original disk size
>> is too much IMO.
>>>>> 
>>>>> Are the any ways we can make Michael's model more performant?
>>>>> 
>>>>> Also I don't quite understand few aspects of the global hash table
>> proposal:
>>>>> 
>>>>> 1. > - Map every distinct occurence of a key/value instance through
>> a crypto hash function to get a set of hashes.
>>>>> I think we are talking only about scalar values here? I.e.
>> `"#/foo.bar.baz": 123`
>>>>> Since I don't know how we can make it work for all possible JSON
>> paths `{"foo": {"bar": {"size": 12, "baz": 123}}}":
>>>>> - foo
>>>>> - foo.bar
>>>>> - foo.bar.baz
>>>>> 
>>>>> 2. how to delete documents
>>>>> 
>>>>> Best regards,
>>>>> ILYA
>>>>> 
>>>>> 
>>>>> On 2019/01/30 23:33:22, Michael Fair <michael@daclubhouse.net>
>> wrote:
>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocolosk@apache.org
>> wrote:
>>>>>> 
>>>>>>> Hi Michael,
>>>>>>> 
>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
>>>>>>> 
>>>>>>> Yes that’s definitely one way to address storage of edit
>> conflicts. I
>>>>>>> think there are other, more compact representations that we can
>> explore if
>>>>>>> we have this “exploded” data model where each scalar value
maps
>> to an
>>>>>>> individual KV pair.
>>>>>> 
>>>>>> 
>>>>>> I agree, as I mentioned on the original thread, I see a scheme,
>> that
>>>>>> handles both conflicts and revisions, where you only have to store
>> the most
>>>>>> recent change to a field.  Like you suggested, multiple revisions
>> can share
>>>>>> a key.  Which in my mind's eye further begs the conflicts/revisions
>>>>>> discussion along with the working within the limits discussion
>> because it
>>>>>> seems to me they are all intrinsically related as a "feature".
>>>>>> 
>>>>>> Saying 'We'll break documents up into roughly 80k segments', then
>> trying to
>>>>>> overlay some kind of field sharing scheme for revisions/conflicts
>> doesn't
>>>>>> seem like it will work.
>>>>>> 
>>>>>> I probably should have left out the trivial fix proposal as I
>> don't think
>>>>>> it's a feasible solution to actually use.
>>>>>> 
>>>>>> The comment is more regarding that I do not see how this thread
>> can escape
>>>>>> including how to store/retrieve conflicts/revisions.
>>>>>> 
>>>>>> For instance, the 'doc as individual fields' proposal lends itself
>> to value
>>>>>> sharing across mutiple documents (and I don't just mean revisions
>> of the
>>>>>> same doc, I mean the same key/value instance could be shared for
>> every
>>>>>> document).
>>>>>> However that's not really relevant if we're not considering the
>> amount of
>>>>>> shared information across documents in the storage scheme.
>>>>>> 
>>>>>> Simply storing documents in <100k segments (perhaps in some kind
of
>>>>>> compressed binary representation) to deal with that FDB limit
>> seems fine.
>>>>>> The only reason to consider doing something else is because of its
>> impact
>>>>>> to indexing, searches, reduce functions, revisions, on-disk size
>> impact,
>>>>>> etc.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>>> I'm assuming the process will flatten the key paths of the
>> document into
>>>>>>> an array and then request the value of each key as multiple
>> parallel
>>>>>>> queries against FDB at once
>>>>>>> 
>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s trying
>> to design a
>>>>>>> model which allows the retrieval of a document with a single
>> range read,
>>>>>>> which is a good goal in my opinion.
>>>>>>> 
>>>>>> 
>>>>>> I am not sure I agree.
>>>>>> 
>>>>>> Think of bitTorrent, a single range read should pull back the
>> structure of
>>>>>> the document (the pieces to fetch), but not necessarily the whole
>> document.
>>>>>> 
>>>>>> What if you already have a bunch of pieces in common with other
>> documents
>>>>>> locally (a repeated header/footer/ or type for example); and you
>> only need
>>>>>> to get a few pieces of data you don't already have?
>>>>>> 
>>>>>> The real goal to Couch I see is to treat your document set like the
>>>>>> collection of structured information that it is.  In some respects
>> like an
>>>>>> extension of your application's heap space for structured objects
>> and
>>>>>> efficiently querying that collection to get back subsets of the
>> data.
>>>>>> 
>>>>>> Otherwise it seems more like a slightly upgraded file system plus
>> a fancy
>>>>>> grep/find like feature...
>>>>>> 
>>>>>> The best way I see to unlock more features/power is to a move
>> towards a
>>>>>> more granular and efficient way to store and retrieve the scalar
>> values...
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> For example, hears a crazy thought:
>>>>>> Map every distinct occurence of a key/value instance through a
>> crypto hash
>>>>>> function to get a set of hashes.
>>>>>> 
>>>>>> These can be be precomputed by Couch without any lookups in FDB.
>> These
>>>>>> will be spread all over kingdom come in FDB and not lend
>> themselves to
>>>>>> range search well.
>>>>>> 
>>>>>> So what you do is index them for frequency of occurring in the
>> same set.
>>>>>> In essence, you 'bucket them' statistically, and that bucket id
>> becomes a
>>>>>> key prefix. A crypto hash value can be copied into more than one
>> bucket.
>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
>>>>>> 
>>>>>> When writing a document, Couch submits the list/array of
>> cryptohash values
>>>>>> it computed to FDB and gets back the corresponding  {val_id} (the
>> id with
>>>>>> the bucket prefixed).  This can get somewhat expensive if there's
>> always a
>>>>>> lot of app local cache misses.
>>>>>> 
>>>>>> 
>>>>>> A document's value is then a series of {val_id} arrays up to 100k
>> per
>>>>>> segment.
>>>>>> 
>>>>>> When retrieving a document, you get the val_ids, find the distinct
>> buckets
>>>>>> and min/max entries for this doc, and then parallel query each
>> bucket while
>>>>>> reconstructing the document.
>>>>>> 
>>>>>> The values returned from the buckets query are the key/value
>> strings
>>>>>> required to reassemble this document.
>>>>>> 
>>>>>> 
>>>>>> ----------
>>>>>> I put this forward primarily to hilite the idea that trying to
>> match the
>>>>>> storage representation of documents in a straight forward way to
>> FDB keys
>>>>>> to reduce query count might not be the most performance oriented
>> approach.
>>>>>> 
>>>>>> I'd much prefer a storage approach that reduced data duplication
>> and
>>>>>> enabled fast sub-document queries.
>>>>>> 
>>>>>> 
>>>>>> This clearly falls in the realm of what people want the 'use case'
>> of Couch
>>>>>> to be/become.  By giving Couch more access to sub-document
>> queries, I could
>>>>>> eventually see queries as complicated as GraphQL submitted to
>> Couch and
>>>>>> pulling back ad-hoc aggregated data across multiple documents in
a
>> single
>>>>>> application layer request.
>>>>>> 
>>>>>> Hehe - one way to look at the database of Couch documents is that
>> they are
>>>>>> all conflict revisions of the single root empty document.   What
I
>> mean be
>>>>>> this is consider thinking of the entire document store as one
>> giant DAG of
>>>>>> key/value pairs. How even separate documents are still typically
>> related to
>>>>>> each other.  For most applications there is a tremendous amount of
>> data
>>>>>> redundancy between docs and especially between revisions of those
>> docs...
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> And all this is a long way of saying "I think there could be a lot
>> of value
>>>>>> in assuming documents are 'assembled' from multiple queries to
>> FDB, with
>>>>>> local caching, instead of simply retrieved"
>>>>>> 
>>>>>> Thanks, I hope I'm not the only outlier here thinking this way!?
>>>>>> 
>>>>>> Mike :-)
>>>>>> 
>>>>> 
>> 


Mime
View raw message