couchdb-dev mailing list archives

From Robert Samuel Newson <rnew...@apache.org>
Subject Re: [DISCUSS] : things we need to solve/decide : storing JSON documents
Date Tue, 19 Feb 2019 21:39:14 GMT
Interesting suggestion; obviously the details might turn out to be the wrong kind of fun.

Somewhere above I suggested this would be something we could change over time and even use
different approaches for different documents within the same database. This is the long way
of saying there are multiple ways to do this, each with advantages and none without disadvantages.

I don’t think adding a layer of abstraction is the right move just yet; I think we should
continue to find consensus on one answer to this question (and the related ones in other threads)
for the first release. It’s easy to say “we can change it later”, of course. We can, though it
would be a chunk of work in the context of something that already works, and I’ve rarely seen
anyone sign up for that.

I’m fine with the first proposal from Adam, where the keys are tuples of key parts pointing
at terminal values. To make it easier for the first version, I would exclude optimisations
like deduplication or the Directory aliasing or the schema thing that I suggested and that
Ilya incorporated a variant of in a follow-up post. We’d accept that there are limits on
the sizes of documents, including the awkward-to-express one about property depth.

Stepping back, I’m not seeing any essential improvement over Adam’s original proposal
besides the few corrections and clarifications made by various authors. Could we start an
RFC based on Adam’s original proposal on document body, revision tree and index storage?
We could then have PRs against that for each additional optimisation (one person’s optimisation
being another person’s needless complication).

If I’ve missed some genuine advance on the original proposal in this long thread, please
call it out for me.

B.

> On 19 Feb 2019, at 21:15, Benjamin Anderson <banjiewen@apache.org> wrote:
> 
> As is evident by the length of this thread, there's a pretty big
> design space to cover here, and it seems unlikely we'll have arrived
> at a "correct" solution even by the time this thing ships. Perhaps it
> would be worthwhile to treat the in-FDB representation of data as a
> first-class abstraction and support multiple representations
> simultaneously?
> 
> Obviously there's no such thing as a zero-cost abstraction - and I've
> not thought very hard about how far up the stack the document
> representation would need to leak - but supporting different layouts
> (primarily, as Adam points out, on the document body itself) might
> prove interesting and useful. I'm sure there are folks interested in a
> column-shaped CouchDB, for example.
> 
> --
> b
> 
> On Tue, Feb 19, 2019 at 11:39 AM Robert Newson <rnewson@apache.org> wrote:
>> 
>> Good points on revtree, I agree with you we should store that intelligently to gain the
>> benefits you mentioned.
>> 
>> --
>>  Robert Samuel Newson
>>  rnewson@apache.org
>> 
>> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote:
>>> I do not think we should store the revtree as a blob. The design where
>>> each edit branch is its own KV should save on network IO and CPU cycles
>>> for normal updates. We’ve performed too many heroics to keep
>>> couch_key_tree from stalling entire databases when trying to update a
>>> single document with a wide revision tree, I would much prefer to ignore
>>> other edit branches entirely when all we’re doing is extending one of
>>> them.
>>> 
>>> I also do not think we should store JSON documents as blobs, but it’s a
>>> closer call. Some of my reasoning for preferring the exploded path
>>> design:
>>> 
>>> - it lends itself nicely to sub-document operations, for which Jan
>>> crafted an RFC last year: https://github.com/apache/couchdb/issues/1559
>>> - it optimizes the creation of Mango indexes on existing databases since
>>> we only need to retrieve the value(s) we want to index
>>> - it optimizes Mango queries that use field selectors
>>> - anyone who wanted to try their hand at GraphQL will find it very
>>> handy: https://github.com/apache/couchdb/issues/1499
>>> - looking further ahead, it lets us play with smarter leaf value types
>>> like Counters (yes I’m still on the CRDT bandwagon, sorry)
>>> 
>>> A few comments on the thread:
>>> 
>>>>>> * Most document bodies are probably going to be smaller than 100k. So in
>>>>>> the majority of cases it would be one write / one read to update and fetch
>>>>>> the document body.
>>> 
>>> We should test, but I expect reading 50KB of data in a range query is
>>> almost as efficient as reading a single 50 KB value. Similarly, writes
>>> to a contiguous set of keys should be quite efficient.
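
(For concreteness, a rough sketch of the two access patterns being compared here, using the
Python FoundationDB bindings; the subspace names are illustrative assumptions, not part of
either proposal.)

import fdb
fdb.api_version(600)
db = fdb.open()

# Illustrative subspaces; the actual layout is what this thread is debating.
blob   = fdb.Subspace(('db_foo', 'blob'))    # doc body stored as one value
fields = fdb.Subspace(('db_foo', 'by_id'))   # doc body exploded into (path) -> value

@fdb.transactional
def read_blob(tr, doc_id):
    # one point read returns the whole body as a single value
    return tr[blob.pack((doc_id,))]

@fdb.transactional
def read_exploded(tr, doc_id):
    # one range read returns every (path, value) pair for the doc, in key order
    return list(tr.get_range_startswith(fields.pack((doc_id,))))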
>>> 
>>> I am concerned about the overhead of the repeated field paths in the
>>> keys with the exploded path option in the absence of key prefix
>>> compression. That would be my main reason to acquiesce and throw away
>>> all the document structure.
>>> 
>>> Adam
>>> 
>>>> On Feb 19, 2019, at 12:04 PM, Robert Newson <rnewson@apache.org> wrote:
>>>> 
>>>> I like the idea that we'd reuse the same pattern (but perhaps not the same _code_) for
>>>> doc bodies, revtree and attachments.
>>>> 
>>>> I hope we still get to delete couch_key_tree.erl, though.
>>>> 
>>>> --
>>>> Robert Samuel Newson
>>>> rnewson@apache.org
>>>> 
>>>> On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
>>>>> I like the idea from a “trying a simple thing first” perspective, but
>>>>> Nick’s points below are especially convincing to go with this for now.
>>>>> 
>>>>> Best
>>>>> Jan
>>>>> —
>>>>> 
>>>>>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc <vatamane@gmail.com> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> Sorry for jumping in so late, I was following from the sidelines mostly. A
>>>>>> lot of good discussion happening and am excited about the possibilities
>>>>>> here.
>>>>>> 
>>>>>> I do like the simpler "chunking" approach for a few reasons:
>>>>>> 
>>>>>> * Most document bodies are probably going to be smaller than 100k. So in
>>>>>> the majority of cases it would be one write / one read to update and fetch
>>>>>> the document body.
>>>>>> 
>>>>>> * We could reuse the chunking code for attachment handling and possibly
>>>>>> revision key trees. So it's the general pattern of uploading chunks to some
>>>>>> prefix and, when finished, flipping an atomic toggle to make them current.
>>>>>> 
>>>>>> * Do the same thing with revision trees and we could re-use the revision
>>>>>> tree manipulation logic. That is, the key tree in most cases would be small
>>>>>> enough to fit in 100k, but if they get huge, they'd get chunked. This would
>>>>>> allow us to reuse all the battle-tested couch_key_tree code mostly as is.
>>>>>> We even have property tests for it
>>>>>> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
>>>>>> 
>>>>>> * It removes the need to explain the max exploded path length limitation
>>>>>> to customers.
>>>>>> 
>>>>>> Cheers,
>>>>>> -Nick
>>>>>> 
>>>>>> 
>>>>>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson <rnewson@apache.org> wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> An alternative storage model that we should seriously consider is to follow
>>>>>>> our current approach in couch_file et al. Specifically, that the document
>>>>>>> _body_ is stored as an uninterpreted binary value. This would be much like
>>>>>>> the obvious plan for attachment storage; a key prefix that identifies the
>>>>>>> database and document, with the final item of that key tuple being an
>>>>>>> incrementing integer. Each of those keys has a binary value of up to 100k.
>>>>>>> Fetching all values with that key prefix, in fdb's natural ordering, will
>>>>>>> yield the full document body, which can be JSON decoded for further
>>>>>>> processing.
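
(A minimal sketch of the chunked-body idea described above, using the Python FoundationDB
bindings; the subspace layout and chunk-size handling are illustrative assumptions, not part
of the proposal.)

import fdb
fdb.api_version(600)
db = fdb.open()

docs = fdb.Subspace(('db_foo', 'by_id'))   # illustrative prefix for one database
CHUNK = 100_000                            # FDB values are limited to 100,000 bytes

@fdb.transactional
def write_body(tr, doc_id, body_bytes):
    # replace any previous chunks, then write numbered <=100KB chunks under the doc's prefix
    tr.clear_range_startswith(docs.pack((doc_id,)))
    for i in range(0, len(body_bytes), CHUNK):
        tr[docs.pack((doc_id, i // CHUNK))] = body_bytes[i:i + CHUNK]

@fdb.transactional
def read_body(tr, doc_id):
    # a single range read in FDB's natural key order reassembles the full body
    return b''.join(v for _, v in tr.get_range_startswith(docs.pack((doc_id,))))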
>>>>>>> 
>>>>>>> I like this idea, and I like Adam's original proposal to explode documents
>>>>>>> into property paths. I have a slight preference for the simplicity of the
>>>>>>> idea in the previous paragraph, not least because it's close to what we do
>>>>>>> today. I also think it will be possible to migrate to alternative storage
>>>>>>> models in future, and foundationdb's transaction support means we can do
>>>>>>> this migration seamlessly should we come to it.
>>>>>>> 
>>>>>>> I'm very interested in knowing if anyone else is interested in going this
>>>>>>> simple, or considers it a wasted opportunity relative to the 'exploded'
>>>>>>> path.
>>>>>>> 
>>>>>>> B.
>>>>>>> 
>>>>>>> --
>>>>>>> Robert Samuel Newson
>>>>>>> rnewson@apache.org
>>>>>>> 
>>>>>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
>>>>>>>> I've been remiss here in not posting the data model ideas that IBM worked
>>>>>>>> up while we were thinking about using FoundationDB, so I'm posting it now.
>>>>>>>> This is Adam Kocoloski's original work, I am just transcribing it, and this
>>>>>>>> is the context that the folks from the IBM side came in with, for full
>>>>>>>> disclosure.
>>>>>>>> 
>>>>>>>> Basics
>>>>>>>> 
>>>>>>>> 1. All CouchDB databases are inside a Directory
>>>>>>>> 2. Each CouchDB database is a Directory within that Directory
>>>>>>>> 3. It's possible to list all subdirectories of a Directory, so
>>>>>>>> `_all_dbs` is the list of directories from 1.
>>>>>>>> 4. Each Directory representing a CouchDB database has several Subspaces;
>>>>>>>> 4a. by_id/ doc subspace: actual document contents
>>>>>>>> 4b. by_seq/versionstamp subspace: for the _changes feed
>>>>>>>> 4c. index_definitions, indexes, ...
>>>>>>>> 
>>>>>>>> JSON Mapping
>>>>>>>> 
>>>>>>>> A hierarchical JSON object naturally maps to multiple KV pairs in FDB:
>>>>>>>> 
>>>>>>>> {
>>>>>>>>  “_id”: “foo”,
>>>>>>>>  “owner”: “bob”,
>>>>>>>>  “mylist”: [1,3,5],
>>>>>>>>  “mymap”: {
>>>>>>>>      “blue”: “#0000FF”,
>>>>>>>>      “red”: “#FF0000”
>>>>>>>>  }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> maps to
>>>>>>>> 
>>>>>>>> (“foo”, “owner”) = “bob”
>>>>>>>> (“foo”, “mylist”, 0) = 1
>>>>>>>> (“foo”, “mylist”, 1) = 3
>>>>>>>> (“foo”, “mylist”, 2) = 5
>>>>>>>> (“foo”, “mymap”, “blue”) = “#0000FF”
>>>>>>>> (“foo”, “mymap”, “red”) = “#FF0000”
>>>>>>>> 
>>>>>>>> NB: this means that the 100KB limit applies to individual leaves in the
>>>>>>>> JSON object, not the entire doc
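
(A minimal sketch of the exploded mapping above, using the Python FoundationDB bindings; the
subspace name and the tuple-encoding of leaf values are assumptions made for illustration.)

import fdb
fdb.api_version(600)
db = fdb.open()

by_id = fdb.Subspace(('db_foo', 'by_id'))   # illustrative doc subspace

def explode(prefix, node):
    # walk the JSON tree, yielding (key_tuple, scalar) for every terminal value
    if isinstance(node, dict):
        for k, v in node.items():
            yield from explode(prefix + (k,), v)
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from explode(prefix + (i,), v)
    else:
        yield prefix, node

@fdb.transactional
def write_doc(tr, doc):
    doc_id = doc['_id']
    body = {k: v for k, v in doc.items() if k != '_id'}
    for path, value in explode((doc_id,), body):
        # each scalar becomes one KV; only that scalar is bound by the 100KB value limit
        tr[by_id.pack(path)] = fdb.tuple.pack((value,))

write_doc(db, {'_id': 'foo', 'owner': 'bob', 'mylist': [1, 3, 5],
               'mymap': {'blue': '#0000FF', 'red': '#FF0000'}})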
>>>>>>>> 
>>>>>>>> Edit Conflicts
>>>>>>>> 
>>>>>>>> We need to account for the presence of conflicts in various levels of
>>>>>>>> the doc due to replication.
>>>>>>>> 
>>>>>>>> Proposal is to create a special value indicating that the subtree below
>>>>>>>> our current cursor position is in an unresolvable conflict. Then add
>>>>>>>> additional KV pairs below to describe the conflicting entries.
>>>>>>>> 
>>>>>>>> KV data model allows us to store these efficiently and minimize
>>>>>>>> duplication of data:
>>>>>>>> 
>>>>>>>> A document with these two conflicts:
>>>>>>>> 
>>>>>>>> {
>>>>>>>>  “_id”: “foo”,
>>>>>>>>  “_rev”: “1-abc”,
>>>>>>>>  “owner”: “alice”,
>>>>>>>>  “active”: true
>>>>>>>> }
>>>>>>>> {
>>>>>>>>  “_id”: “foo”,
>>>>>>>>  “_rev”: “1-def”,
>>>>>>>>  “owner”: “bob”,
>>>>>>>>  “active”: true
>>>>>>>> }
>>>>>>>> 
>>>>>>>> could be stored thus:
>>>>>>>> 
>>>>>>>> (“foo”, “active”) = true
>>>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>>>> (“foo”, “owner”, “1-def”) = “bob”
>>>>>>>> 
>>>>>>>> So long as `kCONFLICT` is set at the top of the conflicting subtree this
>>>>>>>> representation can handle conflicts of different data types as well.
>>>>>>>> 
>>>>>>>> Missing fields need to be handled explicitly:
>>>>>>>> 
>>>>>>>> {
>>>>>>>> “_id”: “foo”,
>>>>>>>> “_rev”: “1-abc”,
>>>>>>>> “owner”: “alice”,
>>>>>>>> “active”: true
>>>>>>>> }
>>>>>>>> 
>>>>>>>> {
>>>>>>>> “_id”: “foo”,
>>>>>>>> “_rev”: “1-def”,
>>>>>>>> “owner”: {
>>>>>>>>  “name”: “bob”,
>>>>>>>>  “email”: “bob@example.com”
>>>>>>>> }
>>>>>>>> }
>>>>>>>> 
>>>>>>>> could be stored thus:
>>>>>>>> 
>>>>>>>> (“foo”, “active”) = kCONFLICT
>>>>>>>> (“foo”, “active”, “1-abc”) = true
>>>>>>>> (“foo”, “active”, “1-def”) = kMISSING
>>>>>>>> (“foo”, “owner”) = kCONFLICT
>>>>>>>> (“foo”, “owner”, “1-abc”) = “alice”
>>>>>>>> (“foo”, “owner”, “1-def”, “name”) = “bob”
>>>>>>>> (“foo”, “owner”, “1-def”, “email”) = ...
>>>>>>>> 
>>>>>>>> Revision Metadata
>>>>>>>> 
>>>>>>>> * CouchDB uses a hash history for revisions
>>>>>>>> ** Each edit is identified by the hash of the content of the edit
>>>>>>>> including the base revision against which it was applied
>>>>>>>> ** Individual edit branches are bounded in length but the number of
>>>>>>>> branches is potentially unbounded
>>>>>>>> 
>>>>>>>> * Size limits preclude us from storing the entire key tree as a single
>>>>>>>> value; in pathological situations the tree could exceed 100KB (each entry
>>>>>>>> is > 16 bytes)
>>>>>>>> 
>>>>>>>> * Store each edit branch as a separate KV including deleted status in a
>>>>>>>> special subspace
>>>>>>>> 
>>>>>>>> * Structure key representation so that “winning” revision can be
>>>>>>>> automatically retrieved in a limit=1 key range operation
>>>>>>>> 
>>>>>>>> (“foo”, “_meta”, “deleted=false”, 1, “def”) = []
>>>>>>>> (“foo”, “_meta”, “deleted=false”, 4, “bif”) = [“3-baz”,”2-bar”,”1-foo”]  <-- winner
>>>>>>>> (“foo”, “_meta”, “deleted=true”, 3, “abc”) = [“2-bar”, “1-foo”]
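
(A minimal sketch of the limit=1 retrieval described above, using the Python FoundationDB
bindings; it assumes a key encoding in which the winning edit branch sorts last within the
document's range, which is an illustrative choice rather than the proposal's exact encoding.)

import fdb
fdb.api_version(600)
db = fdb.open()

revs = fdb.Subspace(('db_foo', '_meta'))   # illustrative subspace for edit branches

@fdb.transactional
def winning_rev(tr, doc_id):
    # assuming the key encoding sorts the winning edit branch last within the doc's
    # range, the winner comes back from a single limit=1 reverse range read
    r = revs[doc_id].range()
    for k, v in tr.get_range(r.start, r.stop, limit=1, reverse=True):
        return revs[doc_id].unpack(k), v
    return None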
>>>>>>>> 
>>>>>>>> Changes Feed
>>>>>>>> 
>>>>>>>> * FDB supports a concept called a versionstamp — a 10 byte, unique,
>>>>>>>> monotonically (but not sequentially) increasing value for each committed
>>>>>>>> transaction. The first 8 bytes are the committed version of the database.
>>>>>>>> The last 2 bytes are monotonic in the serialization order for
>>>>>>>> transactions.
>>>>>>>> 
>>>>>>>> * A transaction can specify a particular index into a key where the
>>>>>>>> following 10 bytes will be overwritten by the versionstamp at commit time
>>>>>>>> 
>>>>>>>> * A subspace keyed on versionstamp naturally yields a _changes feed
>>>>>>>> 
>>>>>>>> by_seq subspace
>>>>>>>> (“versionstamp1”) = (“foo”, “1-abc”)
>>>>>>>> (“versionstamp4”) = (“bar”, “4-def”)
>>>>>>>> 
>>>>>>>> by_id subspace
>>>>>>>> (“bar”, “_vsn”) = “versionstamp4”
>>>>>>>> ...
>>>>>>>> (“foo”, “_vsn”) = “versionstamp1”
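
(A minimal sketch of reading such a by_seq subspace as a _changes feed, using the Python
FoundationDB bindings; the subspace name is an illustrative assumption, and writing the
versionstamped keys themselves is omitted.)

import fdb
fdb.api_version(600)
db = fdb.open()

by_seq = fdb.Subspace(('db_foo', 'by_seq'))   # illustrative _changes subspace

@fdb.transactional
def changes_since(tr, last_seq_key, limit=100):
    # versionstamped keys sort in commit order, so the feed is a plain range read
    # starting just after the last sequence key the client has seen; each value
    # carries the (doc id, rev) pair in the proposal above
    begin = fdb.KeySelector.first_greater_than(last_seq_key)
    end = by_seq.range().stop
    return list(tr.get_range(begin, end, limit=limit))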
>>>>>>>> 
>>>>>>>> JSON Indexes
>>>>>>>> 
>>>>>>>> * “Mango” JSON indexes are defined by
>>>>>>>> ** a list of field names, each of which may be nested,
>>>>>>>> ** an optional partial_filter_selector which constrains the set of docs
>>>>>>>> that contribute
>>>>>>>> ** an optional name defined by the ddoc field (the name is auto-generated
>>>>>>>> if not supplied)
>>>>>>>> 
>>>>>>>> * Store index definitions in a single subspace to aid query planning
>>>>>>>> ** ((person,name), title, email) = (“name-title-email”, “{“student”: true}”)
>>>>>>>> ** Store the values for each index in a dedicated subspace, adding the
>>>>>>>> document ID as the last element in the tuple
>>>>>>>> *** (“rosie revere”, “engineer”, “rosie@example.com", “foo”) = null
>>>>>>>> 
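
(A minimal sketch of writing and querying one such Mango index entry, using the Python
FoundationDB bindings; the index subspace name and field set are illustrative assumptions.)

import fdb
fdb.api_version(600)
db = fdb.open()

idx = fdb.Subspace(('db_foo', 'indexes', 'name-title-email'))   # illustrative index subspace

@fdb.transactional
def index_doc(tr, doc_id, name, title, email):
    # index key = indexed field values in order, with the doc id as the last tuple
    # element; the value itself can stay empty
    tr[idx.pack((name, title, email, doc_id))] = b''

@fdb.transactional
def lookup_by_name(tr, name):
    # a prefix range read over the first indexed field returns matching doc ids
    return [idx.unpack(k)[-1] for k, _ in tr.get_range_startswith(idx.pack((name,)))]

index_doc(db, 'foo', 'rosie revere', 'engineer', 'rosie@example.com')
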
>>>>>>>> B.
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Robert Samuel Newson
>>>>>>>> rnewson@apache.org
>>>>>>>> 
>>>>>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
>>>>>>>>> 
>>>>>>>>> I want to fix previous mistakes. I made two mistakes in previous
>>>>>>>>> calculations:
>>>>>>>>> - I used 1Kb as the base size for calculating the expansion factor
>>>>>>>>> (although we don't know the exact size of the original document)
>>>>>>>>> - The expansion factor calculation included the number of revisions (it
>>>>>>>>> shouldn't)
>>>>>>>>> 
>>>>>>>>> I'll focus on flattened JSON docs model
>>>>>>>>> 
>>>>>>>>> The following formula was used in the previous calculation:
>>>>>>>>> storage_size_per_document = mapping_table_size*number_of_revisions +
>>>>>>>>> depth*number_of_paths*number_of_revisions +
>>>>>>>>> number_of_paths*value_size*number_of_revisions
>>>>>>>>> 
>>>>>>>>> To clarify things a little bit I want to calculate the space requirement
>>>>>>>>> for a single revision this time.
>>>>>>>>> mapping_table_size = number_of_field_names*(field_name_length + 4(integer
>>>>>>>>> size)) = 100 * (20 + 4) = 2400 bytes
>>>>>>>>> storage_size_per_document_per_revision_per_replica = mapping_table_size +
>>>>>>>>> depth*number_of_paths + value_size*number_of_paths =
>>>>>>>>> 2400 bytes + 10*1000 + 1000*100 = 112400 bytes ~= 110 Kb
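
(Restating the arithmetic above in Python, with the figures from the example inputs; nothing
here is new data, it is just the same formula worked through.)

# Working the numbers above as stated (values in bytes):
field_names       = 100    # unique JSON field names per document
field_name_length = 20
int_size          = 4
depth             = 10     # JSON paths are 10 levels deep
number_of_paths   = 1000   # 1K unique JSON paths per document
value_size        = 100

mapping_table_size = field_names * (field_name_length + int_size)        # 2,400
per_rev_per_replica = (mapping_table_size
                       + depth * number_of_paths                         # 10,000
                       + value_size * number_of_paths)                   # 100,000
print(mapping_table_size, per_rev_per_replica, per_rev_per_replica / 1024)   # 2400 112400 ~110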
>>>>>>>>> 
>>>>>>>>> We can definitely reduce the mapping table requirement by adopting
>>>>>>>>> rnewson's idea of a schema.
>>>>>>>>> 
>>>>>>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov <iilyak@apache.org> wrote:
>>>>>>>>>> Hi Michael,
>>>>>>>>>> 
>>>>>>>>>>> For example, here's a crazy thought:
>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a crypto
>>>>>>>>>>> hash function to get a set of hashes.
>>>>>>>>>>> 
>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB. These will
>>>>>>>>>>> be spread all over kingdom come in FDB and not lend themselves to range
>>>>>>>>>>> search well.
>>>>>>>>>>> 
>>>>>>>>>>> So what you do is index them for frequency of occurring in the same set.
>>>>>>>>>>> In essence, you 'bucket them' statistically, and that bucket id becomes a
>>>>>>>>>>> key prefix. A crypto hash value can be copied into more than one bucket.
>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
>>>>>>>>>>> 
>>>>>>>>>>> When writing a document, Couch submits the list/array of cryptohash
>>>>>>>>>>> values it computed to FDB and gets back the corresponding {val_id} (the
>>>>>>>>>>> id with the bucket prefixed).  This can get somewhat expensive if there's
>>>>>>>>>>> always a lot of app local cache misses.
>>>>>>>>>>> 
>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k per
>>>>>>>>>>> segment.
>>>>>>>>>>> 
>>>>>>>>>>> When retrieving a document, you get the val_ids, find the distinct
>>>>>>>>>>> buckets and min/max entries for this doc, and then parallel query each
>>>>>>>>>>> bucket while reconstructing the document.
>>>>>>>>>> 
>>>>>>>>>> Interesting idea. Let's try to think it through to see if we can make it
>>>>>>>>>> viable.
>>>>>>>>>> Let's go through a hypothetical example. Input data for the example:
>>>>>>>>>> - 1M of documents
>>>>>>>>>> - each document is around 10Kb
>>>>>>>>>> - each document consists of 1K of unique JSON paths
>>>>>>>>>> - each document has 100 unique JSON field names
>>>>>>>>>> - every scalar value is 100 bytes
>>>>>>>>>> - 10% of unique JSON paths for every document are already stored in the
>>>>>>>>>> database under a different doc or a different revision of the current one
>>>>>>>>>> - we assume 3 independent copies for every key-value pair in FDB
>>>>>>>>>> - our hash key size is 32 bytes
>>>>>>>>>> - let's assume we can determine if a key is already in storage without
>>>>>>>>>> doing a query
>>>>>>>>>> - 1% of paths is in cache (unrealistic value, in real life the percentage
>>>>>>>>>> is lower)
>>>>>>>>>> - every JSON field name is 20 bytes
>>>>>>>>>> - every JSON path is 10 levels deep
>>>>>>>>>> - document key prefix length is 50
>>>>>>>>>> - every document has 10 revisions
>>>>>>>>>> Let's estimate the storage requirements and size of data we need to
>>>>>>>>>> transmit. The calculations are not exact.
>>>>>>>>>> 1. storage_size_per_document (we cannot estimate exact numbers since we
>>>>>>>>>> don't know how FDB stores it)
>>>>>>>>>>  - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) =
>>>>>>>>>>    38Kb * 10 * 3 = 1140 Kb (11x)
>>>>>>>>>> 2. number of independent keys to retrieve on document read (non-range
>>>>>>>>>> queries) per document
>>>>>>>>>>  - 1K - (1K * 1%) = 990
>>>>>>>>>> 3. number of range queries: 0
>>>>>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) =
>>>>>>>>>>    102 Kb (10x)
>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from
>>>>>>>>>> https://apple.github.io/foundationdb/performance.html)
>>>>>>>>>>  - sequential: 990*2ms = 1980ms
>>>>>>>>>>  - range: 0
>>>>>>>>>> Let's compare these numbers with the initial proposal (flattened JSON
>>>>>>>>>> docs without global schema and without cache)
>>>>>>>>>> 1. storage_size_per_document
>>>>>>>>>>  - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
>>>>>>>>>>  - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes
>>>>>>>>>>  - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K =
>>>>>>>>>>    1976 Kb * 3 = 5930 Kb (59.3x)
>>>>>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on index
>>>>>>>>>> structure)
>>>>>>>>>> 3. number of range queries: 1 (1001 keys in result)
>>>>>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)
>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from
>>>>>>>>>> https://apple.github.io/foundationdb/performance.html and estimate range
>>>>>>>>>> read performance based on numbers from
>>>>>>>>>> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
>>>>>>>>>>  - range read performance: Given read performance is about 305,000
>>>>>>>>>>    reads/second and range performance 3,600,000 keys/second, we estimate
>>>>>>>>>>    range performance to be 11.8x compared to read performance. If read
>>>>>>>>>>    performance is 2ms then range performance is 0.169ms (which is hard to
>>>>>>>>>>    believe).
>>>>>>>>>> - sequential: 2 * 2 = 4ms
>>>>>>>>>> - range: 0.169
>>>>>>>>>> 
>>>>>>>>>> It looks like we are dealing with a tradeoff:
>>>>>>>>>> - Map every distinct occurrence of a key/value instance through a
>>>>>>>>>> crypto hash:
>>>>>>>>>> - 5.39x more disk space efficient
>>>>>>>>>> - 474x slower
>>>>>>>>>> - flattened JSON model
>>>>>>>>>> - 5.39x less efficient in disk space
>>>>>>>>>> - 474x faster
>>>>>>>>>> 
>>>>>>>>>> In any case this unscientific exercise was very helpful, since it
>>>>>>>>>> uncovered the high cost in terms of disk space. 59.3x of the original
>>>>>>>>>> disk size is too much IMO.
>>>>>>>>>> 
>>>>>>>>>> Are there any ways we can make Michael's model more performant?
>>>>>>>>>> 
>>>>>>>>>> Also I don't quite understand a few aspects of the global hash table
>>>>>>>>>> proposal:
>>>>>>>>>> 
>>>>>>>>>> 1. > - Map every distinct occurrence of a key/value instance through a
>>>>>>>>>> crypto hash function to get a set of hashes.
>>>>>>>>>> I think we are talking only about scalar values here? I.e.
>>>>>>>>>> `"#/foo.bar.baz": 123`
>>>>>>>>>> Since I don't know how we can make it work for all possible JSON paths
>>>>>>>>>> `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
>>>>>>>>>> - foo
>>>>>>>>>> - foo.bar
>>>>>>>>>> - foo.bar.baz
>>>>>>>>>> 
>>>>>>>>>> 2. how to delete documents
>>>>>>>>>> 
>>>>>>>>>> Best regards,
>>>>>>>>>> ILYA
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 2019/01/30 23:33:22, Michael Fair <michael@daclubhouse.net> wrote:
>>>>>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocolosk@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>> 
>>>>>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
>>>>>>>>>>>> 
>>>>>>>>>>>> Yes that’s definitely one way to address storage of edit conflicts. I
>>>>>>>>>>>> think there are other, more compact representations that we can explore
>>>>>>>>>>>> if we have this “exploded” data model where each scalar value maps to an
>>>>>>>>>>>> individual KV pair.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I agree, as I mentioned on the original thread, I see a scheme that
>>>>>>>>>>> handles both conflicts and revisions, where you only have to store the
>>>>>>>>>>> most recent change to a field.  Like you suggested, multiple revisions
>>>>>>>>>>> can share a key.  Which in my mind's eye further begs the
>>>>>>>>>>> conflicts/revisions discussion along with the working-within-the-limits
>>>>>>>>>>> discussion, because it seems to me they are all intrinsically related as
>>>>>>>>>>> a "feature".
>>>>>>>>>>> 
>>>>>>>>>>> Saying 'We'll break documents up into roughly 80k segments', then
>>>>>>>>>>> trying to overlay some kind of field sharing scheme for
>>>>>>>>>>> revisions/conflicts doesn't seem like it will work.
>>>>>>>>>>> 
>>>>>>>>>>> I probably should have left out the trivial fix proposal as I don't
>>>>>>>>>>> think it's a feasible solution to actually use.
>>>>>>>>>>> 
>>>>>>>>>>> The comment is more that I do not see how this thread can escape
>>>>>>>>>>> including how to store/retrieve conflicts/revisions.
>>>>>>>>>>> 
>>>>>>>>>>> For instance, the 'doc as individual fields' proposal lends itself to
>>>>>>>>>>> value sharing across multiple documents (and I don't just mean revisions
>>>>>>>>>>> of the same doc, I mean the same key/value instance could be shared for
>>>>>>>>>>> every document).
>>>>>>>>>>> However that's not really relevant if we're not considering the amount
>>>>>>>>>>> of shared information across documents in the storage scheme.
>>>>>>>>>>> 
>>>>>>>>>>> Simply storing documents in <100k segments (perhaps in some kind of
>>>>>>>>>>> compressed binary representation) to deal with that FDB limit seems fine.
>>>>>>>>>>> The only reason to consider doing something else is because of its impact
>>>>>>>>>>> to indexing, searches, reduce functions, revisions, on-disk size impact,
>>>>>>>>>>> etc.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>>> I'm assuming the process will flatten the key paths of the document
>>>>>>>>>>>>> into an array and then request the value of each key as multiple
>>>>>>>>>>>>> parallel queries against FDB at once
>>>>>>>>>>>> 
>>>>>>>>>>>> Ah, I think this is not one of Ilya’s assumptions. He’s trying to design
>>>>>>>>>>>> a model which allows the retrieval of a document with a single range
>>>>>>>>>>>> read, which is a good goal in my opinion.
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I am not sure I agree.
>>>>>>>>>>> 
>>>>>>>>>>> Think of BitTorrent: a single range read should pull back the structure
>>>>>>>>>>> of the document (the pieces to fetch), but not necessarily the whole
>>>>>>>>>>> document.
>>>>>>>>>>> 
>>>>>>>>>>> What if you already have a bunch of pieces in common with other
>>>>>>>>>>> documents locally (a repeated header/footer or type, for example) and you
>>>>>>>>>>> only need to get a few pieces of data you don't already have?
>>>>>>>>>>> 
>>>>>>>>>>> The real goal I see for Couch is to treat your document set like the
>>>>>>>>>>> collection of structured information that it is.  In some respects like
>>>>>>>>>>> an extension of your application's heap space for structured objects, and
>>>>>>>>>>> efficiently querying that collection to get back subsets of the data.
>>>>>>>>>>> 
>>>>>>>>>>> Otherwise it seems more like a slightly upgraded file system plus a
>>>>>>>>>>> fancy grep/find like feature...
>>>>>>>>>>> 
>>>>>>>>>>> The best way I see to unlock more features/power is to move towards a
>>>>>>>>>>> more granular and efficient way to store and retrieve the scalar
>>>>>>>>>>> values...
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> For example, here's a crazy thought:
>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a crypto
>>>>>>>>>>> hash function to get a set of hashes.
>>>>>>>>>>> 
>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB. These will
>>>>>>>>>>> be spread all over kingdom come in FDB and not lend themselves to range
>>>>>>>>>>> search well.
>>>>>>>>>>> 
>>>>>>>>>>> So what you do is index them for frequency of occurring in the same set.
>>>>>>>>>>> In essence, you 'bucket them' statistically, and that bucket id becomes a
>>>>>>>>>>> key prefix. A crypto hash value can be copied into more than one bucket.
>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}
>>>>>>>>>>> 
>>>>>>>>>>> When writing a document, Couch submits the list/array of cryptohash
>>>>>>>>>>> values it computed to FDB and gets back the corresponding {val_id} (the
>>>>>>>>>>> id with the bucket prefixed).  This can get somewhat expensive if there's
>>>>>>>>>>> always a lot of app local cache misses.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k per
>>>>>>>>>>> segment.
>>>>>>>>>>> 
>>>>>>>>>>> When retrieving a document, you get the val_ids, find the distinct
>>>>>>>>>>> buckets and min/max entries for this doc, and then parallel query each
>>>>>>>>>>> bucket while reconstructing the document.
>>>>>>>>>>> 
>>>>>>>>>>> The values returned from the buckets query are the key/value strings
>>>>>>>>>>> required to reassemble this document.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> ----------
>>>>>>>>>>> I put this forward primarily to highlight the idea that trying to match
>>>>>>>>>>> the storage representation of documents in a straightforward way to FDB
>>>>>>>>>>> keys to reduce query count might not be the most performance-oriented
>>>>>>>>>>> approach.
>>>>>>>>>>> 
>>>>>>>>>>> I'd much prefer a storage approach that reduced data duplication and
>>>>>>>>>>> enabled fast sub-document queries.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> This clearly falls in the realm of what people want the 'use case' of
>>>>>>>>>>> Couch to be/become.  By giving Couch more access to sub-document queries,
>>>>>>>>>>> I could eventually see queries as complicated as GraphQL submitted to
>>>>>>>>>>> Couch and pulling back ad-hoc aggregated data across multiple documents
>>>>>>>>>>> in a single application layer request.
>>>>>>>>>>> 
>>>>>>>>>>> Hehe - one way to look at the database of Couch documents is that they
>>>>>>>>>>> are all conflict revisions of the single root empty document.  What I
>>>>>>>>>>> mean by this is: consider thinking of the entire document store as one
>>>>>>>>>>> giant DAG of key/value pairs, and how even separate documents are still
>>>>>>>>>>> typically related to each other.  For most applications there is a
>>>>>>>>>>> tremendous amount of data redundancy between docs and especially between
>>>>>>>>>>> revisions of those docs...
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> And all this is a long way of saying "I think there could be a lot of
>>>>>>>>>>> value in assuming documents are 'assembled' from multiple queries to FDB,
>>>>>>>>>>> with local caching, instead of simply retrieved"
>>>>>>>>>>> 
>>>>>>>>>>> Thanks, I hope I'm not the only outlier here thinking this way!?
>>>>>>>>>>> 
>>>>>>>>>>> Mike :-)
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 

