From: Robert Samuel Newson <rnewson@apache.org>
To: CouchDB Developers <dev@couchdb.apache.org>
Subject: Re: [DISCUSS] : things we need to solve/decide : storing JSON documents
Date: Tue, 19 Feb 2019 21:39:14 +0000

Interesting suggestion, though obviously the details might be the wrong kind of fun.

Somewhere above I suggested this would be something we could change over time and even use different approaches for different documents within the same database. This is the long way of saying there are multiple ways to do this, each with advantages and none without disadvantages.
I don't think adding a layer of abstraction is the right move just yet; I think we should continue to find consensus on one answer to this question (and the related ones in other threads) for the first release. It's easy to say "we can change it later", of course. We can, though it would be a chunk of work in the context of something that already works, and I've rarely seen anyone sign up for that.

I'm fine with the first proposal from Adam, where the keys are tuples of key parts pointing at terminal values. To make it easier for the first version, I would exclude optimisations like deduplication, the Directory aliasing, or the schema idea that I suggested and that Ilya incorporated a variant of in a follow-up post. We'd accept that there are limits on the sizes of documents, including the awkward-to-express one about property depth.

Stepping back, I'm not seeing any essential improvement over Adam's original proposal besides the few corrections and clarifications made by various authors. Could we start an RFC based on Adam's original proposal for document body, revision tree and index storage? We could then raise PRs against that for each additional optimisation (one person's optimisation is another person's needless complication).

If I've missed some genuine advance on the original proposal in this long thread, please call it out for me.

B.

> On 19 Feb 2019, at 21:15, Benjamin Anderson wrote:
>
> As is evident by the length of this thread, there's a pretty big
> design space to cover here, and it seems unlikely we'll have arrived
> at a "correct" solution even by the time this thing ships. Perhaps it
> would be worthwhile to treat the in-FDB representation of data as a
> first-class abstraction and support multiple representations
> simultaneously?
>
> Obviously there's no such thing as a zero-cost abstraction - and I've
> not thought very hard about how far up the stack the document
> representation would need to leak - but supporting different layouts
> (primarily, as Adam points out, on the document body itself) might
> prove interesting and useful. I'm sure there are folks interested in a
> column-shaped CouchDB, for example.
>
> --
> b
>
> On Tue, Feb 19, 2019 at 11:39 AM Robert Newson wrote:
>>
>> Good points on revtree, I agree with you we should store that intelligently to gain the benefits you mentioned.
>>
>> --
>> Robert Samuel Newson
>> rnewson@apache.org
>>
>> On Tue, 19 Feb 2019, at 18:41, Adam Kocoloski wrote:
>>> I do not think we should store the revtree as a blob. The design where
>>> each edit branch is its own KV should save on network IO and CPU cycles
>>> for normal updates. We've performed too many heroics to keep
>>> couch_key_tree from stalling entire databases when trying to update a
>>> single document with a wide revision tree; I would much prefer to ignore
>>> other edit branches entirely when all we're doing is extending one of
>>> them.
>>>
>>> I also do not think we should store JSON documents as blobs, but it's a
>>> closer call.
>>> Some of my reasoning for preferring the exploded path design:
>>>
>>> - it lends itself nicely to sub-document operations, for which Jan
>>> crafted an RFC last year: https://github.com/apache/couchdb/issues/1559
>>> - it optimizes the creation of Mango indexes on existing databases, since
>>> we only need to retrieve the value(s) we want to index
>>> - it optimizes Mango queries that use field selectors
>>> - anyone who wants to try their hand at GraphQL will find it very
>>> handy: https://github.com/apache/couchdb/issues/1499
>>> - looking further ahead, it lets us play with smarter leaf value types
>>> like Counters (yes, I'm still on the CRDT bandwagon, sorry)
>>>
>>> A few comments on the thread:
>>>
>>>>>> * Most document bodies are probably going to be smaller than 100k. So in
>>>>>> the majority of cases it would be one write / one read to update and fetch
>>>>>> the document body.
>>>
>>> We should test, but I expect reading 50KB of data in a range query is
>>> almost as efficient as reading a single 50KB value. Similarly, writes
>>> to a contiguous set of keys should be quite efficient.
>>>
>>> I am concerned about the overhead of the repeated field paths in the
>>> keys with the exploded path option in the absence of key prefix
>>> compression. That would be my main reason to acquiesce and throw away
>>> all the document structure.
>>>
>>> Adam
>>>
>>>> On Feb 19, 2019, at 12:04 PM, Robert Newson wrote:
>>>>
>>>> I like the idea that we'd reuse the same pattern (but perhaps not the same _code_) for doc bodies, revtree and attachments.
>>>>
>>>> I hope we still get to delete couch_key_tree.erl, though.
>>>>
>>>> --
>>>> Robert Samuel Newson
>>>> rnewson@apache.org
>>>>
>>>> On Tue, 19 Feb 2019, at 17:03, Jan Lehnardt wrote:
>>>>> I like the idea from a "trying a simple thing first" perspective, but
>>>>> Nick's points below are especially convincing to go with this for now.
>>>>>
>>>>> Best
>>>>> Jan
>>>>> --
>>>>>
>>>>>> On 19. Feb 2019, at 17:53, Nick Vatamaniuc wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Sorry for jumping in so late, I was mostly following from the sidelines. A
>>>>>> lot of good discussion is happening and I am excited about the possibilities
>>>>>> here.
>>>>>>
>>>>>> I do like the simpler "chunking" approach for a few reasons:
>>>>>>
>>>>>> * Most document bodies are probably going to be smaller than 100k. So in
>>>>>> the majority of cases it would be one write / one read to update and fetch
>>>>>> the document body.
>>>>>>
>>>>>> * We could reuse the chunking code for attachment handling and possibly
>>>>>> revision key trees. So it's the general pattern of uploading chunks to some
>>>>>> prefix and, when finished, flipping an atomic toggle to make it current
>>>>>> (rough sketch below).
>>>>>>
>>>>>> * Do the same thing with revision trees and we could re-use the revision
>>>>>> tree manipulation logic. That is, the key tree in most cases would be small
>>>>>> enough to fit in 100k, but if it got huge it would get chunked. This would
>>>>>> allow us to reuse all the battle-tested couch_key_tree code mostly as is.
>>>>>> We even have property tests for it:
>>>>>> https://github.com/apache/couchdb/blob/master/src/couch/test/couch_key_tree_prop_tests.erl
>>>>>>
>>>>>> * It removes the need to explain the max exploded path length limitation to
>>>>>> customers.
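>>>>>>
>>>>>> Roughly, the write/read side of that chunking pattern, as an untested
>>>>>> sketch against the Python fdb bindings (the key layout and the 90k
>>>>>> chunk size are made up for illustration; because a single FDB
>>>>>> transaction is atomic, the swap below is the "flip" - a separate
>>>>>> "current" toggle only becomes necessary if chunk uploads ever span
>>>>>> transactions):
>>>>>>
>>>>>>     import json
>>>>>>     import fdb
>>>>>>
>>>>>>     fdb.api_version(600)
>>>>>>     db = fdb.open()
>>>>>>
>>>>>>     CHUNK = 90000  # stay comfortably under FDB's 100k value limit
>>>>>>
>>>>>>     @fdb.transactional
>>>>>>     def write_doc_body(tr, dbname, doc_id, doc):
>>>>>>         body = json.dumps(doc).encode("utf-8")
>>>>>>         prefix = (dbname, "docs", doc_id)
>>>>>>         # drop any previous chunks, then write the new ones
>>>>>>         tr.clear_range_startswith(fdb.tuple.pack(prefix))
>>>>>>         for n, i in enumerate(range(0, len(body), CHUNK)):
>>>>>>             tr[fdb.tuple.pack(prefix + (n,))] = body[i:i + CHUNK]
>>>>>>
>>>>>>     @fdb.transactional
>>>>>>     def read_doc_body(tr, dbname, doc_id):
>>>>>>         prefix = fdb.tuple.pack((dbname, "docs", doc_id))
>>>>>>         # one range read returns the chunks in key order
>>>>>>         chunks = [kv.value for kv in tr.get_range_startswith(prefix)]
>>>>>>         return json.loads(b"".join(chunks)) if chunks else None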
>>>>>>
>>>>>> Cheers,
>>>>>> -Nick
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 19, 2019 at 11:18 AM Robert Newson wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> An alternative storage model that we should seriously consider is to
>>>>>>> follow our current approach in couch_file et al. Specifically, the
>>>>>>> document _body_ is stored as an uninterpreted binary value. This would be
>>>>>>> much like the obvious plan for attachment storage: a key prefix that
>>>>>>> identifies the database and document, with the final item of that key tuple
>>>>>>> being an incrementing integer. Each of those keys has a binary value of up
>>>>>>> to 100k. Fetching all values with that key prefix, in fdb's natural ordering,
>>>>>>> will yield the full document body, which can be JSON decoded for further
>>>>>>> processing.
>>>>>>>
>>>>>>> I like this idea, and I like Adam's original proposal to explode documents
>>>>>>> into property paths. I have a slight preference for the simplicity of the
>>>>>>> idea in the previous paragraph, not least because it's close to what we do
>>>>>>> today. I also think it will be possible to migrate to alternative storage
>>>>>>> models in future, and FoundationDB's transaction support means we can do
>>>>>>> this migration seamlessly should we come to it.
>>>>>>>
>>>>>>> I'm very interested in knowing if anyone else is interested in going this
>>>>>>> simple, or considers it a wasted opportunity relative to the 'exploded'
>>>>>>> path.
>>>>>>>
>>>>>>> B.
>>>>>>>
>>>>>>> --
>>>>>>> Robert Samuel Newson
>>>>>>> rnewson@apache.org
>>>>>>>
>>>>>>> On Mon, 4 Feb 2019, at 19:59, Robert Newson wrote:
>>>>>>>> I've been remiss here in not posting the data model ideas that IBM
>>>>>>>> worked up while we were thinking about using FoundationDB, so I'm posting
>>>>>>>> them now. This is Adam Kocoloski's original work, I am just transcribing
>>>>>>>> it, and this is the context that the folks from the IBM side came in
>>>>>>>> with, for full disclosure.
>>>>>>>>
>>>>>>>> Basics
>>>>>>>>
>>>>>>>> 1. All CouchDB databases are inside a Directory
>>>>>>>> 2. Each CouchDB database is a Directory within that Directory
>>>>>>>> 3. It's possible to list all subdirectories of a Directory, so
>>>>>>>> `_all_dbs` is the list of directories from 1.
>>>>>>>> 4. Each Directory representing a CouchDB database has several Subspaces:
>>>>>>>> 4a. by_id / doc subspace: actual document contents
>>>>>>>> 4b. by_seq / versionstamp subspace: for the _changes feed
>>>>>>>> 4c. index_definitions, indexes, ...
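>>>>>>>>
>>>>>>>> An untested sketch of points 1-4 using the Python fdb bindings'
>>>>>>>> directory layer; the "couchdb" root and the subspace names are
>>>>>>>> placeholders, not a decided layout:
>>>>>>>>
>>>>>>>>     import fdb
>>>>>>>>
>>>>>>>>     fdb.api_version(600)
>>>>>>>>     db = fdb.open()
>>>>>>>>
>>>>>>>>     # 1. all CouchDB databases live inside one root Directory
>>>>>>>>     root = fdb.directory.create_or_open(db, ("couchdb",))
>>>>>>>>
>>>>>>>>     def create_db(db, name):
>>>>>>>>         # 2. each CouchDB database is a Directory under the root
>>>>>>>>         d = root.create_or_open(db, name)
>>>>>>>>         # 4. per-database subspaces hang off that Directory
>>>>>>>>         return {
>>>>>>>>             "by_id": d["by_id"],    # 4a. document contents
>>>>>>>>             "by_seq": d["by_seq"],  # 4b. _changes feed
>>>>>>>>             "index_definitions": d["index_definitions"],  # 4c.
>>>>>>>>         }
>>>>>>>>
>>>>>>>>     def all_dbs(db):
>>>>>>>>         # 3. _all_dbs is just a listing of the root Directory
>>>>>>>>         return root.list(db)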
>>>>>>>>
>>>>>>>> JSON Mapping
>>>>>>>>
>>>>>>>> A hierarchical JSON object naturally maps to multiple KV pairs in FDB:
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "_id": "foo",
>>>>>>>>   "owner": "bob",
>>>>>>>>   "mylist": [1,3,5],
>>>>>>>>   "mymap": {
>>>>>>>>     "blue": "#0000FF",
>>>>>>>>     "red": "#FF0000"
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> maps to
>>>>>>>>
>>>>>>>> ("foo", "owner") = "bob"
>>>>>>>> ("foo", "mylist", 0) = 1
>>>>>>>> ("foo", "mylist", 1) = 3
>>>>>>>> ("foo", "mylist", 2) = 5
>>>>>>>> ("foo", "mymap", "blue") = "#0000FF"
>>>>>>>> ("foo", "mymap", "red") = "#FF0000"
>>>>>>>>
>>>>>>>> NB: this means that the 100KB limit applies to individual leaves of the
>>>>>>>> JSON object, not the entire doc
>>>>>>>>
>>>>>>>> Edit Conflicts
>>>>>>>>
>>>>>>>> We need to account for the presence of conflicts at various levels of
>>>>>>>> the doc due to replication.
>>>>>>>>
>>>>>>>> The proposal is to create a special value indicating that the subtree below
>>>>>>>> our current cursor position is in an unresolvable conflict, then add
>>>>>>>> additional KV pairs below it to describe the conflicting entries.
>>>>>>>>
>>>>>>>> The KV data model allows us to store these efficiently and minimize
>>>>>>>> duplication of data.
>>>>>>>>
>>>>>>>> A document with these two conflicts:
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "_id": "foo",
>>>>>>>>   "_rev": "1-abc",
>>>>>>>>   "owner": "alice",
>>>>>>>>   "active": true
>>>>>>>> }
>>>>>>>> {
>>>>>>>>   "_id": "foo",
>>>>>>>>   "_rev": "1-def",
>>>>>>>>   "owner": "bob",
>>>>>>>>   "active": true
>>>>>>>> }
>>>>>>>>
>>>>>>>> could be stored thus:
>>>>>>>>
>>>>>>>> ("foo", "active") = true
>>>>>>>> ("foo", "owner") = kCONFLICT
>>>>>>>> ("foo", "owner", "1-abc") = "alice"
>>>>>>>> ("foo", "owner", "1-def") = "bob"
>>>>>>>>
>>>>>>>> So long as `kCONFLICT` is set at the top of the conflicting subtree, this
>>>>>>>> representation can handle conflicts of different data types as well.
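>>>>>>>>
>>>>>>>> A rough, untested Python sketch of that merge; it simplifies by always
>>>>>>>> placing kCONFLICT at the top-level field rather than pushing it down to
>>>>>>>> the deepest diverging subtree:
>>>>>>>>
>>>>>>>>     KCONFLICT, KMISSING = "kCONFLICT", "kMISSING"
>>>>>>>>
>>>>>>>>     def explode(value, prefix=()):
>>>>>>>>         # flatten one JSON value into (path tuple, scalar) rows,
>>>>>>>>         # exactly as in the JSON Mapping section above
>>>>>>>>         if isinstance(value, dict):
>>>>>>>>             return [r for k, v in value.items()
>>>>>>>>                     for r in explode(v, prefix + (k,))]
>>>>>>>>         if isinstance(value, list):
>>>>>>>>             return [r for i, v in enumerate(value)
>>>>>>>>                     for r in explode(v, prefix + (i,))]
>>>>>>>>         return [(prefix, value)]
>>>>>>>>
>>>>>>>>     def merge_conflicts(doc_id, revs):
>>>>>>>>         # revs: {rev_id: body dict}. Fields every revision agrees on
>>>>>>>>         # are stored once; a diverging field gets a kCONFLICT marker
>>>>>>>>         # with one subtree per revision underneath it.
>>>>>>>>         rows = []
>>>>>>>>         fields = {k for body in revs.values() for k in body}
>>>>>>>>         for field in sorted(fields):
>>>>>>>>             per_rev = {r: b.get(field, KMISSING) for r, b in revs.items()}
>>>>>>>>             first = next(iter(per_rev.values()))
>>>>>>>>             if all(v == first for v in per_rev.values()):
>>>>>>>>                 rows += explode(first, (doc_id, field))
>>>>>>>>             else:
>>>>>>>>                 rows.append(((doc_id, field), KCONFLICT))
>>>>>>>>                 for rev, v in sorted(per_rev.items()):
>>>>>>>>                     if isinstance(v, (dict, list)):
>>>>>>>>                         rows += explode(v, (doc_id, field, rev))
>>>>>>>>                     else:
>>>>>>>>                         rows.append(((doc_id, field, rev), v))
>>>>>>>>         return rows
>>>>>>>>
>>>>>>>> Feeding it the two "1-abc" / "1-def" bodies above (minus _id and _rev)
>>>>>>>> reproduces the four rows listed.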
>>>>>>>>
>>>>>>>> Missing fields need to be handled explicitly:
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "_id": "foo",
>>>>>>>>   "_rev": "1-abc",
>>>>>>>>   "owner": "alice",
>>>>>>>>   "active": true
>>>>>>>> }
>>>>>>>>
>>>>>>>> {
>>>>>>>>   "_id": "foo",
>>>>>>>>   "_rev": "1-def",
>>>>>>>>   "owner": {
>>>>>>>>     "name": "bob",
>>>>>>>>     "email": "bob@example.com"
>>>>>>>>   }
>>>>>>>> }
>>>>>>>>
>>>>>>>> could be stored thus:
>>>>>>>>
>>>>>>>> ("foo", "active") = kCONFLICT
>>>>>>>> ("foo", "active", "1-abc") = true
>>>>>>>> ("foo", "active", "1-def") = kMISSING
>>>>>>>> ("foo", "owner") = kCONFLICT
>>>>>>>> ("foo", "owner", "1-abc") = "alice"
>>>>>>>> ("foo", "owner", "1-def", "name") = "bob"
>>>>>>>> ("foo", "owner", "1-def", "email") = ...
>>>>>>>>
>>>>>>>> Revision Metadata
>>>>>>>>
>>>>>>>> * CouchDB uses a hash history for revisions
>>>>>>>> ** Each edit is identified by the hash of the content of the edit
>>>>>>>> including the base revision against which it was applied
>>>>>>>> ** Individual edit branches are bounded in length but the number of
>>>>>>>> branches is potentially unbounded
>>>>>>>>
>>>>>>>> * Size limits preclude us from storing the entire key tree as a single
>>>>>>>> value; in pathological situations the tree could exceed 100KB (each
>>>>>>>> entry is > 16 bytes)
>>>>>>>>
>>>>>>>> * Store each edit branch as a separate KV including deleted status in a
>>>>>>>> special subspace
>>>>>>>>
>>>>>>>> * Structure key representation so that "winning" revision can be
>>>>>>>> automatically retrieved in a limit=1 key range operation
>>>>>>>>
>>>>>>>> ("foo", "_meta", "deleted=false", 1, "def") = []
>>>>>>>> ("foo", "_meta", "deleted=false", 4, "bif") = ["3-baz", "2-bar", "1-foo"]  <-- winner
>>>>>>>> ("foo", "_meta", "deleted=true", 3, "abc") = ["2-bar", "1-foo"]
>>>>>>>>
>>>>>>>> Changes Feed
>>>>>>>>
>>>>>>>> * FDB supports a concept called a versionstamp: a 10 byte, unique,
>>>>>>>> monotonically (but not sequentially) increasing value for each committed
>>>>>>>> transaction. The first 8 bytes are the committed version of the
>>>>>>>> database. The last 2 bytes are monotonic in the serialization order for
>>>>>>>> transactions.
>>>>>>>>
>>>>>>>> * A transaction can specify a particular index into a key where the
>>>>>>>> following 10 bytes will be overwritten by the versionstamp at commit
>>>>>>>> time
>>>>>>>>
>>>>>>>> * A subspace keyed on versionstamp naturally yields a _changes feed
>>>>>>>>
>>>>>>>> by_seq subspace
>>>>>>>> ("versionstamp1") = ("foo", "1-abc")
>>>>>>>> ("versionstamp4") = ("bar", "4-def")
>>>>>>>>
>>>>>>>> by_id subspace
>>>>>>>> ("bar", "_vsn") = "versionstamp4"
>>>>>>>> ...
>>>>>>>> ("foo", "_vsn") = "versionstamp1"
>>>>>>>>
>>>>>>>> JSON Indexes
>>>>>>>>
>>>>>>>> * "Mango" JSON indexes are defined by
>>>>>>>> ** a list of field names, each of which may be nested,
>>>>>>>> ** an optional partial_filter_selector which constrains the set of docs
>>>>>>>> that contribute
>>>>>>>> ** an optional name defined by the ddoc field (the name is
>>>>>>>> auto-generated if not supplied)
>>>>>>>>
>>>>>>>> * Store index definitions in a single subspace to aid query planning
>>>>>>>> ** ((person, name), title, email) = ("name-title-email", "{"student":
>>>>>>>> true}")
>>>>>>>> ** Store the values for each index in a dedicated subspace, adding the
>>>>>>>> document ID as the last element in the tuple
>>>>>>>> *** ("rosie revere", "engineer", "rosie@example.com", "foo") = null
>>>>>>>>
>>>>>>>> B.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Robert Samuel Newson
>>>>>>>> rnewson@apache.org
>>>>>>>>
>>>>>>>> On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
>>>>>>>>>
>>>>>>>>> I want to fix previous mistakes. I made two mistakes in the previous
>>>>>>>>> calculations:
>>>>>>>>> - I used 1Kb as the base size for calculating the expansion factor
>>>>>>>>> (although we don't know the exact size of the original document)
>>>>>>>>> - The expansion factor calculation included the number of revisions
>>>>>>>>> (it shouldn't)
>>>>>>>>>
>>>>>>>>> I'll focus on the flattened JSON docs model.
>>>>>>>>>
>>>>>>>>> The following formula was used in the previous calculation:
>>>>>>>>> storage_size_per_document = mapping_table_size*number_of_revisions +
>>>>>>>>> depth*number_of_paths*number_of_revisions +
>>>>>>>>> number_of_paths*value_size*number_of_revisions
>>>>>>>>>
>>>>>>>>> To clarify things a little bit, I want to calculate the space requirement
>>>>>>>>> for a single revision this time.
>>>>>>>>> mapping_table_size = number_of_field_names*(field_name_length + 4 (integer
>>>>>>>>> size)) = 100 * (20 + 4) = 2400 bytes
>>>>>>>>> storage_size_per_document_per_revision_per_replica = mapping_table_size +
>>>>>>>>> depth*number_of_paths + value_size*number_of_paths =
>>>>>>>>> 2400 bytes + 10*1000 + 1000*100 = 112400 bytes ~= 110 Kb
>>>>>>>>>
>>>>>>>>> We can definitely reduce the requirement for the mapping table by adopting
>>>>>>>>> rnewson's idea of a schema.
>>>>>>>>>
>>>>>>>>> On 2019/02/04 11:08:16, Ilya Khlopotov wrote:
>>>>>>>>>> Hi Michael,
>>>>>>>>>>
>>>>>>>>>>> For example, here's a crazy thought:
>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a
>>>>>>>>>>> crypto hash function to get a set of hashes.
>>>>>>>>>>>
>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB.
>>>>>>>>>>> These will be spread all over kingdom come in FDB and not lend
>>>>>>>>>>> themselves to range search well.
>>>>>>>>>>>
>>>>>>>>>>> So what you do is index them for frequency of occurring in the
>>>>>>>>>>> same set. In essence, you 'bucket them' statistically, and that bucket id
>>>>>>>>>>> becomes a key prefix. A crypto hash value can be copied into more than one
>>>>>>>>>>> bucket. The {bucket_id}/{cryptohash} becomes a {val_id}.
>>>>>>>>>>>
>>>>>>>>>>> When writing a document, Couch submits the list/array of cryptohash
>>>>>>>>>>> values it computed to FDB and gets back the corresponding {val_id} (the id
>>>>>>>>>>> with the bucket prefixed). This can get somewhat expensive if there are
>>>>>>>>>>> always a lot of app-local cache misses.
>>>>>>>>>>>
>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k
>>>>>>>>>>> per segment.
>>>>>>>>>>>
>>>>>>>>>>> When retrieving a document, you get the val_ids, find the distinct
>>>>>>>>>>> buckets and min/max entries for this doc, and then parallel query each
>>>>>>>>>>> bucket while reconstructing the document.
>>>>>>>>>>
>>>>>>>>>> Interesting idea. Let's try to think it through to see if we can
>>>>>>>>>> make it viable.
>>>>>>>>>> Let's go through a hypothetical example. Input data for the example:
>>>>>>>>>> - 1M documents
>>>>>>>>>> - each document is around 10Kb
>>>>>>>>>> - each document consists of 1K unique JSON paths
>>>>>>>>>> - each document has 100 unique JSON field names
>>>>>>>>>> - every scalar value is 100 bytes
>>>>>>>>>> - 10% of the unique JSON paths for every document are already stored
>>>>>>>>>> in the database under a different doc or a different revision of the
>>>>>>>>>> current one
>>>>>>>>>> - we assume 3 independent copies of every key-value pair in FDB
>>>>>>>>>> - our hash key size is 32 bytes
>>>>>>>>>> - let's assume we can determine whether a key is already in storage
>>>>>>>>>> without doing a query
>>>>>>>>>> - 1% of paths are in cache (an unrealistic value; in real life the
>>>>>>>>>> percentage is lower)
>>>>>>>>>> - every JSON field name is 20 bytes
>>>>>>>>>> - every JSON path is 10 levels deep
>>>>>>>>>> - document key prefix length is 50
>>>>>>>>>> - every document has 10 revisions
>>>>>>>>>> Let's estimate the storage requirements and the size of data we need
>>>>>>>>>> to transmit. The calculations are not exact.
>>>>>>>>>> 1. storage_size_per_document (we cannot estimate exact numbers since
>>>>>>>>>> we don't know how FDB stores it)
>>>>>>>>>> - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) =
>>>>>>>>>> 38Kb * 10 * 3 = 1140 Kb (11x)
>>>>>>>>>> 2. number of independent keys to retrieve on document read
>>>>>>>>>> (non-range queries) per document
>>>>>>>>>> - 1K - (1K * 1%) = 990
>>>>>>>>>> 3. number of range queries: 0
>>>>>>>>>> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32
>>>>>>>>>> bytes) = 102 Kb (10x)
>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from
>>>>>>>>>> https://apple.github.io/foundationdb/performance.html)
>>>>>>>>>> - sequential: 990*2ms = 1980ms
>>>>>>>>>> - range: 0
>>>>>>>>>> Let's compare these numbers with the initial proposal (flattened JSON
>>>>>>>>>> docs without global schema and without cache):
>>>>>>>>>> 1. storage_size_per_document
>>>>>>>>>> - mapping table size: 100 * (20 + 4 (integer size)) = 2400 bytes
>>>>>>>>>> - key size: (10 * (4 + 1 (delimiter))) + 50 = 100 bytes
>>>>>>>>>> - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 =
>>>>>>>>>> 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
>>>>>>>>>> 2. number of independent keys to retrieve: 0-2 (depending on index
>>>>>>>>>> structure)
>>>>>>>>>> 3. number of range queries: 1 (1001 keys in the result)
>>>>>>>>>> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb
>>>>>>>>>> (2.4x)
>>>>>>>>>> 5. read latency (we use 2ms per read based on numbers from
>>>>>>>>>> https://apple.github.io/foundationdb/performance.html and estimate range
>>>>>>>>>> read performance based on numbers from
>>>>>>>>>> https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
>>>>>>>>>> - range read performance: given read performance is about 305,000
>>>>>>>>>> reads/second and range performance 3,600,000 keys/second, we estimate range
>>>>>>>>>> performance to be 11.8x compared to read performance. If read performance
>>>>>>>>>> is 2ms then range performance is 0.169ms (which is hard to believe).
>>>>>>>>>> - sequential: 2 * 2 = 4ms
>>>>>>>>>> - range: 0.169
>>>>>>>>>>
>>>>>>>>>> It looks like we are dealing with a tradeoff:
>>>>>>>>>> - Map every distinct occurrence of a key/value instance through a
>>>>>>>>>> crypto hash:
>>>>>>>>>>   - 5.39x more disk space efficient
>>>>>>>>>>   - 474x slower
>>>>>>>>>> - flattened JSON model:
>>>>>>>>>>   - 5.39x less efficient in disk space
>>>>>>>>>>   - 474x faster
>>>>>>>>>>
>>>>>>>>>> In any case this unscientific exercise was very helpful, since it
>>>>>>>>>> uncovered the high cost in terms of disk space. 59.3x of the original disk
>>>>>>>>>> size is too much IMO.
>>>>>>>>>>
>>>>>>>>>> Are there any ways we can make Michael's model more performant?
>>>>>>>>>>
>>>>>>>>>> Also, I don't quite understand a few aspects of the global hash table
>>>>>>>>>> proposal:
>>>>>>>>>>
>>>>>>>>>> 1. > - Map every distinct occurrence of a key/value instance through
>>>>>>>>>> a crypto hash function to get a set of hashes.
>>>>>>>>>> I think we are talking only about scalar values here? I.e.
>>>>>>>>>> `"#/foo.bar.baz": 123`
>>>>>>>>>> Since I don't know how we can make it work for all possible JSON
>>>>>>>>>> paths of `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
>>>>>>>>>> - foo
>>>>>>>>>> - foo.bar
>>>>>>>>>> - foo.bar.baz
>>>>>>>>>>
>>>>>>>>>> 2. how to delete documents
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> ILYA
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 2019/01/30 23:33:22, Michael Fair wrote:
>>>>>>>>>>> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Michael,
>>>>>>>>>>>>
>>>>>>>>>>>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, that's definitely one way to address storage of edit
>>>>>>>>>>>> conflicts. I think there are other, more compact representations that we
>>>>>>>>>>>> can explore if we have this "exploded" data model where each scalar value
>>>>>>>>>>>> maps to an individual KV pair.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I agree; as I mentioned on the original thread, I see a scheme that
>>>>>>>>>>> handles both conflicts and revisions where you only have to store the most
>>>>>>>>>>> recent change to a field. Like you suggested, multiple revisions can share
>>>>>>>>>>> a key.
>>>>>>>>>>> Which, in my mind's eye, further invites the conflicts/revisions
>>>>>>>>>>> discussion along with the working-within-the-limits discussion, because it
>>>>>>>>>>> seems to me they are all intrinsically related as a "feature".
>>>>>>>>>>>
>>>>>>>>>>> Saying 'we'll break documents up into roughly 80k segments', then trying to
>>>>>>>>>>> overlay some kind of field sharing scheme for revisions/conflicts, doesn't
>>>>>>>>>>> seem like it will work.
>>>>>>>>>>>
>>>>>>>>>>> I probably should have left out the trivial fix proposal as I don't think
>>>>>>>>>>> it's a feasible solution to actually use.
>>>>>>>>>>>
>>>>>>>>>>> The comment was more that I do not see how this thread can escape
>>>>>>>>>>> including how to store/retrieve conflicts/revisions.
>>>>>>>>>>>
>>>>>>>>>>> For instance, the 'doc as individual fields' proposal lends itself to value
>>>>>>>>>>> sharing across multiple documents (and I don't just mean revisions of the
>>>>>>>>>>> same doc, I mean the same key/value instance could be shared for every
>>>>>>>>>>> document).
>>>>>>>>>>> However, that's not really relevant if we're not considering the amount of
>>>>>>>>>>> shared information across documents in the storage scheme.
>>>>>>>>>>>
>>>>>>>>>>> Simply storing documents in <100k segments (perhaps in some kind of
>>>>>>>>>>> compressed binary representation) to deal with that FDB limit seems fine.
>>>>>>>>>>> The only reason to consider doing something else is because of its impact
>>>>>>>>>>> on indexing, searches, reduce functions, revisions, on-disk size, etc.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>> I'm assuming the process will flatten the key paths of the document into
>>>>>>>>>>>>> an array and then request the value of each key as multiple parallel
>>>>>>>>>>>>> queries against FDB at once
>>>>>>>>>>>>
>>>>>>>>>>>> Ah, I think this is not one of Ilya's assumptions. He's trying to design a
>>>>>>>>>>>> model which allows the retrieval of a document with a single range read,
>>>>>>>>>>>> which is a good goal in my opinion.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I am not sure I agree.
>>>>>>>>>>>
>>>>>>>>>>> Think of BitTorrent: a single range read should pull back the structure of
>>>>>>>>>>> the document (the pieces to fetch), but not necessarily the whole document.
>>>>>>>>>>>
>>>>>>>>>>> What if you already have a bunch of pieces in common with other documents
>>>>>>>>>>> locally (a repeated header/footer or type, for example), and you only need
>>>>>>>>>>> to get a few pieces of data you don't already have?
>>>>>>>>>>>
>>>>>>>>>>> The real goal I see for Couch is to treat your document set like the
>>>>>>>>>>> collection of structured information that it is, in some respects like an
>>>>>>>>>>> extension of your application's heap space for structured objects, and to
>>>>>>>>>>> efficiently query that collection to get back subsets of the data.
>>>>>>>>>>>
>>>>>>>>>>> Otherwise it seems more like a slightly upgraded file system plus a fancy
>>>>>>>>>>> grep/find-like feature...
>>>>>>>>>>>
>>>>>>>>>>> The best way I see to unlock more features/power is to move towards a more
>>>>>>>>>>> granular and efficient way to store and retrieve the scalar values...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> For example, here's a crazy thought:
>>>>>>>>>>> Map every distinct occurrence of a key/value instance through a crypto hash
>>>>>>>>>>> function to get a set of hashes.
>>>>>>>>>>>
>>>>>>>>>>> These can be precomputed by Couch without any lookups in FDB. These will be
>>>>>>>>>>> spread all over kingdom come in FDB and not lend themselves to range search
>>>>>>>>>>> well.
>>>>>>>>>>>
>>>>>>>>>>> So what you do is index them for frequency of occurring in the same set.
>>>>>>>>>>> In essence, you 'bucket them' statistically, and that bucket id becomes a
>>>>>>>>>>> key prefix. A crypto hash value can be copied into more than one bucket.
>>>>>>>>>>> The {bucket_id}/{cryptohash} becomes a {val_id}.
>>>>>>>>>>>
>>>>>>>>>>> When writing a document, Couch submits the list/array of cryptohash values
>>>>>>>>>>> it computed to FDB and gets back the corresponding {val_id} (the id with
>>>>>>>>>>> the bucket prefixed). This can get somewhat expensive if there are always a
>>>>>>>>>>> lot of app-local cache misses.
>>>>>>>>>>>
>>>>>>>>>>> A document's value is then a series of {val_id} arrays up to 100k per
>>>>>>>>>>> segment.
>>>>>>>>>>>
>>>>>>>>>>> When retrieving a document, you get the val_ids, find the distinct buckets
>>>>>>>>>>> and min/max entries for this doc, and then parallel query each bucket while
>>>>>>>>>>> reconstructing the document.
>>>>>>>>>>>
>>>>>>>>>>> The values returned from the buckets query are the key/value strings
>>>>>>>>>>> required to reassemble this document.
>>>>>>>>>>>
>>>>>>>>>>> ----------
>>>>>>>>>>> I put this forward primarily to highlight the idea that trying to match the
>>>>>>>>>>> storage representation of documents in a straightforward way to FDB keys,
>>>>>>>>>>> to reduce query count, might not be the most performance-oriented approach.
>>>>>>>>>>>
>>>>>>>>>>> I'd much prefer a storage approach that reduced data duplication and
>>>>>>>>>>> enabled fast sub-document queries.
>>>>>>>>>>>
>>>>>>>>>>> This clearly falls in the realm of what people want the 'use case' of Couch
>>>>>>>>>>> to be/become. By giving Couch more access to sub-document queries, I could
>>>>>>>>>>> eventually see queries as complicated as GraphQL submitted to Couch,
>>>>>>>>>>> pulling back ad-hoc aggregated data across multiple documents in a single
>>>>>>>>>>> application-layer request.
>>>>>>>>>>>
>>>>>>>>>>> Hehe - one way to look at the database of Couch documents is that they are
>>>>>>>>>>> all conflict revisions of the single root empty document. What I mean by
>>>>>>>>>>> this is: consider thinking of the entire document store as one giant DAG of
>>>>>>>>>>> key/value pairs, where even separate documents are still typically related
>>>>>>>>>>> to each other. For most applications there is a tremendous amount of data
>>>>>>>>>>> redundancy between docs, and especially between revisions of those docs...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> And all this is a long way of saying "I think there could be a lot of value
>>>>>>>>>>> in assuming documents are 'assembled' from multiple queries to FDB, with
>>>>>>>>>>> local caching, instead of simply retrieved".
>>>>>>>>>>>
>>>>>>>>>>> Thanks, I hope I'm not the only outlier here thinking this way!?
>>>>>>>>>>>
>>>>>>>>>>> Mike :-)
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>
>>>