From: Robert Newson
To: dev@couchdb.apache.org
Subject: Re: [DISCUSS] : things we need to solve/decide : storing JSON documents
Date: Mon, 04 Feb 2019 19:59:32 +0000

I've been remiss here in not posting the data model ideas that IBM worked up while we were thinking about using FoundationDB, so I'm posting them now. This is Adam Kocoloski's original work, I am just transcribing it, and this is the context that the folks from the IBM side came in with, for full disclosure.

Basics

1. All CouchDB databases are inside a Directory
2. Each CouchDB database is a Directory within that Directory
3. It's possible to list all subdirectories of a Directory, so `_all_dbs` is the list of subdirectories of the Directory in 1.
4. Each Directory representing a CouchDB database has several Subspaces:
4a. by_id/ doc subspace: actual document contents
4b. by_seq/versionstamp subspace: for the _changes feed
4c. index_definitions, indexes, ...
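To make that layout concrete, here is a rough sketch using the FoundationDB Python bindings. It is only an illustration of the Basics above; the directory and subspace names ('couchdb', 'mydb', 'by_id', 'by_seq') are placeholders, not a proposed naming scheme.

    import fdb

    fdb.api_version(600)
    db = fdb.open()

    # 1/2: a top-level Directory, with one sub-Directory per CouchDB database
    couch = fdb.directory.create_or_open(db, ('couchdb',))
    mydb = couch.create_or_open(db, ('mydb',))

    # 3: _all_dbs falls out of listing the subdirectories
    all_dbs = couch.list(db)

    # 4: per-database subspaces
    by_id = mydb['by_id']    # 4a: actual document contents
    by_seq = mydb['by_seq']  # 4b: versionstamp subspace for the _changes feed

    # one exploded document field would live at a tuple-encoded key, e.g.
    db[by_id.pack(('foo', 'owner'))] = fdb.tuple.pack(('bob',))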
JSON Mapping

A hierarchical JSON object naturally maps to multiple KV pairs in FDB:

{
  "_id": "foo",
  "owner": "bob",
  "mylist": [1, 3, 5],
  "mymap": {
    "blue": "#0000FF",
    "red": "#FF0000"
  }
}

maps to

("foo", "owner") = "bob"
("foo", "mylist", 0) = 1
("foo", "mylist", 1) = 3
("foo", "mylist", 2) = 5
("foo", "mymap", "blue") = "#0000FF"
("foo", "mymap", "red") = "#FF0000"

NB: this means that the 100KB limit applies to individual leaves of the JSON object, not the entire doc.

Edit Conflicts

We need to account for the presence of conflicts at various levels of the doc due to replication.

The proposal is to create a special value indicating that the subtree below our current cursor position is in an unresolvable conflict, then add additional KV pairs below it to describe the conflicting entries.

The KV data model allows us to store these efficiently and minimize duplication of data. A document with these two conflicting revisions:

{
  "_id": "foo",
  "_rev": "1-abc",
  "owner": "alice",
  "active": true
}

{
  "_id": "foo",
  "_rev": "1-def",
  "owner": "bob",
  "active": true
}

could be stored thus:

("foo", "active") = true
("foo", "owner") = kCONFLICT
("foo", "owner", "1-abc") = "alice"
("foo", "owner", "1-def") = "bob"

So long as `kCONFLICT` is set at the top of the conflicting subtree, this representation can handle conflicts of different data types as well.

Missing fields need to be handled explicitly:

{
  "_id": "foo",
  "_rev": "1-abc",
  "owner": "alice",
  "active": true
}

{
  "_id": "foo",
  "_rev": "1-def",
  "owner": {
    "name": "bob",
    "email": "bob@example.com"
  }
}

could be stored thus:

("foo", "active") = kCONFLICT
("foo", "active", "1-abc") = true
("foo", "active", "1-def") = kMISSING
("foo", "owner") = kCONFLICT
("foo", "owner", "1-abc") = "alice"
("foo", "owner", "1-def", "name") = "bob"
("foo", "owner", "1-def", "email") = ...
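To make the explosion step concrete, here is a small plain-Python sketch of the flattening described above (tuple packing, subspaces and the conflict markers are deliberately left out):

    def flatten(doc_id, value, path=()):
        # Explode a JSON document into (key tuple, scalar) pairs, mirroring
        # the mapping above, e.g. ("foo", "mymap", "blue") = "#0000FF".
        if isinstance(value, dict):
            for k, v in value.items():
                yield from flatten(doc_id, v, path + (k,))
        elif isinstance(value, list):
            for i, v in enumerate(value):
                yield from flatten(doc_id, v, path + (i,))
        else:
            # A scalar leaf (string, number, bool, null) becomes its own KV
            # pair, which is why the 100KB value limit applies per leaf.
            yield (doc_id,) + path, value

    doc = {"owner": "bob", "mylist": [1, 3, 5],
           "mymap": {"blue": "#0000FF", "red": "#FF0000"}}
    for key, value in flatten("foo", doc):
        print(key, "=", value)
    # ('foo', 'owner') = bob
    # ('foo', 'mylist', 0) = 1
    # ... and so on, as in the listing above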
Revision Metadata

* CouchDB uses a hash history for revisions
** Each edit is identified by the hash of the content of the edit, including the base revision against which it was applied
** Individual edit branches are bounded in length, but the number of branches is potentially unbounded
* Size limits preclude us from storing the entire key tree as a single value; in pathological situations the tree could exceed 100KB (each entry is > 16 bytes)
* Store each edit branch as a separate KV, including deleted status, in a special subspace
* Structure the key representation so that the "winning" revision can be retrieved automatically with a limit=1 key range operation

("foo", "_meta", "deleted=false", 1, "def") = []
("foo", "_meta", "deleted=false", 4, "bif") = ["3-baz", "2-bar", "1-foo"] <-- winner
("foo", "_meta", "deleted=true", 3, "abc") = ["2-bar", "1-foo"]

Changes Feed

* FDB supports a concept called a versionstamp: a 10-byte, unique, monotonically (but not sequentially) increasing value for each committed transaction. The first 8 bytes are the committed version of the database. The last 2 bytes are monotonic in the serialization order of transactions.
* A transaction can specify a particular index into a key where the following 10 bytes will be overwritten by the versionstamp at commit time
* A subspace keyed on versionstamp naturally yields a _changes feed

by_seq subspace
  ("versionstamp1") = ("foo", "1-abc")
  ("versionstamp4") = ("bar", "4-def")

by_id subspace
  ("bar", "_vsn") = "versionstamp4"
  ...
  ("foo", "_vsn") = "versionstamp1"

JSON Indexes

* "Mango" JSON indexes are defined by
** a list of field names, each of which may be nested,
** an optional partial_filter_selector which constrains the set of docs that contribute
** an optional name defined by the ddoc field (the name is auto-generated if not supplied)
* Store index definitions in a single subspace to aid query planning
** ((person,name), title, email) = ("name-title-email", "{"student": true}")
** Store the values for each index in a dedicated subspace, adding the document ID as the last element in the tuple
*** ("rosie revere", "engineer", "rosie@example.com", "foo") = null

B.

--
Robert Samuel Newson
rnewson@apache.org

On Mon, 4 Feb 2019, at 19:13, Ilya Khlopotov wrote:
>
> I want to fix previous mistakes.
> I made two mistakes in the previous calculations:
> - I used 1KB as the base size for calculating the expansion factor (although we don't know the exact size of the original document)
> - The expansion factor calculation included the number of revisions (it shouldn't)
>
> I'll focus on the flattened JSON docs model.
>
> The following formula was used in the previous calculation:
> storage_size_per_document = mapping_table_size*number_of_revisions + depth*number_of_paths*number_of_revisions + number_of_paths*value_size*number_of_revisions
>
> To clarify things a little bit, I want to calculate the space requirement for a single revision this time.
> mapping_table_size = number_of_field_names*(field_name_length + 4 (integer size)) = 100 * (20 + 4) = 2400 bytes
> storage_size_per_document_per_revision_per_replica = mapping_table_size + depth*number_of_paths + value_size*number_of_paths =
> 2400 bytes + 10*1000 + 1000*100 = 112400 bytes ~= 110 KB
>
> We can definitely reduce the requirement for the mapping table by adopting rnewson's idea of a schema.
>
> On 2019/02/04 11:08:16, Ilya Khlopotov wrote:
> > Hi Michael,
> >
> > > For example, here's a crazy thought:
> > > Map every distinct occurrence of a key/value instance through a crypto hash
> > > function to get a set of hashes.
> > >
> > > These can be precomputed by Couch without any lookups in FDB. These
> > > will be spread all over kingdom come in FDB and not lend themselves to
> > > range search well.
> > >
> > > So what you do is index them for frequency of occurring in the same set.
> > > In essence, you 'bucket them' statistically, and that bucket id becomes a
> > > key prefix. A crypto hash value can be copied into more than one bucket.
> > > The {bucket_id}/{cryptohash} becomes a {val_id}.
> >
> > > When writing a document, Couch submits the list/array of cryptohash values
> > > it computed to FDB and gets back the corresponding {val_id}s (the ids with
> > > the bucket prefixed). This can get somewhat expensive if there are always a
> > > lot of app-local cache misses.
> > >
> > > A document's value is then a series of {val_id} arrays up to 100k per
> > > segment.
> > >
> > > When retrieving a document, you get the val_ids, find the distinct buckets
> > > and min/max entries for this doc, and then parallel query each bucket while
> > > reconstructing the document.
> >
> > Interesting idea. Let's try to think it through to see if we can make it viable.
> > Let's go through a hypothetical example. Input data for the example:
> > - 1M documents
> > - each document is around 10KB
> > - each document consists of 1K unique JSON paths
> > - each document has 100 unique JSON field names
> > - every scalar value is 100 bytes
> > - 10% of the unique JSON paths for every document are already stored in the database under a different doc or a different revision of the current one
> > - we assume 3 independent copies of every key-value pair in FDB
> > - our hash key size is 32 bytes
> > - let's assume we can determine whether a key is already in storage without doing a query
> > - 1% of paths are in cache (an unrealistic value; in real life the percentage is lower)
> > - every JSON field name is 20 bytes
> > - every JSON path is 10 levels deep
> > - the document key prefix length is 50
> > - every document has 10 revisions
> > Let's estimate the storage requirements and the size of the data we need to transmit. The calculations are not exact.
> > 1. storage_size_per_document (we cannot estimate exact numbers, since we don't know how FDB stores it)
> >    - 10 revisions * ((10KB - (10KB * 10%)) + (1K - (1K * 10%)) * 32 bytes) * 3 replicas = 38KB * 10 * 3 = 1140 KB (11x)
> > 2. number of independent keys to retrieve on document read (non-range queries) per document
> >    - 1K - (1K * 1%) = 990
> > 3. number of range queries: 0
> > 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102 KB (10x)
> > 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html)
> >    - sequential: 990*2ms = 1980ms
> >    - range: 0
> > Let's compare these numbers with the initial proposal (flattened JSON docs without a global schema and without a cache):
> > 1. storage_size_per_document
> >    - mapping table size: 100 * (20 + 4 (integer size)) = 2400 bytes
> >    - key size: (10 * (4 + 1 (delimiter))) + 50 = 100 bytes
> >    - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976 KB * 3 = 5930 KB (59.3x)
> > 2. number of independent keys to retrieve: 0-2 (depending on index structure)
> > 3. number of range queries: 1 (1001 keys in the result)
> > 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 KB (2.4x)
> > 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html and estimate range read performance based on numbers from https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
> >    - range read performance: given that read performance is about 305,000 reads/second and range performance is 3,600,000 keys/second, we estimate range performance to be 11.8x read performance. If a read takes 2ms, then a range read takes 0.169ms (which is hard to believe).
> >    - sequential: 2 * 2 = 4ms
> >    - range: 0.169ms
> >
> > It looks like we are dealing with a tradeoff:
> > - map every distinct occurrence of a key/value instance through a crypto hash:
> >   - 5.39x more disk space efficient
> >   - 474x slower
> > - flattened JSON model:
> >   - 5.39x less efficient in disk space
> >   - 474x faster
> >
> > In any case, this unscientific exercise was very helpful, since it uncovered the high cost in terms of disk space. 59.3x the original disk size is too much IMO.
> >
> > Are there any ways we can make Michael's model more performant?
> >
> > Also, I don't quite understand a few aspects of the global hash table proposal:
> >
> > 1. > - Map every distinct occurrence of a key/value instance through a crypto hash function to get a set of hashes.
> > I think we are talking only about scalar values here, i.e. `"#/foo.bar.baz": 123`,
> > since I don't know how we can make it work for all possible JSON paths of `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
> > - foo
> > - foo.bar
> > - foo.bar.baz
> >
> > 2. how to delete documents
> >
> > Best regards,
> > ILYA
> >
> >
> > On 2019/01/30 23:33:22, Michael Fair wrote:
> > > On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski wrote:
> > >
> > > > Hi Michael,
> > > >
> > > > > The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
> > > >
> > > > Yes, that's definitely one way to address storage of edit conflicts. I
> > > > think there are other, more compact representations that we can explore if
> > > > we have this "exploded" data model where each scalar value maps to an
> > > > individual KV pair.
> > >
> > > I agree; as I mentioned on the original thread, I see a scheme that
> > > handles both conflicts and revisions, where you only have to store the most
> > > recent change to a field. Like you suggested, multiple revisions can share
> > > a key. Which in my mind's eye further begs the conflicts/revisions
> > > discussion along with the working-within-the-limits discussion, because it
> > > seems to me they are all intrinsically related as a "feature".
> > >
> > > Saying 'we'll break documents up into roughly 80k segments', then trying to
> > > overlay some kind of field sharing scheme for revisions/conflicts, doesn't
> > > seem like it will work.
> > >
> > > I probably should have left out the trivial fix proposal, as I don't think
> > > it's a feasible solution to actually use.
> > >
> > > The comment is more regarding that I do not see how this thread can escape
> > > including how to store/retrieve conflicts/revisions.
> > >
> > > For instance, the 'doc as individual fields' proposal lends itself to value
> > > sharing across multiple documents (and I don't just mean revisions of the
> > > same doc, I mean the same key/value instance could be shared for every
> > > document).
> > > However, that's not really relevant if we're not considering the amount of
> > > shared information across documents in the storage scheme.
> > >
> > > Simply storing documents in <100k segments (perhaps in some kind of
> > > compressed binary representation) to deal with that FDB limit seems fine.
> > > The only reason to consider doing something else is because of its impact
> > > on indexing, searches, reduce functions, revisions, on-disk size,
> > > etc.
> > >
> > >
> > > > > I'm assuming the process will flatten the key paths of the document into
> > > > > an array and then request the value of each key as multiple parallel
> > > > > queries against FDB at once
> > > >
> > > > Ah, I think this is not one of Ilya's assumptions. He's trying to design a
> > > > model which allows the retrieval of a document with a single range read,
> > > > which is a good goal in my opinion.
> > > >
> > >
> > > I am not sure I agree.
> > >
> > > Think of BitTorrent: a single range read should pull back the structure of
> > > the document (the pieces to fetch), but not necessarily the whole document.
> > >
> > > What if you already have a bunch of pieces in common with other documents
> > > locally (a repeated header/footer or type, for example), and you only need
> > > to get a few pieces of data you don't already have?
> > >
> > > The real goal of Couch as I see it is to treat your document set like the
> > > collection of structured information that it is: in some respects like an
> > > extension of your application's heap space for structured objects, with
> > > efficient querying of that collection to get back subsets of the data.
> > >
> > > Otherwise it seems more like a slightly upgraded file system plus a fancy
> > > grep/find-like feature...
> > >
> > > The best way I see to unlock more features/power is to move towards a
> > > more granular and efficient way to store and retrieve the scalar values...
> > >
> > >
> > > For example, here's a crazy thought:
> > > Map every distinct occurrence of a key/value instance through a crypto hash
> > > function to get a set of hashes.
> > >
> > > These can be precomputed by Couch without any lookups in FDB.
> > > These will be spread all over kingdom come in FDB and will not lend
> > > themselves to range search well.
> > >
> > > So what you do is index them for frequency of occurring in the same set.
> > > In essence, you 'bucket them' statistically, and that bucket id becomes a
> > > key prefix. A crypto hash value can be copied into more than one bucket.
> > > The {bucket_id}/{cryptohash} becomes a {val_id}.
> > >
> > > When writing a document, Couch submits the list/array of cryptohash values
> > > it computed to FDB and gets back the corresponding {val_id}s (the ids with
> > > the bucket prefixed). This can get somewhat expensive if there are always a
> > > lot of app-local cache misses.
> > >
> > > A document's value is then a series of {val_id} arrays up to 100k per
> > > segment.
> > >
> > > When retrieving a document, you get the val_ids, find the distinct buckets
> > > and min/max entries for this doc, and then parallel query each bucket while
> > > reconstructing the document.
> > >
> > > The values returned from the bucket queries are the key/value strings
> > > required to reassemble this document.
> > >
> > > ----------
> > > I put this forward primarily to highlight the idea that trying to match the
> > > storage representation of documents in a straightforward way to FDB keys,
> > > to reduce query count, might not be the most performance-oriented approach.
> > >
> > > I'd much prefer a storage approach that reduced data duplication and
> > > enabled fast sub-document queries.
> > >
> > > This clearly falls in the realm of what people want the 'use case' of Couch
> > > to be/become. By giving Couch more access to sub-document queries, I could
> > > eventually see queries as complicated as GraphQL submitted to Couch and
> > > pulling back ad-hoc aggregated data across multiple documents in a single
> > > application-layer request.
> > >
> > > Hehe - one way to look at the database of Couch documents is that they are
> > > all conflict revisions of the single root empty document. What I mean by
> > > this is to consider thinking of the entire document store as one giant DAG
> > > of key/value pairs, and how even separate documents are still typically
> > > related to each other. For most applications there is a tremendous amount
> > > of data redundancy between docs, and especially between revisions of those
> > > docs...
> > >
> > > And all this is a long way of saying "I think there could be a lot of value
> > > in assuming documents are 'assembled' from multiple queries to FDB, with
> > > local caching, instead of simply retrieved".
> > >
> > > Thanks, I hope I'm not the only outlier here thinking this way!?
> > >
> > > Mike :-)
> > >
> >
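For what it's worth, here is a rough plain-Python sketch of the cryptohash/bucket idea Michael describes above. The bucketing is only a placeholder (a real implementation would assign buckets by co-occurrence statistics, as he suggests), and the FDB round trip that resolves and stores the val_ids is not shown.

    import hashlib
    import json

    def leaves(value, path=()):
        # Yield (path, scalar) pairs for every leaf of a JSON document.
        if isinstance(value, dict):
            for k, v in value.items():
                yield from leaves(v, path + (k,))
        elif isinstance(value, list):
            for i, v in enumerate(value):
                yield from leaves(v, path + (i,))
        else:
            yield path, value

    def leaf_hash(path, value):
        # Hash the path together with its scalar value, so the same
        # key/value instance hashes identically across documents.
        blob = json.dumps([list(path), value], sort_keys=True).encode()
        return hashlib.sha256(blob).digest()

    def bucket_for(digest, n_buckets=256):
        # Placeholder bucketing: one byte of the hash. The proposal would
        # instead pick buckets by how often values occur in the same set.
        return digest[0] % n_buckets

    def val_ids(doc):
        # The per-document list of {bucket_id}/{cryptohash} ids that would
        # be submitted to FDB and stored as the document's value segments.
        return [(bucket_for(h), h)
                for h in (leaf_hash(p, v) for p, v in leaves(doc))]

    print(val_ids({"owner": "bob", "active": True}))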