couchdb-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robert Newson <rnew...@apache.org>
Subject Re: [DISCUSS] : things we need to solve/decide : storing JSON documents
Date Mon, 04 Feb 2019 18:40:43 GMT
I think we're deep in the weeds on this small aspect of the data model problem, and haven't
touched other aspects yet. The numbers used in your example (1k of paths, 100 unique field
names, 100 bytes for a value), where are they from? If they are not from some empirical data
source, I don't see any reason to dwell on anything we might infer from them.

I think we should focus on the simplest model that also 'works' (i.e, delivers all essential
properties) and then prototype so we can see how efficient it is.

I am happy to sacrifice some degree of efficiency for a comprehensible mapping of documents
to key-value pairs and we have at least three techniques to address long keys so far. We also
know there are other approaches to this problem if necessary that have a much smaller storage
overhead (adjacent rows of 100k chunks of the couchdb document treated as a blob). 

-- 
  Robert Samuel Newson
  rnewson@apache.org

On Mon, 4 Feb 2019, at 18:29, Ilya Khlopotov wrote:
> At some point I changed the number of unique JSON paths and probably 
> forgot to update other conditions.
> The ` - each document is around 10Kb` is not used in the calculations so 
> can be ignored.
> 
> On 2019/02/04 17:46:20, Adam Kocoloski <kocolosk@apache.org> wrote: 
> > Ugh! We definitely cannot have a model where a 10K JSON document is exploded into
2MB worth of KV data. I’ve tried several times to follow the math here but I’m failing.
I can’t even get past this first bit:
> > 
> > > - each document is around 10Kb
> > > - each document consists of 1K of unique JSON paths 
> > > - each document has 100 unique JSON field names
> > > - every scalar value is 100 bytes
> > 
> > If each document has 1000 paths, and each path (which leads to a unique scalar value,
right?) has a value of 100 bytes associated with it … how is the document 10KB? Wouldn’t
it need to be at least 100KB just by adding up all the scalar values?
> > 
> > Adam
> > 
> > > On Feb 4, 2019, at 6:08 AM, Ilya Khlopotov <iilyak@apache.org> wrote:
> > > 
> > > Hi Michael,
> > > 
> > >> For example, hears a crazy thought:
> > >> Map every distinct occurence of a key/value instance through a crypto hash
> > >> function to get a set of hashes.
> > >> 
> > >> These can be be precomputed by Couch without any lookups in FDB.  These
> > >> will be spread all over kingdom come in FDB and not lend themselves to
> > >> range search well.
> > >> 
> > >> So what you do is index them for frequency of occurring in the same set.
> > >> In essence, you 'bucket them' statistically, and that bucket id becomes
a
> > >> key prefix. A crypto hash value can be copied into more than one bucket.
> > >> The {bucket_id}/{cryptohash} becomes a {val_id}
> > > 
> > >> When writing a document, Couch submits the list/array of cryptohash values
> > >> it computed to FDB and gets back the corresponding  {val_id} (the id with
> > >> the bucket prefixed).  This can get somewhat expensive if there's always
a
> > >> lot of app local cache misses.
> > >> 
> > >> A document's value is then a series of {val_id} arrays up to 100k per
> > >> segment.
> > >> 
> > >> When retrieving a document, you get the val_ids, find the distinct buckets
> > >> and min/max entries for this doc, and then parallel query each bucket while
> > >> reconstructing the document.
> > > 
> > > Interesting idea. Let's try to think it through to see if we can make it viable.

> > > Let's go through hypothetical example. Input data for the example:
> > > - 1M of documents
> > > - each document is around 10Kb
> > > - each document consists of 1K of unique JSON paths 
> > > - each document has 100 unique JSON field names
> > > - every scalar value is 100 bytes
> > > - 10% of unique JSON paths for every document already stored in database under
different doc or different revision of the current one
> > > - we assume 3 independent copies for every key-value pair in FDB
> > > - our hash key size is 32 bytes
> > > - let's assume we can determine if key is already on the storage without doing
query
> > > - 1% of paths is in cache (unrealistic value, in real live the percentage is
lower)
> > > - every JSON field name is 20 bytes
> > > - every JSON path is 10 levels deep
> > > - document key prefix length is 50
> > > - every document has 10 revisions
> > > Let's estimate the storage requirements and size of data we need to transmit.
The calculations are not exact.
> > > 1. storage_size_per_document (we cannot estimate exact numbers since we don't
know how FDB stores it)
> > >  - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10
* 3 = 1140 Kb (11x)
> > > 2. number of independent keys to retrieve on document read (non-range queries)
per document
> > >  - 1K - (1K * 1%) = 990
> > > 3. number of range queries: 0
> > > 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102
Kb (10x) 
> > > 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html)
> > >    - sequential: 990*2ms = 1980ms 
> > >    - range: 0
> > > Let's compare these numbers with initial proposal (flattened JSON docs without
global schema and without cache)
> > > 1. storage_size_per_document
> > >  - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
> > >  - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes 
> > >  - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976
Kb * 3 = 5930 Kb (59.3x)
> > > 2. number of independent keys to retrieve: 0-2 (depending on index structure)
> > > 3. number of range queries: 1 (1001 of keys in result)
> > > 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)  
> > > 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html
and estimate range read performance based on numbers from https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
> > >  - range read performance: Given read performance is about 305,000 reads/second
and range performance 3,600,000 keys/second we estimate range performance to be 11.8x compared
to read performance. If read performance is 2ms than range performance is 0.169ms (which is
hard to believe).
> > >  - sequential: 2 * 2 = 4ms
> > >  - range: 0.169
> > > 
> > > It looks like we are dealing with a tradeoff:
> > > - Map every distinct occurrence of a key/value instance through a crypto hash:
> > >  - 5.39x more disk space efficient
> > >  - 474x slower
> > > - flattened JSON model
> > >  - 5.39x less efficient in disk space
> > >  - 474x faster
> > > 
> > > In any case this unscientific exercise was very helpful. Since it uncovered
the high cost in terms of disk space. 59.3x of original disk size is too much IMO. 
> > > 
> > > Are the any ways we can make Michael's model more performant?
> > > 
> > > Also I don't quite understand few aspects of the global hash table proposal:
> > > 
> > > 1. > - Map every distinct occurence of a key/value instance through a crypto
hash function to get a set of hashes.
> > > I think we are talking only about scalar values here? I.e. `"#/foo.bar.baz":
123`
> > > Since I don't know how we can make it work for all possible JSON paths `{"foo":
{"bar": {"size": 12, "baz": 123}}}":
> > > - foo
> > > - foo.bar
> > > - foo.bar.baz
> > > 
> > > 2. how to delete documents
> > > 
> > > Best regards,
> > > ILYA
> > > 
> > > 
> > > On 2019/01/30 23:33:22, Michael Fair <michael@daclubhouse.net> wrote:

> > >> On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocolosk@apache.org wrote:
> > >> 
> > >>> Hi Michael,
> > >>> 
> > >>>> The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
> > >>> 
> > >>> Yes that’s definitely one way to address storage of edit conflicts.
I
> > >>> think there are other, more compact representations that we can explore
if
> > >>> we have this “exploded” data model where each scalar value maps
to an
> > >>> individual KV pair.
> > >> 
> > >> 
> > >> I agree, as I mentioned on the original thread, I see a scheme, that
> > >> handles both conflicts and revisions, where you only have to store the
most
> > >> recent change to a field.  Like you suggested, multiple revisions can share
> > >> a key.  Which in my mind's eye further begs the conflicts/revisions
> > >> discussion along with the working within the limits discussion because
it
> > >> seems to me they are all intrinsically related as a "feature".
> > >> 
> > >> Saying 'We'll break documents up into roughly 80k segments', then trying
to
> > >> overlay some kind of field sharing scheme for revisions/conflicts doesn't
> > >> seem like it will work.
> > >> 
> > >> I probably should have left out the trivial fix proposal as I don't think
> > >> it's a feasible solution to actually use.
> > >> 
> > >> The comment is more regarding that I do not see how this thread can escape
> > >> including how to store/retrieve conflicts/revisions.
> > >> 
> > >> For instance, the 'doc as individual fields' proposal lends itself to value
> > >> sharing across mutiple documents (and I don't just mean revisions of the
> > >> same doc, I mean the same key/value instance could be shared for every
> > >> document).
> > >> However that's not really relevant if we're not considering the amount
of
> > >> shared information across documents in the storage scheme.
> > >> 
> > >> Simply storing documents in <100k segments (perhaps in some kind of
> > >> compressed binary representation) to deal with that FDB limit seems fine.
> > >> The only reason to consider doing something else is because of its impact
> > >> to indexing, searches, reduce functions, revisions, on-disk size impact,
> > >> etc.
> > >> 
> > >> 
> > >> 
> > >>>> I'm assuming the process will flatten the key paths of the document
into
> > >>> an array and then request the value of each key as multiple parallel
> > >>> queries against FDB at once
> > >>> 
> > >>> Ah, I think this is not one of Ilya’s assumptions. He’s trying
to design a
> > >>> model which allows the retrieval of a document with a single range
read,
> > >>> which is a good goal in my opinion.
> > >>> 
> > >> 
> > >> I am not sure I agree.
> > >> 
> > >> Think of bitTorrent, a single range read should pull back the structure
of
> > >> the document (the pieces to fetch), but not necessarily the whole document.
> > >> 
> > >> What if you already have a bunch of pieces in common with other documents
> > >> locally (a repeated header/footer/ or type for example); and you only need
> > >> to get a few pieces of data you don't already have?
> > >> 
> > >> The real goal to Couch I see is to treat your document set like the
> > >> collection of structured information that it is.  In some respects like
an
> > >> extension of your application's heap space for structured objects and
> > >> efficiently querying that collection to get back subsets of the data.
> > >> 
> > >> Otherwise it seems more like a slightly upgraded file system plus a fancy
> > >> grep/find like feature...
> > >> 
> > >> The best way I see to unlock more features/power is to a move towards a
> > >> more granular and efficient way to store and retrieve the scalar values...
> > >> 
> > >> 
> > >> 
> > >> For example, hears a crazy thought:
> > >> Map every distinct occurence of a key/value instance through a crypto hash
> > >> function to get a set of hashes.
> > >> 
> > >> These can be be precomputed by Couch without any lookups in FDB.  These
> > >> will be spread all over kingdom come in FDB and not lend themselves to
> > >> range search well.
> > >> 
> > >> So what you do is index them for frequency of occurring in the same set.
> > >> In essence, you 'bucket them' statistically, and that bucket id becomes
a
> > >> key prefix. A crypto hash value can be copied into more than one bucket.
> > >> The {bucket_id}/{cryptohash} becomes a {val_id}
> > >> 
> > >> When writing a document, Couch submits the list/array of cryptohash values
> > >> it computed to FDB and gets back the corresponding  {val_id} (the id with
> > >> the bucket prefixed).  This can get somewhat expensive if there's always
a
> > >> lot of app local cache misses.
> > >> 
> > >> 
> > >> A document's value is then a series of {val_id} arrays up to 100k per
> > >> segment.
> > >> 
> > >> When retrieving a document, you get the val_ids, find the distinct buckets
> > >> and min/max entries for this doc, and then parallel query each bucket while
> > >> reconstructing the document.
> > >> 
> > >> The values returned from the buckets query are the key/value strings
> > >> required to reassemble this document.
> > >> 
> > >> 
> > >> ----------
> > >> I put this forward primarily to hilite the idea that trying to match the
> > >> storage representation of documents in a straight forward way to FDB keys
> > >> to reduce query count might not be the most performance oriented approach.
> > >> 
> > >> I'd much prefer a storage approach that reduced data duplication and
> > >> enabled fast sub-document queries.
> > >> 
> > >> 
> > >> This clearly falls in the realm of what people want the 'use case' of Couch
> > >> to be/become.  By giving Couch more access to sub-document queries, I could
> > >> eventually see queries as complicated as GraphQL submitted to Couch and
> > >> pulling back ad-hoc aggregated data across multiple documents in a single
> > >> application layer request.
> > >> 
> > >> Hehe - one way to look at the database of Couch documents is that they
are
> > >> all conflict revisions of the single root empty document.   What I mean
be
> > >> this is consider thinking of the entire document store as one giant DAG
of
> > >> key/value pairs. How even separate documents are still typically related
to
> > >> each other.  For most applications there is a tremendous amount of data
> > >> redundancy between docs and especially between revisions of those docs...
> > >> 
> > >> 
> > >> 
> > >> And all this is a long way of saying "I think there could be a lot of value
> > >> in assuming documents are 'assembled' from multiple queries to FDB, with
> > >> local caching, instead of simply retrieved"
> > >> 
> > >> Thanks, I hope I'm not the only outlier here thinking this way!?
> > >> 
> > >> Mike :-)
> > >> 
> > 
> > 

Mime
View raw message