couchdb-dev mailing list archives

From Ilya Khlopotov <iil...@apache.org>
Subject Re: [DISCUSS] : things we need to solve/decide : storing JSON documents
Date Mon, 04 Feb 2019 19:13:39 GMT

I want to fix previous mistakes. I made two mistakes in the previous calculations:
- I used 1Kb as the base size for calculating the expansion factor (although we don't
  know the exact size of the original document)
- The expansion factor calculation included the number of revisions (it shouldn't)

I'll focus on the flattened JSON docs model.

The following formula was used in the previous calculation:
storage_size_per_document = mapping_table_size*number_of_revisions + depth*number_of_paths*number_of_revisions + number_of_paths*value_size*number_of_revisions

To clarify things a little bit, I want to calculate the space requirement for a single revision this time.
mapping_table_size = number_of_field_names*(field_name_length + 4 (integer size)) = 100 * (20 + 4) = 2400 bytes
storage_size_per_document_per_revision_per_replica = mapping_table_size + depth*number_of_paths + value_size*number_of_paths =
2400 bytes + 10*1000 + 1000*100 = 112400 bytes ~= 110 Kb
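
For anyone who wants to check the arithmetic, here is a minimal sketch in Python (the
variable names are mine, and the parameters are the assumed values from the example
below, not measurements):

# Back-of-the-envelope check for the flattened JSON model
# (single revision, single replica).
number_of_field_names = 100   # unique JSON field names per document
field_name_length = 20        # bytes per field name
integer_size = 4              # bytes for the integer id in the mapping table
number_of_paths = 1000        # unique JSON paths per document
depth = 10                    # levels per path (1 byte per level in the key)
value_size = 100              # bytes per scalar value

mapping_table_size = number_of_field_names * (field_name_length + integer_size)
per_revision = mapping_table_size + depth * number_of_paths + value_size * number_of_paths

print(mapping_table_size)  # 2400
print(per_revision)        # 112400 bytes, i.e. roughly 110 Kb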

We can definitely reduce the mapping table requirement by adopting rnewson's idea of a schema; a rough sketch of that idea follows.
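
To make that concrete, here is a minimal sketch of one possible interpretation (my own
illustration, not rnewson's actual proposal: it assumes the schema is simply a
database-wide field-name-to-integer table shared by all documents, so the mapping table
is stored once rather than per document and revision):

# Hypothetical illustration of a shared schema: field names map to small
# integer ids once per database, and flattened paths reference the ids.
schema = {}    # field name -> integer id (assumed to be stored once, globally)
reverse = {}   # integer id -> field name, for reassembling documents

def field_id(name):
    """Return a stable integer id for a field name, assigning one if it is new."""
    if name not in schema:
        idx = len(schema)
        schema[name] = idx
        reverse[idx] = name
    return schema[name]

def flatten(doc, prefix=()):
    """Flatten a JSON document into (path of ids, scalar value) pairs."""
    for key, value in doc.items():
        path = prefix + (field_id(key),)
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value

# {"foo": {"bar": {"baz": 123}}} becomes [((0, 1, 2), 123)]
print(list(flatten({"foo": {"bar": {"baz": 123}}})))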

On 2019/02/04 11:08:16, Ilya Khlopotov <iilyak@apache.org> wrote: 
> Hi Michael,
> 
> > For example, here's a crazy thought:
> > Map every distinct occurrence of a key/value instance through a crypto hash
> > function to get a set of hashes.
> >
> > These can be precomputed by Couch without any lookups in FDB.  These
> > will be spread all over kingdom come in FDB and not lend themselves to
> > range search well.
> > 
> > So what you do is index them for frequency of occurring in the same set.
> > In essence, you 'bucket them' statistically, and that bucket id becomes a
> > key prefix. A crypto hash value can be copied into more than one bucket.
> > The {bucket_id}/{cryptohash} becomes a {val_id}
> 
> > When writing a document, Couch submits the list/array of cryptohash values
> > it computed to FDB and gets back the corresponding  {val_id} (the id with
> > the bucket prefixed).  This can get somewhat expensive if there's always a
> > lot of app local cache misses.
> >
> > A document's value is then a series of {val_id} arrays up to 100k per
> > segment.
> > 
> > When retrieving a document, you get the val_ids, find the distinct buckets
> > and min/max entries for this doc, and then parallel query each bucket while
> > reconstructing the document.
> 
> Interesting idea. Let's try to think it through to see if we can make it viable. 
> Let's go through a hypothetical example. Input data for the example:
> - 1M of documents
> - each document is around 10Kb
> - each document consists of 1K unique JSON paths
> - each document has 100 unique JSON field names
> - every scalar value is 100 bytes
> - 10% of the unique JSON paths for every document are already stored in the database
>   under a different doc or a different revision of the current one
> - we assume 3 independent copies for every key-value pair in FDB
> - our hash key size is 32 bytes
> - let's assume we can determine whether a key is already in storage without doing a query
> - 1% of paths are in cache (an unrealistic value; in real life the percentage is lower)
> - every JSON field name is 20 bytes
> - every JSON path is 10 levels deep
> - document key prefix length is 50
> - every document has 10 revisions
> Let's estimate the storage requirements and the size of the data we need to transmit. The
> calculations are not exact.
> 1. storage_size_per_document (we cannot estimate exact numbers since we don't know how
>    FDB stores it)
>   - 10 * ((10Kb - (10Kb * 10%)) + (1K - (1K * 10%)) * 32 bytes) = 38Kb * 10 * 3 = 1140 Kb (11x)
> 2. number of independent keys to retrieve on document read (non-range queries) per document
>   - 1K - (1K * 1%) = 990
> 3. number of range queries: 0
> 4. data to transmit on read: (1K - (1K * 1%)) * (100 bytes + 32 bytes) = 102 Kb (10x)

> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html)
>     - sequential: 990*2ms = 1980ms 
>     - range: 0
> Let's compare these numbers with the initial proposal (flattened JSON docs without a global
> schema and without a cache)
> 1. storage_size_per_document
>   - mapping table size: 100 * (20 + 4(integer size)) = 2400 bytes
>   - key size: (10 * (4 + 1(delimiter))) + 50 = 100 bytes 
>   - storage_size_per_document: 2.4K*10 + 100*1K*10 + 1K*100*10 = 2024K = 1976 Kb * 3 = 5930 Kb (59.3x)
> 2. number of independent keys to retrieve: 0-2 (depending on index structure)
> 3. number of range queries: 1 (1001 of keys in result)
> 4. data to transmit on read: 24K + 1000*100 + 1000*100 = 23.6 Kb (2.4x)  
> 5. read latency (we use 2ms per read based on numbers from https://apple.github.io/foundationdb/performance.html
>    and estimate range read performance based on numbers from https://apple.github.io/foundationdb/benchmarking.html#single-core-read-test)
>   - range read performance: given that read performance is about 305,000 reads/second and
>     range performance is 3,600,000 keys/second, we estimate range performance to be 11.8x
>     that of read performance. If read performance is 2ms then range performance is 0.169ms
>     (which is hard to believe).
>   - sequential: 2 * 2 = 4ms
>   - range: 0.169
> 
> It looks like we are dealing with a tradeoff:
> - Map every distinct occurrence of a key/value instance through a crypto hash:
>   - 5.39x more disk space efficient
>   - 474x slower
> - flattened JSON model
>   - 5.39x less efficient in disk space
>   - 474x faster
> 
> In any case, this unscientific exercise was very helpful, since it uncovered the high
> cost in terms of disk space. 59.3x the original disk size is too much IMO.
> 
> Are there any ways we can make Michael's model more performant?
> 
> Also, I don't quite understand a few aspects of the global hash table proposal:
> 
> 1. > - Map every distinct occurrence of a key/value instance through a crypto hash
>        function to get a set of hashes.
> I think we are talking only about scalar values here? I.e. `"#/foo.bar.baz": 123`,
> since I don't know how we can make it work for all possible JSON paths of
> `{"foo": {"bar": {"size": 12, "baz": 123}}}`:
> - foo
> - foo.bar
> - foo.bar.baz
> 
> 2. how to delete documents
> 
> Best regards,
> ILYA
> 
> 
> On 2019/01/30 23:33:22, Michael Fair <michael@daclubhouse.net> wrote: 
> > On Wed, Jan 30, 2019, 12:57 PM Adam Kocoloski <kocolosk@apache.org> wrote:
> > 
> > > Hi Michael,
> > >
> > > > The trivial fix is to use DOCID/REVISIONID as DOC_KEY.
> > >
> > > Yes that’s definitely one way to address storage of edit conflicts. I
> > > think there are other, more compact representations that we can explore if
> > > we have this “exploded” data model where each scalar value maps to an
> > > individual KV pair.
> > 
> > 
> > I agree. As I mentioned on the original thread, I see a scheme, one that
> > handles both conflicts and revisions, where you only have to store the most
> > recent change to a field.  Like you suggested, multiple revisions can share
> > a key.  In my mind's eye that further invites the conflicts/revisions
> > discussion along with the working-within-the-limits discussion, because it
> > seems to me they are all intrinsically related as a "feature".
> > 
> > Saying 'We'll break documents up into roughly 80k segments', then trying to
> > overlay some kind of field sharing scheme for revisions/conflicts doesn't
> > seem like it will work.
> > 
> > I probably should have left out the trivial fix proposal as I don't think
> > it's a feasible solution to actually use.
> > 
> > The comment is more regarding that I do not see how this thread can escape
> > including how to store/retrieve conflicts/revisions.
> > 
> > For instance, the 'doc as individual fields' proposal lends itself to value
> > sharing across multiple documents (and I don't just mean revisions of the
> > same doc, I mean the same key/value instance could be shared for every
> > document).
> > However that's not really relevant if we're not considering the amount of
> > shared information across documents in the storage scheme.
> > 
> > Simply storing documents in <100k segments (perhaps in some kind of
> > compressed binary representation) to deal with that FDB limit seems fine.
> > The only reason to consider doing something else is because of its impact
> > to indexing, searches, reduce functions, revisions, on-disk size impact,
> > etc.
> > 
> > 
> > 
> > > > I'm assuming the process will flatten the key paths of the document into
> > > an array and then request the value of each key as multiple parallel
> > > queries against FDB at once
> > >
> > > Ah, I think this is not one of Ilya’s assumptions. He’s trying to design a
> > > model which allows the retrieval of a document with a single range read,
> > > which is a good goal in my opinion.
> > >
> > 
> > I am not sure I agree.
> > 
> > Think of BitTorrent: a single range read should pull back the structure of
> > the document (the pieces to fetch), but not necessarily the whole document.
> > 
> > What if you already have a bunch of pieces in common with other documents
> > locally (a repeated header/footer/ or type for example); and you only need
> > to get a few pieces of data you don't already have?
> > 
> > The real goal for Couch, as I see it, is to treat your document set like the
> > collection of structured information that it is.  In some respects like an
> > extension of your application's heap space for structured objects and
> > efficiently querying that collection to get back subsets of the data.
> > 
> > Otherwise it seems more like a slightly upgraded file system plus a fancy
> > grep/find like feature...
> > 
> > The best way I see to unlock more features/power is to a move towards a
> > more granular and efficient way to store and retrieve the scalar values...
> > 
> > 
> > 
> > For example, here's a crazy thought:
> > Map every distinct occurrence of a key/value instance through a crypto hash
> > function to get a set of hashes.
> > 
> > These can be precomputed by Couch without any lookups in FDB.  These
> > will be spread all over kingdom come in FDB and not lend themselves to
> > range search well.
> > 
> > So what you do is index them for frequency of occurring in the same set.
> > In essence, you 'bucket them' statistically, and that bucket id becomes a
> > key prefix. A crypto hash value can be copied into more than one bucket.
> > The {bucket_id}/{cryptohash} becomes a {val_id}
> > 
> > When writing a document, Couch submits the list/array of cryptohash values
> > it computed to FDB and gets back the corresponding  {val_id} (the id with
> > the bucket prefixed).  This can get somewhat expensive if there's always a
> > lot of app local cache misses.
> > 
> > 
> > A document's value is then a series of {val_id} arrays up to 100k per
> > segment.
> > 
> > When retrieving a document, you get the val_ids, find the distinct buckets
> > and min/max entries for this doc, and then parallel query each bucket while
> > reconstructing the document.
> > 
> > The values returned from the buckets query are the key/value strings
> > required to reassemble this document.
> > 
> > 
> > ----------
> > I put this forward primarily to highlight the idea that trying to match the
> > storage representation of documents in a straightforward way to FDB keys
> > to reduce query count might not be the most performance-oriented approach.
> > 
> > I'd much prefer a storage approach that reduced data duplication and
> > enabled fast sub-document queries.
> > 
> > 
> > This clearly falls in the realm of what people want the 'use case' of Couch
> > to be/become.  By giving Couch more access to sub-document queries, I could
> > eventually see queries as complicated as GraphQL submitted to Couch and
> > pulling back ad-hoc aggregated data across multiple documents in a single
> > application layer request.
> > 
> > Hehe - one way to look at the database of Couch documents is that they are
> > all conflict revisions of the single root empty document.   What I mean by
> > this is: consider thinking of the entire document store as one giant DAG of
> > key/value pairs, and of how even separate documents are still typically related to
> > each other.  For most applications there is a tremendous amount of data
> > redundancy between docs, and especially between revisions of those docs...
> > 
> > 
> > 
> > And all this is a long way of saying "I think there could be a lot of value
> > in assuming documents are 'assembled' from multiple queries to FDB, with
> > local caching, instead of simply retrieved"
> > 
> > Thanks, I hope I'm not the only outlier here thinking this way!?
> > 
> > Mike :-)
> > 
> 
