couchdb-dev mailing list archives

From Riyad Kalla <rka...@gmail.com>
Subject Re: Universal Binary JSON in CouchDB
Date Tue, 04 Oct 2011 19:23:06 GMT
Hey Randall,

This is something that Paul and I discussed on IRC. The way UBJ is written
out looks something like this ([] blocks are just for readability):
[o][2]
  [s][4][name][s][3][bob]
  [s][3][age][i][31]
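
To make that layout concrete, here is a rough Python sketch (illustrative
helpers only, not a reference implementation) of how {"name": "bob", "age": 31}
could be serialized under the draft markers above; the 2-byte width I use for
[i] is just an assumption for illustration, check the spec for the exact sizes:

import struct

def ubj_small_string(s):
    # [s][1-byte length][UTF-8 bytes]; the 's' marker is limited to 254 bytes
    data = s.encode("utf-8")
    assert len(data) <= 254
    return b"s" + bytes([len(data)]) + data

def ubj_int(n):
    # [i][big-endian integer]; 2 bytes assumed here purely for illustration
    return b"i" + struct.pack(">h", n)

def ubj_small_object(pairs):
    # [o][1-byte member count] followed by [key][value] pairs
    out = b"o" + bytes([len(pairs)])
    for key, value in pairs:
        out += ubj_small_string(key) + value
    return out

doc = ubj_small_object([("name", ubj_small_string("bob")),
                        ("age", ubj_int(31))])
# doc == b"o\x02s\x04names\x03bobs\x03agei\x00\x1f" under these assumptions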

Couch can easily prepend or append its own dynamic content in a reply. If it
wants to insert some information right after the object header, though, the
header itself would need to be stored and manipulated by couch separately.

For example, if I upload the doc above, Couch would want to take that root
object header of:
[o][2]

and change it to:
[o][4]

before storing it because of the addition of _id and _rev. Actually, this
could be as simple as storing a "rootObjectCount" and having couch dynamically
generate the root header every time.

'o' represents object containers with <= 254 elements (1 byte for length)
and 'O' represents object containers with up to 2.1 billion elements (4 byte
int).
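
A minimal sketch of that "store a rootObjectCount, regenerate the header"
idea, reusing the assumptions from the sketch above ('o' + 1-byte count for
<= 254 members, 'O' + a 4-byte big-endian count for larger objects):

import struct

def object_header(count):
    # Pick the small or large object marker based on the member count.
    if count <= 254:
        return b"o" + bytes([count])
    return b"O" + struct.pack(">i", count)

# Doc was uploaded with 2 members; couch adds _id and _rev when serving it.
stored_root_count = 2
served_header = object_header(stored_root_count + 2)  # b"o\x04"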

If couch did that, any request coming into the server might look like this:
<- client request
-- (server loads root object count)
-> server writes back object header: [o][4]
-- (server calculates dynamic data)
-> server writes back dynamic content
-> server streams raw record data straight off disk to client (no
deserialization)
-- (OPT: server calculates dynamic data)
-> OPT: server streams dynamic data appended
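
In hand-wavy Python, that flow might look like the sketch below. It reuses
object_header() from the sketch above, and none of the other helpers
(load_root_count, encode_dynamic_members, open_raw_doc) are real CouchDB
APIs; they are placeholders marking where each step would happen:

def stream_doc(sock, docid):
    count = load_root_count(docid)            # server loads root object count
    dynamic = encode_dynamic_members(docid)   # e.g. _id/_rev, already UBJ bytes
    sock.sendall(object_header(count + len(dynamic)))  # rewritten [o]/[O] header
    for member in dynamic:                    # server writes dynamic content
        sock.sendall(member)
    with open_raw_doc(docid) as f:            # raw record bytes straight off
        while True:                           # disk, no deserialization step
            chunk = f.read(65536)
            if not chunk:
                break
            sock.sendall(chunk)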

Thoughts?

Best,
Riyad

P.S.> There is support in the spec for unbounded container types for when couch
doesn't know how much data it is streaming back, but that isn't necessary for
retrieving stored docs (it could be handy, though, when responding to view
queries and other requests whose length is not known in advance).

On Tue, Oct 4, 2011 at 12:02 PM, Randall Leeds <randall.leeds@gmail.com> wrote:

> Hey,
>
> Thanks for this thread.
>
> I've been interested in ways to reduce the work from disk to client as
> well.
> Unfortunately, the metadata inside the document objects is variable based
> on
> query parameters (_attachments, _revisions, _revs_info...) so the server
> needs to decode the disk binary anyway.
>
> I would say this is something we should carefully consider for a 2.0 api. I
> know that, for simplicity, many people really like having the underscore
> prefixed attributes mixed in right alongside the document data, but a
> future
> API that separated these could really make things fly.
>
> -Randall
>
> On Wed, Sep 28, 2011 at 22:25, Benoit Chesneau <bchesneau@gmail.com>
> wrote:
>
> > On Thursday, September 29, 2011, Riyad Kalla <rkalla@gmail.com> wrote:
> > > DISCLAIMER: This looks long, but reads quickly (I hope). If you are in
> > > a rush, just check the last 2 sections and see if it sounds interesting.
> > >
> > >
> > > Hi everybody. I am new to the list, but a big fan of Couch and I have
> > > been working on something I wanted to share with the group.
> > >
> > > My apologies if this isn't the right venue or list etiquette... I wasn't
> > > really sure where to start with this conversation.
> > >
> > >
> > > Background
> > > =====================
> > > With the help of the JSON spec community I've been finalizing a
> > > universal, binary JSON format specification that offers 1:1
> > > compatibility with JSON.
> > >
> > > The full spec is here (http://ubjson.org/) and the quick list of types
> > > is here (http://ubjson.org/type-reference/). Differences with existing
> > > specs and "Why" are all addressed on the site in the first few sections.
> > >
> > > The goal of the specification was first to maintain 1:1 compatibility
> > > with JSON (no custom data structures - like what caused BSON to be
> > > rejected in Issue #702), secondly to be as simple to work with as
> > > regular JSON (no complex data structures or encoding/decoding
> > > algorithms to implement) and lastly, it had to be smaller than
> > > compacted JSON and faster to generate and parse.
> > >
> > > Using a test doc that I see Filipe reference in a few of his issues
> > > (http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following
> > > compression:
> > >
> > > * Compacted JSON: 3,861 bytes
> > > * Univ. Binary JSON: 3,056 bytes (20% smaller)
> > >
> > > In some other sample data (e.g. jvm-serializers sample data) I see a
> > > 27% compression with a typical compression range of 20-30%.
> > >
> > > While these compression levels are average, the data is written out in
> > > an unmolested format that is optimized for read speed (no scanning for
> > > null terminators) and criminally simple to work with. (win-win)
> > >
> > > I added more clarifying information about compression characteristics
> > > in the "Size Requirements" section of the spec for anyone interested.
> > >
> > >
> > > Motivation
> > > ======================
> > > I've been following the discussions surrounding a native binary JSON
> > > format for the core CouchDB file (Issue #1092), which transformed into
> > > keeping the format and utilizing Google's Snappy (Issue #1120) to
> > > provide what looks to be roughly a 40-50% reduction in file size at the
> > > cost of running the compression/decompression on every read/write.
> > >
> > > I realize in light of the HTTP transport and JSON encoding/decoding
> > > cycle in CouchDB, the Snappy compression cycles are a very small part
> > > of the total time the server spends working.
> > >
> > > I found this all interesting, but like I said, I realized up to this
> > > point that Snappy wasn't any form of bottleneck and the big compression
> > > wins server side were great, so I had nothing to contribute to the
> > > conversation.
> > >
> > >
> > > Catalyst
> > > ======================
> > > This past week I watched Tim Anglade's presentation
> > > (http://goo.gl/LLucD) and started to foam at the mouth when I saw his
> > > slides where he skipped the JSON encode/decode cycle server-side and
> > > just generated straight from binary on disk into MessagePack and got
> > > some phenomenal speedups from the server:
> > > http://i.imgscalr.com/XKqXiLusT.png
> > >
> > > I pinged Tim to see what the chances of adding Univ Binary JSON support
> > > was and he seemed amenable to the idea as long as I could hand him an
> > > Erlang or Ruby impl (unfortunately, I am not familiar with either).
> > >
> > >
> > > ah-HA! moment
> > > ======================
> > > Today it occurred to me that if CouchDB were able to (at the cost of
> > > 20% more disk space than it is using with Snappy enabled, but still 20%
> > > *less* than before Snappy was integrated) use the Universal Binary JSON
> > > format as its native storage format AND support for serving replies
> > > using the same format was added (a-la Tim's work), this would allow
> > > CouchDB to (theoretically) reply to queries by pulling bytes off disk
> > > (or memory) and immediately streaming them back to the caller with no
> > > intermediary step at all (no Snappy decompress, no Erlang decode, no
> > > JSON encode).
> > >
> > > Given that the Univ Binary JSON spec is standard, easy to parse and
> > > simple to convert back to JSON, adding support for it seemed more
> > > consistent with Couch's motto of ease and simplicity than say
> > > MessagePack or Protobuf, which provide better compression but at the
> > > cost of more complex formats and data types that have no analog in
> > > JSON.
> > >
> > > I don't know the intricacies of Couch's internals; if that is wrong and
> > > some Erlang manipulation of the data would still be required, I believe
> > > it would still be faster to pull the data off disk in the Univ Binary
> > > JSON format, decode to Erlang native types and then reply while
> > > skipping the Snappy decompression step.
> > >
> > > If it *would* be possible to stream it back un-touched directly from
> > > disk, that seems like an enhancement that could potentially speed up
> > > CouchDB by at least an order of magnitude.
> > >
> > >
> > > Conclusion
> > > =======================
> > > I would appreciate any feedback on this idea from you guys with a lot
> > > more knowledge of the internals.
> > >
> > > I have no problem if this is a horrible idea and never going to happen,
> > > I just wanted to try and contribute something back.
> > >
> > >
> > > Thank you all for reading.
> > >
> > > Best wishes,
> > > Riyad
> > >
> >
> > what is universal in something new?
> >
> > -  benoit
> >
>
