incubator-couchdb-dev mailing list archives

From Randall Leeds <randall.le...@gmail.com>
Subject Re: Universal Binary JSON in CouchDB
Date Tue, 04 Oct 2011 19:02:02 GMT
Hey,

Thanks for this thread.

I've been interested in ways to reduce the work from disk to client as well.
Unfortunately, the metadata inside the document objects varies with the query
parameters (_attachments, _revisions, _revs_info...) so the server needs to
decode the on-disk binary anyway.

I would say this is something we should carefully consider for a 2.0 API. I
know that, for simplicity, many people really like having the
underscore-prefixed attributes mixed in right alongside the document data, but
a future API that separated these could really make things fly.

-Randall

On Wed, Sep 28, 2011 at 22:25, Benoit Chesneau <bchesneau@gmail.com> wrote:

> On Thursday, September 29, 2011, Riyad Kalla <rkalla@gmail.com> wrote:
> > DISCLAIMER: This looks long, but reads quickly (I hope). If you are in a
> > rush,
> > just check the last 2 sections and see if it sounds interesting.
> >
> >
> > Hi everybody. I am new to the list, but a big fan of Couch and I have been
> > working on something I wanted to share with the group.
> >
> > My apologies if this isn't the right venue or list etiquette... I wasn't
> > really sure where to start with this conversation.
> >
> >
> > Background
> > =====================
> > With the help of the JSON spec community I've been finalizing a universal,
> > binary JSON format specification that offers 1:1 compatibility with JSON.
> >
> > The full spec is here (http://ubjson.org/) and the quick list of types is
> > here (http://ubjson.org/type-reference/). Differences with existing specs
> > and "Why" are all addressed on the site in the first few sections.
> >
> > The goal of the specification was first to maintain 1:1 compatibility with
> > JSON (no custom data structures - like what caused BSON to be rejected in
> > Issue #702), secondly to be as simple to work with as regular JSON (no
> > complex data structures or encoding/decoding algorithms to implement) and
> > lastly, it had to be smaller than compacted JSON and faster to generate
> > and parse.
> >
> > Using a test doc that I see Filipe reference in a few of his issues
> > (http://friendpaste.com/qdfyId8w1C5vkxROc5Thf) I get the following
> > compression:
> >
> > * Compacted JSON: 3,861 bytes
> > * Univ. Binary JSON: 3,056 bytes (20% smaller)
> >
> > In some other sample data (e.g. jvm-serializers sample data) I see a 27%
> > compression, with a typical compression range of 20-30%.
> >
> > While these compression levels are average, the data is written out in an
> > unmolested format that is optimized for read speed (no scanning for null
> > terminators) and criminally simple to work with. (win-win)
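[To make the "no scanning for null terminators" point concrete, here is a
minimal Python sketch of a length-prefixed string read. The exact marker and
length layout is defined at ubjson.org; the 4-byte big-endian prefix used here
is an assumption for illustration, not the authoritative wire format.]

```python
import struct

def read_string(buf, pos):
    # A 4-byte big-endian length prefix, then the UTF-8 payload. Because the
    # parser learns the payload size up front, it can slice (or skip) the
    # value in O(1) instead of scanning byte-by-byte for a terminator.
    # Layout assumed for illustration; see ubjson.org for the real markers.
    (n,) = struct.unpack_from(">I", buf, pos)
    payload = buf[pos + 4 : pos + 4 + n].decode("utf-8")
    return payload, pos + 4 + n

buf = struct.pack(">I", 5) + b"hello" + struct.pack(">I", 5) + b"world"
first, end = read_string(buf, 0)   # "hello", next value starts at offset 9
second, _ = read_string(buf, end)  # "world"
```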
> >
> > I added more clarifying information about compression characteristics in
> > the "Size Requirements" section of the spec for anyone interested.
> >
> >
> > Motivation
> > ======================
> > I've been following the discussions surrounding a native binary JSON
> > format for the core CouchDB file (Issue #1092), which transformed into
> > keeping the format and utilizing Google's Snappy (Issue #1120) to provide
> > what looks to be roughly a 40-50% reduction in file size at the cost of
> > running the compression/decompression on every read/write.
> >
> > I realize that, in light of the HTTP transport and JSON encoding/decoding
> > cycle in CouchDB, the Snappy compression cycles are a very small part of
> > the total time the server spends working.
> >
> > I found this all interesting, but like I said, I realized up to this
> > point that Snappy wasn't any form of bottleneck and the big compression
> > wins server side were great, so I had nothing to contribute to the
> > conversation.
> >
> >
> > Catalyst
> > ======================
> > This past week I watched Tim Anglade's presentation (http://goo.gl/LLucD)
> > and started to foam at the mouth when I saw his slides where he skipped
> > the JSON encode/decode cycle server-side and just generated straight from
> > binary on disk into MessagePack and got some phenomenal speedups from the
> > server: http://i.imgscalr.com/XKqXiLusT.png
> >
> > I pinged Tim to see what the chances of adding Univ Binary JSON support
> > were, and he seemed amenable to the idea as long as I could hand him an
> > Erlang or Ruby impl (unfortunately, I am not familiar with either).
> >
> >
> > ah-HA! moment
> > ======================
> > Today it occurred to me that if CouchDB used the Universal Binary JSON
> > format as its native storage format (at the cost of 20% more disk space
> > than it is using with Snappy enabled, but still 20% *less* than before
> > Snappy was integrated) AND support was added for serving replies in the
> > same format (a-la Tim's work), this would allow CouchDB to
> > (theoretically) reply to queries by pulling bytes off disk (or memory)
> > and immediately streaming them back to the caller with no intermediary
> > step at all (no Snappy decompress, no Erlang decode, no JSON encode).
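[The proposed two-path response flow could be sketched roughly like this.
Everything here is hypothetical: the store, the `ubjson_to_dict` stand-in, and
the `application/ubjson` content type are invented for illustration and are
not CouchDB APIs.]

```python
import json

def ubjson_to_dict(raw):
    # Stand-in for a real UBJSON decoder; invented purely for this sketch.
    return {"stored_bytes": len(raw)}

STORE = {"doc1": b"\x00" * 8}  # pretend: UBJSON-encoded document bytes

def fetch(doc_id, accept):
    raw = STORE[doc_id]
    if accept == "application/ubjson":
        # Fast path: stream the stored bytes untouched -- no Snappy
        # decompress, no Erlang decode, no JSON encode.
        return raw
    # Fallback path for plain-JSON clients: decode, then re-encode as text.
    return json.dumps(ubjson_to_dict(raw)).encode()
```

The win in the fast path is that the response is a straight copy of what is
on disk, so the per-request serialization cost drops to (roughly) zero.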
> >
> > Given that the Univ Binary JSON spec is standard, easy to parse and
> > simple to convert back to JSON, adding support for it seemed more
> > consistent with Couch's motto of ease and simplicity than, say,
> > MessagePack or Protobuf, which provide better compression but at the cost
> > of more complex formats and data types that have no analog in JSON.
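[A toy decoder illustrates how little machinery the "simple to convert back
to JSON" claim requires. The marker letters and field widths below are a
simplified, assumed subset; the authoritative list is at
http://ubjson.org/type-reference/.]

```python
import struct

def decode(buf, pos=0):
    # One-byte type marker, then a fixed-width or length-prefixed payload.
    # Subset assumed here: Z=null, T=true, F=false, I=int32, S=string.
    m = buf[pos:pos + 1]
    if m == b"Z":
        return None, pos + 1
    if m == b"T":
        return True, pos + 1
    if m == b"F":
        return False, pos + 1
    if m == b"I":  # 32-bit big-endian integer (width assumed for the sketch)
        (v,) = struct.unpack_from(">i", buf, pos + 1)
        return v, pos + 5
    if m == b"S":  # int32 length prefix followed by UTF-8 bytes
        (n,) = struct.unpack_from(">i", buf, pos + 1)
        start = pos + 5
        return buf[start:start + n].decode("utf-8"), start + n
    raise ValueError("unknown marker %r" % m)

value, _ = decode(b"S" + struct.pack(">i", 7) + b"CouchDB")  # "CouchDB"
```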
> >
> > I don't know the intricacies of Couch's internals; if that is wrong and
> > some Erlang manipulation of the data would still be required, I believe
> > it would still be faster to pull the data off disk in the Univ Binary
> > JSON format, decode to Erlang native types and then reply, while skipping
> > the Snappy decompression step.
> >
> > If it *would* be possible to stream it back untouched directly from disk,
> > that seems like an enhancement that could potentially speed up CouchDB by
> > at least an order of magnitude.
> >
> >
> > Conclusion
> > =======================
> > I would appreciate any feedback on this idea from you guys with a lot
> > more knowledge of the internals.
> >
> > I have no problem if this is a horrible idea and never going to happen; I
> > just wanted to try and contribute something back.
> >
> >
> > Thank you all for reading.
> >
> > Best wishes,
> > Riyad
> >
>
> what is universal in something new?
>
> -  benoit
>
