couchdb-user mailing list archives

From "Paul Davis" <>
Subject Re: drill into a doc with a GET?
Date Thu, 08 Jan 2009 03:43:49 GMT
On Wed, Jan 7, 2009 at 5:13 PM, Sho Fukamachi <> wrote:
> On 08/01/2009, at 1:24 PM, Paul Davis wrote:
>> [..]
>> function(doc)
>> {
>>   for(var field in doc)
>>   {
>>     emit([field, doc._id], doc[field]);
>>   }
>> }
> I call that an "exploded index" and worry somewhat about its storage usage.

I like it, the closest I could think of was "inverted index with repeated keys"

> Two concerns:
> - you'd be needlessly re-storing the large data that the OP wanted to avoid
> transferring. Presumably it's big. You might be able to manually exclude it
> if it always has the same name, of course

The big-ness would be stored to disk and not necessarily retrieved,
but of course there are plenty of ways to optimize the idea to a
specific use case.
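
As one example of such an optimization, here is a minimal sketch of the
same map function that leaves a known large field out of the index. The
field name "big-ass-prop" comes from Rob's example document. The SKIP
table, the emit() stub, and the rows array are assumptions added only so
the snippet runs outside CouchDB (inside a view the server supplies emit).

```javascript
// Stand-in for CouchDB's emit(); collects rows so the sketch runs in plain JS.
var rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Assumption: the large fields always have the same, known names.
var SKIP = { "big-ass-prop": true };

// Exploded index as before, minus the fields listed in SKIP.
function map(doc) {
  for (var field in doc) {
    if (SKIP[field]) continue; // never copy the large value into the index
    emit([field, doc._id], doc[field]);
  }
}

// Rob's example document.
map({ _id: "a", _rev: "123", foo: { bar: 1 }, "big-ass-prop": "huge amount of stuff" });
console.log(rows);
```

The large value still lives in the document itself, but the view no
longer stores a second copy of it.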

> - if there's a lot of records with a lot of small fields, index overhead
> might double or even triple the database size
> An alternative might be to take the reverse to that approach, and write a
> view which returned all the field except the large entry (if known) you're
> trying to avoid transferring. That way, you'd avoid having to re-store those
> large fields in the index as well.

Like before, I was just trying to show the idea. I see the two as
extremes of the continuum.
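
For comparison, the other end of that continuum might look roughly like
the sketch below: one row per document, carrying every field except the
large one. LARGE_FIELD, mapSlim, slimRows, and the emit() stub are names
I made up for illustration, and this assumes the large field's name is
known in advance.

```javascript
// Stand-in for CouchDB's emit(); collects rows so the sketch runs in plain JS.
var slimRows = [];
function emit(key, value) {
  slimRows.push({ key: key, value: value });
}

// Assumption: the field to leave out has a fixed, known name.
var LARGE_FIELD = "big-ass-prop";

// One row per document, keyed by _id, holding everything but the large field.
function mapSlim(doc) {
  var slim = {};
  for (var field in doc) {
    if (field !== LARGE_FIELD) {
      slim[field] = doc[field];
    }
  }
  emit(doc._id, slim);
}

mapSlim({ _id: "a", _rev: "123", foo: { bar: 1 }, "big-ass-prop": "huge amount of stuff" });
console.log(slimRows);
```

This keeps the row count equal to the document count, at the cost of
returning all the small fields together rather than one at a time.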

> Storage is cheap*, but obviously it would be bad practise to needlessly
> double (or worse) the database size.
> I have often wondered the exact overhead of a row in a view index.
> Obviously, if it's more than a few bytes, it's going to be a factor to
> consider when contemplating view index strategies which generate an awful
> lot of index rows. If there are a large number of fields with a small amount
> of data in each, and a large number of documents, it is quite plausible the
> "exploded index" could be several times the original size of the data.

It's possible.

> Anyone with inside knowledge want to chip in on that? What would be the
> approximate overhead, per-entry, of an exploded view index as described by
> Paul? Or maybe I should just test it, since I've been wondering about that
> for a while ...

I'm not 100% certain what binary representation Erlang uses to store
its data. The few representations I worked with on the C driver side
are term type dependent, so it'd be hard to give an exact answer if
any of them were the actual 'native' binary representation.

Also, determining this analytically would require an approximation to
amortize the tree structure to a per-row level.

Also, there's no compaction for views yet, which, again depending on
the use case, could lead to inflated values.

> Sho
> * good storage is not actually cheap

I would say that this depends on your definition of storage. Huge SCSI
RAID arrays can cost a chunk of change for sure. And the drives
themselves come in a range of grades. Though, there is the highly
amusing Google white paper comparing failure rates of disk drives. The
basic conclusion was that the strongest indicator of failure was the
lot number of the drive.

Though, even with all that, I would be interested in any numbers you
come up with. Counting and measuring is always a good use of time.
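
A very rough way to start counting, before touching a real database, is
to compare the JSON text length of some sample documents with the JSON
text length of the rows the exploded view would emit. This is only a
proxy: it ignores the on-disk term format, the tree structure, and the
lack of view compaction mentioned above. The function names and the
sample documents are made up for illustration.

```javascript
// Proxy for size: length of the JSON text, not CouchDB's on-disk format.
function jsonChars(value) {
  return JSON.stringify(value).length;
}

// Totals for the documents and for the rows the exploded map would emit.
function estimate(docs) {
  var docChars = 0, rowChars = 0, rowCount = 0;
  docs.forEach(function (doc) {
    docChars += jsonChars(doc);
    for (var field in doc) {
      // Each row stores the key [field, doc._id] plus the field's value.
      rowChars += jsonChars([field, doc._id]) + jsonChars(doc[field]);
      rowCount += 1;
    }
  });
  return {
    rowCount: rowCount,
    docChars: docChars,
    rowChars: rowChars,
    ratio: rowChars / docChars
  };
}

// Hypothetical sample: many small fields per document.
var sample = [
  { _id: "a", x: 1, y: 2, z: 3 },
  { _id: "b", x: 4, y: 5, z: 6 }
];
var result = estimate(sample);
console.log(result);
```

Real numbers would have to come from comparing the database file with
the view index file on disk.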

Paul Davis

>> Then to access a specific property:
>> "docid1"]
>> HTH,
>> Paul Davis
>> On Wed, Jan 7, 2009 at 4:18 PM, Robert Koberg <> wrote:
>>> Hi,
>>> first, couchdb is just beautiful! :) (using 0.8.1-incubating from
>>> MacPorts)
>>> I am very new, and have read the available docs and several blog posts.
>>> Can you drill into a doc with a simple GET?
>>> Say I have a doc like:
>>> {"_id": "a", "_rev": "123", "foo":{"bar": 1}, "big-ass-prop": "huge amount of stuff"}
>>> Ideally I would like to be able to call something like:
>>> to return {"bar":1} and avoid downloading "big-ass-prop"
>>> Is this or something like it possible?
>>> (I realize "foo" is a 'sibling' of the _id in the document, but it is
>>> probably treated more like a parent in the DB?)
>>> If not possible, is it possible to create some kind of default
>>> action/filter/? that does something like the above? That is, reads the
>>> request uri, recognizes it is a document and that there is extra path
>>> info which should be used to resolve a property.
>>> thanks,
>>> -Rob
