Mailing-List: contact user-help@couchdb.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@couchdb.apache.org
MIME-Version: 1.0
Sender: djc.ochtman@gmail.com
In-Reply-To: <AANLkTi=c-YTZiVi3FAi9vXaj61SxdH577FJ7h_xFprje@mail.gmail.com>
References: <AANLkTins6kbFFz0HgFqxj9GH12UJF0myLPs0oUo4UxH-@mail.gmail.com>
 <AANLkTi=c-YTZiVi3FAi9vXaj61SxdH577FJ7h_xFprje@mail.gmail.com>
From: Dirkjan Ochtman <dirkjan@ochtman.nl>
Date: Wed, 3 Nov 2010 11:35:32 +0100
Message-ID: <AANLkTinaygC4FvGCeRTDwXGZePD4sP1KJi0pNJeeqGOo@mail.gmail.com>
Subject: Re: Using CouchDB to represent the tokenized text of a book
To: user@couchdb.apache.org
Content-Type: text/plain; charset=UTF-8

On Wed, Nov 3, 2010 at 11:16, Weston Ruter <westonruter@gmail.com> wrote:
> I am investigating alternative methods of storing the tokenized text of a
> book in a database. The text is broken up into individual tokens (e.g. word,
> punctuation mark, etc.) and each is assigned a separate ID and exists as a
> separate object. There can be hundreds of thousands of token objects. The
> existing method I've employed in a relational database is to have a table
> Token(id, data, position), where position is an integer used to ensure the
> tokens are rendered in the proper document order. The obvious problem with
> this use of "position" is with insertions and deletions, which make an
> update necessary on all subsequent tokens, and this is expensive,
> e.g. after deleting: UPDATE Token SET position = position - 1
> WHERE position > old_token_position

A "book" doesn't really sound like something that suffers from a lot
of insertions and deletions in the middle...

> I was hoping that, with CouchDB's support for documents containing arrays,
> I could avoid an explicit position at all and just rely on the
> implicit position each object has in the context of where it lies in the
> token array (this would also facilitate text revisions). In this approach,
> however, it is critical that the entire document not have to be re-loaded
> and re-saved when a change is made (as I imagine this would be even slower
> than SQL UPDATE); I was hoping that an insertion or deletion could be done
> in a patch manner so that they could be done efficiently. But from asking my
> question on Twitter, it appears that the existing approach I took with the
> relational database is also what would be required by CouchDB.

Yeah, CouchDB doesn't support patching documents, so you'd have to
update the whole document. My gut feeling says you don't want a large
document here.
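
To illustrate the cost, here is a rough sketch in Python. It makes no real
CouchDB calls, and the document layout (a single "tokens" array) is only an
assumption about how such a document might look. It shows that a one-token
insertion still produces a full-size body to write back:

```python
import json


def insert_token(doc, index, token):
    """Return a copy of the document with one token inserted at index.

    Without partial updates, the full body returned here is what would
    have to be written back to the database.
    """
    updated = dict(doc)
    updated["tokens"] = doc["tokens"][:index] + [token] + doc["tokens"][index:]
    return updated


# Hypothetical book document: one array holding every token.
book = {"_id": "book-1", "tokens": ["word%d" % i for i in range(200000)]}

updated = insert_token(book, 5, "new")
payload = json.dumps(updated)

# The change is a single token, but the payload is the entire book.
print(len(updated["tokens"]), len(payload))
```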

> Is there a more elegant way to store my data set in CouchDB?

It sounds like you want to come up with a kind of index value that
lets you update only a subset of the documents instead of all of them,
and then use that value as a sorting bucket.

For instance, in your book model, save the page number with each word.
On an insertion, update only the words on that page that come after it,
and sort by page, then by word order within the page. This kind of idea
works in either a relational database or CouchDB. If pages are still too
large, you could bucket by paragraph (or add chapters before pages).
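
A minimal sketch of the bucket idea in Python, using plain in-memory dicts
rather than a real database. The field names ("page", "pos", "data") and the
helper names are made up for illustration. In CouchDB, a view keyed on
something like [page, pos] should give the same ordering, but I haven't
tried that here.

```python
def insert_on_page(tokens, page, pos, data):
    """Insert a token at (page, pos) and return how many tokens were shifted.

    Only tokens on the same page at or after pos are renumbered;
    every other page is left untouched.
    """
    shifted = 0
    for t in tokens:
        if t["page"] == page and t["pos"] >= pos:
            t["pos"] += 1
            shifted += 1
    tokens.append({"page": page, "pos": pos, "data": data})
    return shifted


def render(tokens):
    """Join token data in document order: by page, then position in page."""
    ordered = sorted(tokens, key=lambda t: (t["page"], t["pos"]))
    return " ".join(t["data"] for t in ordered)


tokens = [
    {"page": 1, "pos": 0, "data": "The"},
    {"page": 1, "pos": 1, "data": "quick"},
    {"page": 1, "pos": 2, "data": "fox"},
    {"page": 2, "pos": 0, "data": "jumps"},
    {"page": 2, "pos": 1, "data": "over"},
    {"page": 2, "pos": 2, "data": "it"},
]

shifted = insert_on_page(tokens, 1, 2, "brown")
print(shifted)         # 1: only "fox" had to move
print(render(tokens))  # The quick brown fox jumps over it
```

The smaller the bucket, the fewer tokens each insertion touches, at the
price of a longer sort key.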

Cheers,

Dirkjan