incubator-couchdb-user mailing list archives

From Nils Breunese <N.Breun...@vpro.nl>
Subject Re: Using CouchDB to represent the tokenized text of a book
Date Wed, 03 Nov 2010 10:46:28 GMT
Weston Ruter wrote:

> I am investigating alternative methods of storing the tokenized text of a
> book in a database. The text is broken up into individual tokens (e.g. word,
> punctuation mark, etc.) and each is assigned a separate ID and exists as a
> separate object. There can be hundreds of thousands of token objects. The
> existing method I've employed in a relational database is to have a table
> Token(id, data, position), where position is an integer used to ensure the
> tokens are rendered in the proper document order. The obvious problem with
> this use of "position" is with insertions and deletions, which make an
> update necessary on all subsequent tokens, and this is expensive,
> e.g. after deleting:
> UPDATE Token SET position = position - 1 WHERE position > old_token_position
>
> I was hoping that with CouchDB and its support for documents containing
> arrays I could avoid an explicit position altogether and just rely on the
> implicit position each object has in the token array (this would also
> facilitate text revisions). In this approach,
> however, it is critical that the entire document not have to be re-loaded
> and re-saved when a change is made (as I imagine this would be even slower
> than SQL UPDATE); I was hoping that an insertion or deletion could be done
> as a patch so that it could be done efficiently. But from asking my
> question on Twitter, it appears that the existing approach I took with the
> relational database is also what would be required by CouchDB.

That is correct. Storing the tokens of a book in an array in a single document would require
retrieving, modifying and saving the complete document for every change. Storing the tokens as
separate documents with an increasing position number would of course involve the same kind of
renumbering as you are doing in your relational setup.

It sounds like a linked list kind of storage scenario, where every token has pointers to the
previous and next token, might better fit your needs for reconstructing a book from the tokens.
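
To illustrate the linked list idea, here is a minimal Python sketch. It uses a plain dict as a
stand-in for the database, and the field names "prev" and "next" and the helper names are
assumptions made up for this example, not anything CouchDB prescribes. The point is only that
an insertion or deletion touches at most three token documents, however long the book is.

```python
# A dict mapping _id -> token document stands in for the database.
# "prev"/"next" are hypothetical field names chosen for this sketch.

def make_book(words):
    """Create one token document per word, linked in reading order."""
    ids = [f"tok-{i}" for i in range(len(words))]
    docs = {}
    for i, word in enumerate(words):
        docs[ids[i]] = {
            "_id": ids[i],
            "data": word,
            "prev": ids[i - 1] if i > 0 else None,
            "next": ids[i + 1] if i + 1 < len(words) else None,
        }
    return docs


def insert_after(docs, anchor_id, new_id, data):
    """Insert a token after anchor_id and return the documents that changed."""
    anchor = docs[anchor_id]
    new_doc = {"_id": new_id, "data": data,
               "prev": anchor_id, "next": anchor["next"]}
    changed = [new_doc, anchor]
    if anchor["next"] is not None:
        successor = docs[anchor["next"]]
        successor["prev"] = new_id
        changed.append(successor)
    anchor["next"] = new_id
    docs[new_id] = new_doc
    return changed


def delete_token(docs, token_id):
    """Unlink a token and return the neighbouring documents that changed."""
    doc = docs.pop(token_id)
    changed = []
    if doc["prev"] is not None:
        docs[doc["prev"]]["next"] = doc["next"]
        changed.append(docs[doc["prev"]])
    if doc["next"] is not None:
        docs[doc["next"]]["prev"] = doc["prev"]
        changed.append(docs[doc["next"]])
    return changed


def render(docs):
    """Walk the list from the head (the token without a predecessor)."""
    cur = next((d for d in docs.values() if d["prev"] is None), None)
    out = []
    while cur is not None:
        out.append(cur["data"])
        cur = docs[cur["next"]] if cur["next"] is not None else None
    return out
```

The trade-off is on the read side: reconstructing the text means following the pointers, so
in practice you would probably fetch all token documents of a book at once and link them up
in the client, as render() does here.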

> Is there a more elegant way to store my data set in CouchDB?

If I were to use CouchDB I think I'd use a document per token. I'd test how expensive updating
the sequence IDs is (using the HTTP bulk document API [0]) and, depending on how often sequence
updates need to happen, I might switch to a linked list kind of approach. (You could use
the same approach in a relational database, of course.)
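
As a rough sketch of the renumbering case, the Python below builds the JSON body for one bulk
request that deletes a token and shifts every later token down by one. The "position" field
and the function name are assumptions for this example. The body shape ({"docs": [...]}) and
the "_deleted" flag are what the bulk document API expects, and each updated document has to
carry its current _rev. The sketch only builds the payload; sending it would be a POST to the
database's _bulk_docs URL.

```python
import json


def renumber_after_delete(token_docs, deleted_position):
    """Build a _bulk_docs body that deletes one token and renumbers the rest.

    token_docs: token documents as currently stored, each with
    _id, _rev and a (hypothetical) integer "position" field.
    """
    docs = []
    for doc in token_docs:
        if doc["position"] == deleted_position:
            # A deletion is just an update that sets _deleted.
            docs.append({"_id": doc["_id"], "_rev": doc["_rev"],
                         "_deleted": True})
        elif doc["position"] > deleted_position:
            updated = dict(doc)  # copy, so the caller's data stays untouched
            updated["position"] = doc["position"] - 1
            docs.append(updated)
    return {"docs": docs}


def to_request_body(payload):
    """Serialize the payload as the JSON text to send in the POST request."""
    return json.dumps(payload)
```

Note that the payload still contains every token after the deleted one, so its size grows
with the distance to the end of the book. That cost is what I would measure before deciding
between renumbering and the linked list.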

Are you planning on storing more than just the tokens and their order? If not, I'm wondering
what the use of storing a book as a list of tokens actually is. Sounds like a plain text file
would do the job as well, but I'm sure there is a point. :o)

> Note that I am very new to CouchDB and am ignorant of a lot of its features.

The Definitive Guide [1] is a nice read.

Nils.

[0] http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
[1] http://guide.couchdb.org/
------------------------------------------------------------------------
 VPRO
 phone:  +31(0)356712911
 e-mail: info@vpro.nl
 web:    www.vpro.nl
------------------------------------------------------------------------
