Weston Ruter wrote:
> I am investigating alternative methods of storing the tokenized text of a
> book in a database. The text is broken up into individual tokens (e.g. word,
> punctuation mark, etc.) and each is assigned a separate ID and exists as a
> separate object. There can be hundreds of thousands of token objects. The
> existing method I've employed in a relational database is to have a table
> Token(id, data, position), where position is an integer used to ensure the
> tokens are rendered in the proper document order. The obvious problem with
> this use of "position" is with insertions and deletions, which make an
> update necessary on all subsequent tokens, and this is expensive,
> e.g. after deleting:
>
>   UPDATE Token SET position = position - 1 WHERE position > old_token_position
>
> I was hoping that with CouchDB and its support for documents containing
> arrays I could avoid an explicit position altogether and just rely on the
> implicit position each object has in the context of where it lies in the
> token array (this would also facilitate text revisions). In this approach,
> however, it is critical that the entire document not have to be re-loaded
> and re-saved when a change is made (as I imagine this would be even slower
> than SQL UPDATE); I was hoping that an insertion or deletion could be done
> in a patch manner so that they could be done efficiently. But from asking my
> question on Twitter, it appears that the existing approach I took with the
> relational database is also what would be required by CouchDB.
That is correct. Storing the tokens of a book in an array in a single document would require
retrieving, modifying and saving the complete document for every change. Storing the tokens as
separate documents with an increasing position would of course involve the same kind of
renumbering as you are doing in your relational setup.
It sounds like a linked list kind of storage scenario, where every token has pointers to the
previous and next token, might better fit your needs for reconstructing a book from the tokens.
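To make the linked-list idea concrete, here is a minimal in-memory sketch in Python. It is not CouchDB code: the dict stands in for the database, and the field names "prev" and "next", the token IDs and the helper names are all made up for illustration. The point it shows is that an insertion or deletion changes at most three token documents, however long the book is.

```python
# Hypothetical model: token ID -> {"data", "prev", "next"}.
# "prev"/"next" hold neighbouring token IDs, or None at either end.

def build(words):
    """Create a linked token list with made-up IDs t0, t1, ..."""
    last = len(words) - 1
    return {
        f"t{i}": {
            "data": word,
            "prev": f"t{i - 1}" if i > 0 else None,
            "next": f"t{i + 1}" if i < last else None,
        }
        for i, word in enumerate(words)
    }

def insert_after(tokens, anchor_id, new_id, data):
    """Insert a token after anchor_id; return the IDs of changed documents."""
    anchor = tokens[anchor_id]
    old_next = anchor["next"]
    tokens[new_id] = {"data": data, "prev": anchor_id, "next": old_next}
    anchor["next"] = new_id
    changed = [anchor_id, new_id]
    if old_next is not None:
        tokens[old_next]["prev"] = new_id
        changed.append(old_next)
    return changed

def delete(tokens, token_id):
    """Remove a token and relink its neighbours; return the changed IDs."""
    token = tokens.pop(token_id)
    changed = []
    if token["prev"] is not None:
        tokens[token["prev"]]["next"] = token["next"]
        changed.append(token["prev"])
    if token["next"] is not None:
        tokens[token["next"]]["prev"] = token["prev"]
        changed.append(token["next"])
    return changed

def render(tokens, head_id):
    """Walk the "next" pointers from head_id and collect the token data."""
    out = []
    current = head_id
    while current is not None:
        out.append(tokens[current]["data"])
        current = tokens[current]["next"]
    return out
```

The trade-off is on the read side: rendering means walking the pointers, so the client needs all tokens of the book at hand, and the caller has to keep track of which token is the head (for example when the first token is deleted).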
> Is there a more elegant way to store my data set in CouchDB?
If I were to use CouchDB I think I'd use a document per token. I'd test how expensive updating
the sequence IDs is (using the HTTP bulk document API [0]) and, depending on how often sequence
updates need to happen, I might switch to a linked-list approach. (You could use the same
approach in a relational database, of course.)
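As a rough sketch of what such a bulk renumbering could look like, the Python function below builds the JSON body for a single POST to the database's _bulk_docs resource. It assumes each token document has a "position" field (mirroring the Token table) and that the documents were fetched beforehand, so that each one carries its current _id and _rev, which CouchDB needs in order to accept an update. The function name and the sample documents are made up, and sending the request is left out.

```python
import json

def shift_positions_payload(docs, removed_position):
    """Build a _bulk_docs request body that closes the gap left by a deleted token.

    docs: token documents as fetched from the database (with _id and _rev).
    removed_position: position of the token that was deleted.
    """
    updated = []
    for doc in docs:
        if doc["position"] > removed_position:
            # Copy the document so the caller's data is untouched;
            # _id and _rev are carried over unchanged.
            updated.append({**doc, "position": doc["position"] - 1})
    return json.dumps({"docs": updated})

# Hypothetical documents remaining after the token at position 1 was deleted.
remaining = [
    {"_id": "t0", "_rev": "1-a", "data": "Call", "position": 0},
    {"_id": "t2", "_rev": "1-c", "data": "Ishmael", "position": 2},
    {"_id": "t3", "_rev": "1-d", "data": ".", "position": 3},
]
body = shift_positions_payload(remaining, removed_position=1)
```

This is still one write per subsequent token, just bundled into a single HTTP request, so it is worth measuring on a realistically sized book before settling on it.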
Are you planning on storing more than just the tokens and their order? If not, I'm wondering
what the use of storing a book as a list of tokens actually is. Sounds like a plain text file
would do the job as well, but I'm sure there is a point. :o)
> Note that I am very new to CouchDB and am ignorant of a lot of its features.
The Definitive Guide [1] is a nice read.
Nils.
[0] http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
[1] http://guide.couchdb.org/