incubator-couchdb-user mailing list archives

From "Kevin R. Coombes" <kevin.r.coom...@gmail.com>
Subject Re: Using CouchDB to represent the tokenized text of a book
Date Wed, 03 Nov 2010 14:42:51 GMT
  Why not avoid rewriting the subset at all?

Use position as a real number instead of an integer.  An insertion 
between positions 100 and 101 can be assigned position 100.5.  Since 
you can always find a real number between any two positions, an 
insertion never forces you to update anything else.  For deletions, 
you just have to allow gaps, which the real-number positions 
accommodate anyway.  You recover a coherent section of the book (in 
CouchDB fashion) by specifying the startkey and endkey on a query 
keyed by the (real-number) position.
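
A minimal sketch of what I mean, against CouchDB's HTTP API (the 
database name "book" and the view "tokens/by_position" are made-up 
names for illustration; the view's map function would just be 
emit(doc.position, null)):

    import requests

    DB = "http://localhost:5984/book"  # hypothetical database

    def insert_between(pos_before, pos_after, data):
        # Any real number strictly between the neighbours works; the
        # midpoint leaves room for later insertions on both sides.
        new_pos = (pos_before + pos_after) / 2.0
        requests.post(DB, json={"data": data, "position": new_pos})

    def fetch_section(start, end):
        # Recover a coherent slice of the book by position range.
        resp = requests.get(
            DB + "/_design/tokens/_view/by_position",
            params={"startkey": start, "endkey": end,
                    "include_docs": "true"},
        )
        return [row["doc"] for row in resp.json()["rows"]]

(One caveat: repeated halving eventually exhausts float precision, so 
a long-lived text may need an occasional renumbering pass.)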

     Kevin

On 11/3/2010 8:04 AM, Weston Ruter wrote:
> Thanks a lot for the replies, Dirkjan and Nils. My replies inline below:
>
> On Wed, Nov 3, 2010 at 11:35 AM, Dirkjan Ochtman<dirkjan@ochtman.nl>  wrote:
>
>> On Wed, Nov 3, 2010 at 11:16, Weston Ruter<westonruter@gmail.com>  wrote:
>>> I am investigating alternative methods of storing the tokenized text of
>>> a book in a database. The text is broken up into individual tokens (e.g.
>>> word, punctuation mark, etc.), and each is assigned a separate ID and
>>> exists as a separate object. There can be hundreds of thousands of token
>>> objects. The existing method I've employed in a relational database is
>>> to have a table Token(id, data, position), where position is an integer
>>> used to ensure the tokens are rendered in the proper document order. The
>>> obvious problem with this use of "position" is with insertions and
>>> deletions, which make an update necessary on all subsequent tokens, and
>>> this is expensive, e.g. after deleting:
>>>
>>>     UPDATE Token SET position = position - 1
>>>     WHERE position > old_token_position
>> A "book" doesn't really sound like something that suffers from a lot
>> of insertions and deletions in the middle...
>>
> Specifically, I'm looking at books that are in constant flux, i.e. books
> that are being edited. The application here is for Bible translations in
> particular, where each word token needs to be keyed to other metadata,
> like a link to the source word, insertion datetime, translator, etc. Now
> that I think of it, in order to be referenceable, each token would have to
> exist as a separate document anyway, since parts of a document can't be
> addressed by ID, as far as I know.
>
>
>
>
>>> I was hoping that CouchDB, with its support for documents containing
>>> arrays, would let me avoid an explicit position entirely and just rely
>>> on the implicit position each object has within the token array (this
>>> would also facilitate text revisions). In this approach, however, it is
>>> critical that the entire document not have to be re-loaded and re-saved
>>> when a change is made (as I imagine this would be even slower than the
>>> SQL UPDATE); I was hoping that an insertion or deletion could be done in
>>> a patch manner so that it could be done efficiently. But from asking my
>>> question on Twitter, it appears that the existing approach I took with
>>> the relational database is also what would be required by CouchDB.
>> Yeah, CouchDB doesn't support patching documents, so you'd have to
>> update the whole document. My gut feeling says you don't want a large
>> document here.
>>
>>> Is there a more elegant way to store my data set in CouchDB?
>> It sounds like you want to come up with a kind of index value that
>> will prevent you from having to update all the documents (updating
>> only a subset instead), and then use that value as a sorting bucket.
>>
>> For instance, in your book model, save the page number with each word,
>> update only the other words on that page that come after it, and sort by
>> page, then word order. But this type of idea could work in either
>> relational databases or CouchDB. If pages are still too large, you
>> could add paragraphs (or add chapters before pages).
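>>
>> To make that concrete, a rough sketch (the view "tokens/by_page_pos"
>> and database "book" are invented for the example): emit a composite
>> [page, pos] key and let CouchDB's view collation do the ordering.
>>
>>     # map function of the hypothetical view, in JavaScript:
>>     #   function (doc) { emit([doc.page, doc.pos], doc.data); }
>>     import requests
>>
>>     def tokens_on_pages(first_page, last_page):
>>         # Composite keys collate element by element, so rows sort by
>>         # page first, then by position within the page. An object {}
>>         # sorts after every number, so it closes the endkey range.
>>         resp = requests.get(
>>             "http://localhost:5984/book/_design/tokens/_view/by_page_pos",
>>             params={"startkey": "[%d, 0]" % first_page,
>>                     "endkey": "[%d, {}]" % last_page},
>>         )
>>         return [row["value"] for row in resp.json()["rows"]]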
>>
> That is a good idea, but the problem with Bible translations in particular
> is the issue of overlapping hierarchies: chapter and verse don't always
> fall along the same divisions as section and paragraph. So the data model
> I've been moving toward is standoff markup, where there is a set of tokens
> (words, punctuation) for the entire book and then a set of structures
> (paragraphs, verses, etc.) that refer to their start token and end token;
> getting a structure then means retrieving all tokens from start to end.
> The use of standoff markup and overlapping hierarchies makes your idea of
> sorting buckets infeasible, I think. Thanks for the idea though!
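>
> To illustrate the shape of the model I mean (the field names here are
> just illustrative, nothing is settled):
>
>     # a token document and a standoff structure document
>     token = {"_id": "tok-000123", "type": "token",
>              "data": "beginning", "position": 123.0}
>     verse = {"_id": "struct-GEN-1-1", "type": "structure",
>              "kind": "verse",
>              "start": "tok-000120",  # first token of the span
>              "end": "tok-000147"}    # last token of the span
>     # Overlapping hierarchies are no problem here: a paragraph
>     # structure can cross verse boundaries, because each structure
>     # only names its own start and end tokens.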
>
>
>
>
>> Cheers,
>>
>> Dirkjan
>>
> On Wed, Nov 3, 2010 at 11:46 AM, Nils Breunese<N.Breunese@vpro.nl>  wrote:
>
>> Weston Ruter wrote:
>>
>>> I am investigating alternative methods of storing the tokenized text of
>>> a book in a database. The text is broken up into individual tokens (e.g.
>>> word, punctuation mark, etc.), and each is assigned a separate ID and
>>> exists as a separate object. There can be hundreds of thousands of token
>>> objects. The existing method I've employed in a relational database is
>>> to have a table Token(id, data, position), where position is an integer
>>> used to ensure the tokens are rendered in the proper document order. The
>>> obvious problem with this use of "position" is with insertions and
>>> deletions, which make an update necessary on all subsequent tokens, and
>>> this is expensive, e.g. after deleting:
>>>
>>>     UPDATE Token SET position = position - 1
>>>     WHERE position > old_token_position
>>> I was hoping that CouchDB, with its support for documents containing
>>> arrays, would let me avoid an explicit position entirely and just rely
>>> on the implicit position each object has within the token array (this
>>> would also facilitate text revisions). In this approach, however, it is
>>> critical that the entire document not have to be re-loaded and re-saved
>>> when a change is made (as I imagine this would be even slower than the
>>> SQL UPDATE); I was hoping that an insertion or deletion could be done in
>>> a patch manner so that it could be done efficiently. But from asking my
>>> question on Twitter, it appears that the existing approach I took with
>>> the relational database is also what would be required by CouchDB.
>> That is correct. Storing the tokens of a book in an array in a single
>> document would require retrieving, modifying and saving the complete
>> document for a change. Storing the tokens as separate documents with an
>> increasing ID would of course involve the same kind of updating as you are
>> doing in your relational setup.
>>
>> It sounds like a linked list kind of storage scenario, where every token
>> has pointers to the previous and next token, might better fit your needs for
>> reconstructing a book from the tokens.
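>>
>> Roughly like this, with made-up field names (a toy in-memory dict
>> stands in for the database just to show the pointer bookkeeping; only
>> the new token and its two neighbours are touched per insertion):
>>
>>     db = {
>>         "tok-a": {"_id": "tok-a", "data": "In",
>>                   "prev": None, "next": "tok-b"},
>>         "tok-b": {"_id": "tok-b", "data": "beginning",
>>                   "prev": "tok-a", "next": None},
>>     }
>>
>>     def insert_after(db, prev_id, new_id, data):
>>         prev_doc = db[prev_id]
>>         db[new_id] = {"_id": new_id, "data": data,
>>                       "prev": prev_id, "next": prev_doc["next"]}
>>         if prev_doc["next"] is not None:
>>             db[prev_doc["next"]]["prev"] = new_id  # repoint successor
>>         prev_doc["next"] = new_id
>>
>>     insert_after(db, "tok-a", "tok-c", "the")  # In -> the -> beginning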
>>
> I never thought about using a linked list for this application before;
> good idea. It would certainly speed up the update process, but it would
> make retrieving all tokens for a structure between a start token and an
> end token very slow, as there would need to be a separate query for each
> token in the structure to look up the next token to retrieve.
>
>
>
>
>>> Is there a more elegant way to store my data set in CouchDB?
>> If I were to use CouchDB, I think I'd use a document per token. I'd test
>> how expensive updating the sequence IDs is (using the HTTP bulk document
>> API [0]), and depending on how often sequence updates need to happen I
>> might switch to a linked-list kind of approach. (You could use the same
>> in a relational database, of course.)
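>>
>> The bulk update itself might look like this (the database name "book"
>> and the position field are assumptions from your schema; _bulk_docs
>> is the real endpoint):
>>
>>     import requests
>>
>>     def shift_positions(docs, delta=-1):
>>         # Each doc must carry its current _id and _rev; otherwise
>>         # CouchDB reports a conflict for that document.
>>         for doc in docs:
>>             doc["position"] += delta
>>         resp = requests.post("http://localhost:5984/book/_bulk_docs",
>>                              json={"docs": docs})
>>         resp.raise_for_status()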
>>
>> Are you planning on storing more than just the tokens and their order? If
>> not, I'm wondering what the use of storing a book as a list of tokens
>> actually is. Sounds like a plain text file would do the job as well, but I'm
>> sure there is a point. :o)
>>
> As I mentioned above, metadata and related data are both going to be
> externally attached to each token from various sources, so each token
> needs to be referenced by ID. This fact alone invalidates a
> single-document approach because parts of a document can't be linked to,
> correct?
>
>
>
>>> Note that I am very new to CouchDB and am ignorant of a lot of its
>>> features.
>>
>> The Definitive Guide [1] is a nice read.
>>
> Thanks for the advice!
>
>
>
>> Nils.
>>
>> [0] http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
>> [1] http://guide.couchdb.org/
>> ------------------------------------------------------------------------
>>   VPRO
>>   phone:  +31(0)356712911
>>   e-mail: info@vpro.nl
>>   web:    www.vpro.nl
>> ------------------------------------------------------------------------
>>
>
>
