couchdb-user mailing list archives

From Ian Hobson <>
Subject Re: Using CouchDB to represent the tokenized text of a book
Date Wed, 03 Nov 2010 21:34:02 GMT
On 03/11/2010 16:36, Nils Breunese wrote:
> Weston Ruter wrote:
>> Specifically, I'm looking at books that are in a constant flux, i.e. books
>> that are being edited. The application here is for Bible translations in
>> particular, where each word token needs to be keyed into other metadata,
>> like link to source word, insertion datetime, translator, etc. Now that I
>> think of it, in order to be referencable, each token would have to exist as
>> a separate document anyway since parts of documents aren't indexed by ID, I
>> wouldn't
> That's right. You'll definitely want to use a document per token here.
I'm not sure this is right. It seems odd to treat a book that is
being translated as a sequence of words and symbols. I would expect the
translator to translate whole sentences or paragraphs at a time. For
the Bible, isn't the obvious choice the verse? This would imply two
document types:

Verses - each contains a list of dictionaries, one for each token.
Each dictionary contains the token and the notes about that token. You
might use an ordered dictionary and make the token the key. From this,
the source and target texts can be created. Each dictionary can point to
lexicon entries and carry translation notes, dates, times, translators, etc.

Lexicon - each entry is the meaning of a word in the context in which
it is used. One entry may be referenced in many places.
Translation notes would record data about inferences and implications to
ensure the correct meaning is chosen.
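
As a rough sketch of the two document types described above, the following
Python builds one lexicon document and one verse document as plain
dictionaries, the shape they would have as JSON in CouchDB. All field names
(`type`, `tokens`, `lexicon_id`, `translator`, `inserted`), the IDs, and the
sample values are hypothetical and not taken from any real schema. A list is
used for the tokens so that word order is preserved and the same word can
occur more than once in a verse.

```python
# Hypothetical lexicon entry: one meaning of a word in a given context.
lexicon_entry = {
    "_id": "lex-beginning-1",
    "type": "lexicon",
    "lemma": "beginning",
    "meaning": "the start of a period of time",
    "notes": "placeholder translation note",
}

# Hypothetical verse document: an ordered list of dictionaries, one per token.
verse = {
    "_id": "verse-01-001-001",
    "type": "verse",
    "book": 1,
    "chapter": 1,
    "verse": 1,
    "tokens": [
        {"token": "In", "translator": "example-translator",
         "inserted": "2010-11-03T21:34:02Z"},
        {"token": "the", "translator": "example-translator",
         "inserted": "2010-11-03T21:34:02Z"},
        {"token": "beginning", "translator": "example-translator",
         "inserted": "2010-11-03T21:34:02Z",
         "lexicon_id": "lex-beginning-1"},
    ],
}


def verse_text(doc):
    """Rebuild the running text of a verse from its token list."""
    return " ".join(entry["token"] for entry in doc["tokens"])


def lexicon_ids(doc):
    """Collect the lexicon entries a verse points to, in token order."""
    return [e["lexicon_id"] for e in doc["tokens"] if "lexicon_id" in e]


print(verse_text(verse))
print(lexicon_ids(verse))
```

Because the notes live inside each token's dictionary, they travel with the
verse and never lose their context.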

I rather suspect that notes about the source or target language words
and how they have been translated would be almost meaningless if
separated from the context of the verse.

If verses are given a key computed from Book No, Chapter No, and Verse 
No, then a view that presents the verses in the correct order is trivial 
to construct. If there are situations where verses need to be 
re-ordered, then you need two views and two Verse Nos (one for each 
language) so you can build the correct keys.
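
A minimal sketch of such a key, assuming each verse document carries numeric
`book`, `chapter` and `verse` fields (hypothetical names). In CouchDB the
ordering would come from a view whose map function emits the key; here a
Python sort on the same key stands in for the view, since CouchDB compares
array keys element by element. The zero-padded string form is one possible
document ID that also sorts correctly as plain text.

```python
def verse_key(book, chapter, verse):
    """Composite key, compared element by element like a CouchDB array key."""
    return [book, chapter, verse]


def verse_id(book, chapter, verse):
    """Zero-padded string form, so plain string order matches reading order."""
    return f"{book:02d}-{chapter:03d}-{verse:03d}"


# Hypothetical verse documents, deliberately out of order.
verses = [
    {"book": 2, "chapter": 1, "verse": 1},
    {"book": 1, "chapter": 10, "verse": 2},
    {"book": 1, "chapter": 2, "verse": 15},
    {"book": 1, "chapter": 2, "verse": 3},
]


def in_reading_order(docs):
    """Sort verse documents by their composite key."""
    return sorted(
        docs, key=lambda d: verse_key(d["book"], d["chapter"], d["verse"])
    )


ordered_ids = [
    verse_id(d["book"], d["chapter"], d["verse"])
    for d in in_reading_order(verses)
]
print(ordered_ids)
```

For the re-ordered case, the same functions could be applied to a second set
of verse numbers, giving one key per language.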

>> As I mentioned above, metadata and related data are both going to be
>> externally attached to each token at various sources, so each token needs
>> to be referenced by ID. This fact alone invalidates a single-document
>> approach because parts of a document can't be linked to, correct?
A list of dictionaries that include the token, and data about the token, 
will avoid this problem.

You will have the user interface problem of presenting a verse with 
words in one order, and receiving it back with new words in a new order. 
How do you get the program to match up the right notes with the right 


