Message-ID: <4CD1D54A.80004@ntlworld.com>
Date: Wed, 03 Nov 2010 21:34:02 +0000
From: Ian Hobson <ian.hobson@ntlworld.com>
To: user@couchdb.apache.org
Subject: Re: Using CouchDB to represent the tokenized text of a book

On 03/11/2010 16:36, Nils Breunese wrote:
> Weston Ruter wrote:
>
>> Specifically, I'm looking at books that are in a constant flux, i.e. books that are being edited. The application here is for Bible translations in particular, where each word token needs to be keyed into other metadata, like link to source word, insertion datetime, translator, etc. Now that I think of it, in order to be referencable, each token would have to exist as a separate document anyway since parts of documents aren't indexed by ID, I wouldn't think.
> That's right. You'll definitely want to use a document per token here.
>
I'm not sure this is right. It seems odd to treat a book that is being
translated as a sequence of words and symbols. I would expect the
translator to translate whole sentences or paragraphs at a time. For
the Bible, isn't the obvious unit the verse? That would imply two
document types:

Verses - each verse document contains a list of dictionaries, one for
each token. Each dictionary contains the token and the notes about that
token. You might use an ordered dictionary and make the token the key.
From this list, the source and target texts can be generated. Each
dictionary can point to lexicon entries and carry translation notes,
dates, times, translators, etc.

Lexicon - each entry is the meaning of a word in the context in which
it is used. One entry may be referenced in many places. Translation
notes would record data about inferences and implications, to ensure
the correct meaning is chosen.
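
As a rough sketch of the two document types (Python dictionaries
standing in for the JSON documents; every field name, ID and value here
is a made-up placeholder, not anything CouchDB requires):

```python
import json

# Hypothetical lexicon document: one meaning of a word in one context.
lexicon_entry = {
    "_id": "lexicon:beginning-1",
    "type": "lexicon",
    "meaning": "the start of a period of time",
}

# Hypothetical verse document: an ordered list of token dictionaries.
verse = {
    "_id": "verse:01-001-001",
    "type": "verse",
    "tokens": [
        {"token": "In", "lexicon": None, "translator": "translator-a"},
        {"token": "the", "lexicon": None, "translator": "translator-a"},
        {
            "token": "beginning",
            "lexicon": "lexicon:beginning-1",
            "translator": "translator-b",
            "inserted": "2010-11-03T21:34:02Z",
            "note": "placeholder translation note",
        },
    ],
}


def target_text(verse_doc):
    """Rebuild the verse text from its tokens, in list order."""
    return " ".join(entry["token"] for entry in verse_doc["tokens"])


def lexicon_ids(verse_doc):
    """Collect the IDs of the lexicon entries this verse points to."""
    return [e["lexicon"] for e in verse_doc["tokens"] if e.get("lexicon")]


print(target_text(verse))
print(json.dumps(lexicon_ids(verse)))
```

The sketch uses a plain list rather than a dictionary keyed by token,
because a verse that repeats a word would otherwise have two entries
competing for the same key.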

I rather suspect that notes about the source or target language words,
and how they have been translated, would be almost meaningless if
separated from the context of the verse.

If verses are given a key computed from Book No, Chapter No, and Verse 
No, then a view that presents the verses in the correct order is trivial 
to construct. If there are situations where verses need to be 
re-ordered, then you need two views and two Verse Nos (one for each 
language) so you can build the correct keys.
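
To illustrate the keys (a Python sketch only; in CouchDB itself the view
would be a JavaScript map function emitting a key like this, and the
field names "book", "chapter", "source_verse" and "target_verse" are my
assumptions):

```python
def verse_key(book_no, chapter_no, verse_no):
    """Composite key: compared element by element, so rows sort by
    book, then chapter, then verse."""
    return [book_no, chapter_no, verse_no]


def view_rows(docs, verse_field):
    """Mimic a view: one (key, doc id) row per verse, sorted by key."""
    rows = [
        (verse_key(d["book"], d["chapter"], d[verse_field]), d["_id"])
        for d in docs
    ]
    return sorted(rows)


# Hypothetical verses; "a" and "b" swap places in the target language.
docs = [
    {"_id": "c", "book": 1, "chapter": 2, "source_verse": 1, "target_verse": 1},
    {"_id": "a", "book": 1, "chapter": 1, "source_verse": 1, "target_verse": 2},
    {"_id": "b", "book": 1, "chapter": 1, "source_verse": 2, "target_verse": 1},
]

source_order = [doc_id for _, doc_id in view_rows(docs, "source_verse")]
target_order = [doc_id for _, doc_id in view_rows(docs, "target_verse")]
print(source_order, target_order)
```

Each call to view_rows plays the part of one of the two views: the same
documents, keyed on a different verse number.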

>> As I mentioned above, metadata and related data are both going to be externally attached to each token at various sources, so each token needs to  referenced by ID. This fact alone invalidates a single-document approach because parts of a document can't be linked to, correct?
A list of dictionaries that include the token and the data about the
token will avoid this problem, because the metadata is stored inside
the verse document alongside the token it describes.
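
If something outside the verse still has to point at one token, one
possibility is to give each token dictionary its own "id" and refer to
it as (verse ID, token ID). This is only a sketch; the "id" field and
the helper names are my assumptions, and a plain dict stands in for the
database:

```python
def find_token(verse_doc, token_id):
    """Return (position, entry) for the token with this id, or None."""
    for position, entry in enumerate(verse_doc["tokens"]):
        if entry["id"] == token_id:
            return position, entry
    return None


def resolve(db, reference):
    """Follow a (verse id, token id) reference into the stored verse."""
    verse_id, token_id = reference
    return find_token(db[verse_id], token_id)


# Hypothetical store: verse documents keyed by their _id.
db = {
    "verse:01-001-001": {
        "_id": "verse:01-001-001",
        "tokens": [
            {"id": "t1", "token": "In", "translator": "translator-a"},
            {"id": "t2", "token": "the", "translator": "translator-a"},
            {"id": "t3", "token": "beginning", "translator": "translator-b"},
        ],
    }
}

reference = ("verse:01-001-001", "t3")
position, entry = resolve(db, reference)
print(position, entry["token"], entry["translator"])

# The reference still resolves after the tokens are reordered.
db["verse:01-001-001"]["tokens"].reverse()
print(resolve(db, reference)[0])
```

A reference by list position alone would break as soon as a word is
inserted or moved, which is why the sketch carries an id per token.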

You will still have a user interface problem: presenting a verse with
words in one order, and receiving it back with new words in a new
order. How do you get the program to match up the right notes with the
right words?

Regards

Ian