From: Ian Hobson
Date: Wed, 03 Nov 2010 21:34:02 +0000
To: user@couchdb.apache.org
Subject: Re: Using CouchDB to represent the tokenized text of a book
Message-ID: <4CD1D54A.80004@ntlworld.com>
References: <4E746F43-7973-4408-BB4D-2B3672BA9A73@vpro.nl>

On 03/11/2010 16:36, Nils Breunese wrote:
> Weston Ruter wrote:
>
>> Specifically, I'm looking at books that are in a constant flux, i.e. books that are being edited. The application here is for Bible translations in particular, where each word token needs to be keyed into other metadata, like link to source word, insertion datetime, translator, etc. Now that I think of it, in order to be referencable, each token would have to exist as a separate document anyway since parts of documents aren't indexed by ID, I wouldn't think.
> That's right. You'll definitely want to use a document per token here.
>

I'm not sure this is right. It appears most odd to treat a book that is being translated as a sequence of words and symbols. I would expect the translator to translate whole sentences or paragraphs at a time. For the Bible, isn't the obvious choice the verse?

This would imply two document types:

Verses - each verse document contains a list of dictionaries, one for each token. Each dictionary contains the token and the notes about that token. You might use an ordered dictionary and make the token the key. From this, the source and target texts can be created. Each dictionary can point to lexicon entries and carry translation notes, dates, times, translators, etc.

Lexicon - each entry is the meaning of a word, in the context in which it is used. One entry may be referenced in many, many places.
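
To make the two document types concrete, here is a minimal sketch in Python using plain dictionaries. Every field name (`type`, `tokens`, `id`, `lexicon`, `translator`, `inserted`, `notes`), every `_id` value, and the sample data are assumptions made for illustration; none of them come from the thread or from CouchDB itself. The per-token `id` is one hypothetical way to let outside data point at a single token inside a verse document.

```python
# Hypothetical Lexicon document: one meaning of a word in context.
lexicon_entry = {
    "_id": "lexicon:example-1",
    "type": "lexicon",
    "word": "example-source-word",
    "meaning": "placeholder meaning",
}

# Hypothetical Verse document: an ordered list of per-token dictionaries.
verse = {
    "_id": "verse:1:1:1",
    "type": "verse",
    "book": 1,
    "chapter": 1,
    "verse": 1,
    "tokens": [
        {"id": 1, "token": "In", "lexicon": "lexicon:example-1",
         "translator": "translator-a", "inserted": "2010-11-03", "notes": ""},
        {"id": 2, "token": "the", "lexicon": None,
         "translator": "translator-a", "inserted": "2010-11-03", "notes": ""},
        {"id": 3, "token": "beginning", "lexicon": "lexicon:example-1",
         "translator": "translator-b", "inserted": "2010-11-03",
         "notes": "illustrative note"},
    ],
}


def target_text(verse_doc):
    """Rebuild the verse text from its tokens, in list order."""
    return " ".join(t["token"] for t in verse_doc["tokens"])


def find_token(verse_doc, token_id):
    """Look up one token by its (assumed) per-verse id; None if absent."""
    for t in verse_doc["tokens"]:
        if t["id"] == token_id:
            return t
    return None
```

With this shape, a reference to a token is the pair (verse `_id`, token `id`), for example `find_token(verse, 3)`, so each token stays addressable without being a document of its own.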
Translation notes would record data about inferences and implications to ensure the correct meaning is chosen.

I rather suspect that notes about the source or target language words, and how they have been translated, would be almost meaningless if separated from the context of the verse.

If verses are given a key computed from Book No, Chapter No, and Verse No, then a view that presents the verses in the correct order is trivial to construct. If there are situations where verses need to be re-ordered, then you need two views and two Verse Nos (one for each language) so you can build the correct keys.

>> As I mentioned above, metadata and related data are both going to be externally attached to each token at various sources, so each token needs to referenced by ID. This fact alone invalidates a single-document approach because parts of a document can't be linked to, correct?

A list of dictionaries that include the token, and data about the token, will avoid this problem.

You will have the user interface problem of presenting a verse with words in one order, and receiving it back with new words in a new order. How do you get the program to match up the right notes with the right words?

Regards

Ian
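
The verse key idea above (Book No, Chapter No, Verse No, with a second Verse No when the two languages order verses differently) can be sketched roughly as follows. In CouchDB the key would be emitted from a JavaScript map function and the view would return rows sorted by key; the Python below only imitates that ordering with `sorted()`. The field names `source_verse` and `target_verse`, the helper names, and the sample documents are assumptions for illustration.

```python
def verse_key(book_no, chapter_no, verse_no):
    """Compound key; lists compare element by element, left to right."""
    return [book_no, chapter_no, verse_no]


def ordered_ids(verse_docs, verse_field):
    """Return verse ids sorted by (book, chapter, chosen verse number)."""
    ranked = sorted(
        verse_docs,
        key=lambda v: verse_key(v["book"], v["chapter"], v[verse_field]),
    )
    return [v["_id"] for v in ranked]


# Hypothetical verses; v2 and v3 swap places in the target language.
verses = [
    {"_id": "v3", "book": 1, "chapter": 1, "source_verse": 3, "target_verse": 2},
    {"_id": "v1", "book": 1, "chapter": 1, "source_verse": 1, "target_verse": 1},
    {"_id": "v2", "book": 1, "chapter": 1, "source_verse": 2, "target_verse": 3},
    {"_id": "v4", "book": 1, "chapter": 2, "source_verse": 1, "target_verse": 1},
]

source_order = ordered_ids(verses, "source_verse")
target_order = ordered_ids(verses, "target_verse")
```

Each ordering corresponds to one view: one keyed on the source-language verse number and one keyed on the target-language verse number.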