From: Freddy Bowen <frederick.bowen@gmail.com>
Date: Wed, 3 Nov 2010 11:07:25 -0400
Subject: Re: Using CouchDB to represent the tokenized text of a book
To: user@couchdb.apache.org

Beware real numbers: https://issues.apache.org/jira/browse/COUCHDB-227

FB

On Wed, Nov 3, 2010 at 10:42 AM, Kevin R. Coombes wrote:

> Why not avoid rewriting the subset at all?
>
> Use position as a real number instead of an integer. An insertion between
> positions 100 and 101 can be assigned the position 100.5. Since you can
> always find something between any two positions at which to make an
> insertion, you don't have to update anything else. For deletions, you just
> have to allow gaps; the real-number positions indicate the gaps anyway. You
> recover a coherent section of the book (in CouchDB fashion) by specifying
> the startkey and endkey on a query based on the (real-number) position.
>
> Kevin
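The fractional-position trick, and the reason for Freddy's caution, can both be seen without CouchDB at all. Below is a minimal sketch in plain Python (the starting positions 100.0 and 101.0 are invented values): it keeps inserting a new token between the same two neighbours by taking midpoints, and on IEEE-754 doubles the representable values run out after a few dozen splits, which is the sort of floating-point pitfall Freddy's warning points at.

    # A minimal sketch, plain Python, of fractional positions: insert a token
    # between two neighbours by taking the midpoint of their positions.
    # Repeatedly inserting between the same pair shows where the scheme breaks.

    lo, hi = 100.0, 101.0       # positions of two adjacent tokens (invented values)
    splits = 0
    while True:
        mid = (lo + hi) / 2.0
        if mid == lo or mid == hi:   # no representable double remains between them
            break
        hi = mid                     # keep inserting immediately after `lo`
        splits += 1

    print(splits)   # prints 46 here: after ~46 nested inserts the positions collide

In practice that would mean renumbering occasionally anyway, or choosing keys with room to grow (strings, for example), once edits cluster around one spot.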
> On 11/3/2010 8:04 AM, Weston Ruter wrote:
>
>> Thanks a lot for the replies, Dirkjan and Nils. My replies are inline below:
>>
>> On Wed, Nov 3, 2010 at 11:35 AM, Dirkjan Ochtman wrote:
>>
>>> On Wed, Nov 3, 2010 at 11:16, Weston Ruter wrote:
>>>
>>>> I am investigating alternative methods of storing the tokenized text of
>>>> a book in a database. The text is broken up into individual tokens
>>>> (e.g. word, punctuation mark, etc.), and each is assigned a separate ID
>>>> and exists as a separate object. There can be hundreds of thousands of
>>>> token objects. The existing method I've employed in a relational
>>>> database is to have a table Token(id, data, position), where position
>>>> is an integer used to ensure the tokens are rendered in the proper
>>>> document order. The obvious problem with this use of "position" is with
>>>> insertions and deletions, which cause an update to be necessary on all
>>>> subsequent tokens, and this is expensive, e.g. after deleting:
>>>> UPDATE Token SET position = position - 1 WHERE position > old_token_position
>>>
>>> A "book" doesn't really sound like something that suffers from a lot
>>> of insertions and deletions in the middle...
>>
>> Specifically, I'm looking at books that are in constant flux, i.e. books
>> that are being edited. The application here is Bible translations in
>> particular, where each word token needs to be keyed to other metadata,
>> like a link to the source word, insertion datetime, translator, etc. Now
>> that I think of it, in order to be referenceable, each token would have
>> to exist as a separate document anyway, since parts of documents aren't
>> indexed by ID, I wouldn't think.
>>
>>>> I was hoping that, with CouchDB's support for documents containing
>>>> arrays, I could avoid an explicit position altogether and just rely on
>>>> the implicit position each object has in the context of where it lies
>>>> in the token array (this would also facilitate text revisions). In this
>>>> approach, however, it is critical that the entire document not have to
>>>> be re-loaded and re-saved when a change is made (as I imagine this
>>>> would be even slower than an SQL UPDATE); I was hoping that an
>>>> insertion or deletion could be done in a patch manner so that it could
>>>> be done efficiently. But from asking my question on Twitter, it appears
>>>> that the existing approach I took with the relational database is also
>>>> what would be required by CouchDB.
>>>
>>> Yeah, CouchDB doesn't support patching documents, so you'd have to
>>> update the whole document. My gut feeling says you don't want a large
>>> document here.
>>>
>>>> Is there a more elegant way to store my data set in CouchDB?
>>>
>>> It sounds like you want to come up with a kind of index value that will
>>> prevent you from having to update all the documents (but only a subset),
>>> and then use that value as a sorting bucket.
>>>
>>> For instance, in your book model, save the page number with the word,
>>> update all the other words on that page that come after it, and sort by
>>> page then word order. But this type of idea could work in either a
>>> relational database or CouchDB. If pages are still too large, you could
>>> add paragraphs (or add chapters before pages).
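As a concrete reading of Dirkjan's bucket suggestion, a sketch along the following lines could work against CouchDB's HTTP API. It is written in Python with the requests library; the database name, the view name, and the page/position/data fields are all invented for illustration, not anything specified in the thread.

    import requests

    COUCH = "http://localhost:5984/book"   # hypothetical database

    # A view keyed on [page, position]: an insertion only ever forces a
    # renumbering of the tokens on one page, and reads stay in document order.
    design = {
        "_id": "_design/tokens",
        "views": {
            "by_page_pos": {
                "map": "function (doc) { emit([doc.page, doc.position], doc.data); }"
            }
        }
    }
    requests.put(COUCH + "/_design/tokens", json=design)

    def insert_token(page, position, data):
        """Insert a token at `position` on `page`, shifting only that page's tail."""
        rows = requests.get(
            COUCH + "/_design/tokens/_view/by_page_pos",
            params={
                "startkey": "[%d,%d]" % (page, position),
                "endkey": "[%d,{}]" % page,   # {} collates after any number
                "include_docs": "true",
            },
        ).json()["rows"]

        docs = [row["doc"] for row in rows]
        for doc in docs:
            doc["position"] += 1              # shift the tail of this page
        docs.append({"page": page, "position": position, "data": data})

        # One round trip for the whole page via the bulk document API.
        requests.post(COUCH + "/_bulk_docs", json={"docs": docs})

Whether the bucket is a page, a paragraph, or a chapter only changes the shape of the key; as Weston notes next, the hard part for this data is picking any single bucket hierarchy at all.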
>> That is a good idea, but the problem with Bible translations in
>> particular is the issue of overlapping hierarchies: chapter and verse
>> don't always fall along the same divisions as section and paragraph. So
>> the data model I've been moving toward is standoff markup, where there
>> is a set of tokens (words, punctuation) for the entire book and then a
>> set of structures (paragraphs, verses, etc.) that refer to a start token
>> and an end token; getting a structure means retrieving all tokens from
>> start to end. The use of standoff markup and overlapping hierarchies
>> makes your idea of sorting buckets not feasible, I don't think. Thanks
>> for the idea though!
>>
>>> Cheers,
>>>
>>> Dirkjan
>>
>> On Wed, Nov 3, 2010 at 11:46 AM, Nils Breunese wrote:
>>
>>> Weston Ruter wrote:
>>>
>>>> [... original question quoted in full; snipped ...]
>>>
>>> That is correct. Storing the tokens of a book in an array in a single
>>> document would require retrieving, modifying and saving the complete
>>> document for a change. Storing the tokens as separate documents with an
>>> increasing ID would of course involve the same kind of updating as you
>>> are doing in your relational setup.
>>>
>>> It sounds like a linked-list kind of storage scenario, where every
>>> token has pointers to the previous and next token, might better fit
>>> your needs for reconstructing a book from the tokens.
>>
>> I never thought about using a linked list for this application before;
>> good idea. It would certainly speed up the update process, but it would
>> make retrieving all tokens for a structure between a start token and an
>> end token very slow, as there would need to be a separate query for each
>> token in the structure to look up the next token to retrieve.
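To make that trade-off concrete, here is a sketch of the two document shapes under discussion, together with the naive traversal Weston is describing. Python with the requests library; the database name, document IDs, and field names are invented for illustration.

    import requests

    COUCH = "http://localhost:5984/book"   # hypothetical database

    # Token documents as a doubly linked list (Nils's suggestion), e.g.:
    #   {"_id": "tok-0042", "data": "beginning", "prev": "tok-0041", "next": "tok-0043"}
    # Standoff structure documents (Weston's model) name only their endpoints, e.g.:
    #   {"_id": "para-007", "type": "paragraph", "start": "tok-0042", "end": "tok-0110"}

    def tokens_for_structure(structure_id):
        """Walk the linked list from a structure's start token to its end token.

        This is the cost Weston points out: every hop is a separate GET, so a
        structure spanning N tokens costs N round trips to the database.
        """
        structure = requests.get("%s/%s" % (COUCH, structure_id)).json()
        token_id = structure["start"]
        tokens = []
        while token_id:
            tok = requests.get("%s/%s" % (COUCH, token_id)).json()
            tokens.append(tok["data"])
            if token_id == structure["end"]:
                break
            token_id = tok.get("next")
        return tokens

With explicit positions (integer or fractional) the same structure comes back in one view query via startkey and endkey, which is why the thread keeps circling back to position-style keys.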
>>
>>>> Is there a more elegant way to store my data set in CouchDB?
>>>
>>> If I were to use CouchDB, I think I'd use a document per token. I'd
>>> test how expensive updating the sequence IDs is (using the HTTP bulk
>>> document API [0]) and, depending on how often sequence updates need to
>>> happen, I might switch to a linked-list kind of approach. (You could
>>> use the same in a relational database, of course.)
>>>
>>> Are you planning on storing more than just the tokens and their order?
>>> If not, I'm wondering what the use of storing a book as a list of
>>> tokens actually is. It sounds like a plain text file would do the job
>>> as well, but I'm sure there is a point. :o)
>>
>> As I mentioned above, metadata and related data are both going to be
>> externally attached to each token from various sources, so each token
>> needs to be referenced by ID. This fact alone invalidates a
>> single-document approach, because parts of a document can't be linked
>> to, correct?
>>
>>>> Note that I am very new to CouchDB and am ignorant of a lot of its
>>>> features.
>>>
>>> The Definitive Guide [1] is a nice read.
>>
>> Thanks for the advice!
>>
>>> Nils.
>>>
>>> [0] http://wiki.apache.org/couchdb/HTTP_Bulk_Document_API
>>> [1] http://guide.couchdb.org/
>>>
>>> ------------------------------------------------------------------------
>>> VPRO
>>> phone: +31(0)356712911
>>> e-mail: info@vpro.nl
>>> web: www.vpro.nl
>>> ------------------------------------------------------------------------
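A rough version of the measurement Nils suggests (timing a full renumber through the bulk document API) might look like the sketch below. Python with the requests library against a local CouchDB; the throwaway database name, the document shape, and the default document count are invented for illustration.

    import time
    import requests

    COUCH = "http://localhost:5984/book-benchmark"   # throwaway database (invented)

    def benchmark_renumber(n=10000):
        """Create n token documents, then time renumbering all of them in a
        single _bulk_docs round trip: the worst case after deleting the very
        first token. Scale n toward the real book size when measuring."""
        requests.put(COUCH)   # create the database; a 412 response means it already exists

        docs = [{"_id": "tok-%07d" % i, "position": i, "data": "word"} for i in range(n)]
        results = requests.post(COUCH + "/_bulk_docs", json={"docs": docs}).json()

        # Bulk updates need each document's current revision.
        for doc, res in zip(docs, results):
            doc["_rev"] = res["rev"]

        # Simulate deleting the first token: shift every later position down by one.
        survivors = docs[1:]
        for doc in survivors:
            doc["position"] -= 1

        started = time.time()
        requests.post(COUCH + "/_bulk_docs", json={"docs": survivors})
        return time.time() - started

If that one round trip stays cheap at realistic sizes, the plain integer-position layout may be simpler to live with than linked lists or fractional keys.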