From: Dirkjan Ochtman
Date: Wed, 3 Nov 2010 11:35:32 +0100
Subject: Re: Using CouchDB to represent the tokenized text of a book
To: user@couchdb.apache.org

On Wed, Nov 3, 2010 at 11:16, Weston Ruter wrote:
> I am investigating alternative methods of storing the tokenized text of a
> book in a database. The text is broken up into individual tokens (e.g. word,
> punctuation mark, etc.) and each is assigned a separate ID and exists as a
> separate object. There can be hundreds of thousands of token objects. The
> existing method I've employed in a relational database is to have a table
> Token(id, data, position), where position is an integer used to ensure the
> tokens are rendered in the proper document order. The obvious problem with
> this use of "position" is with insertions and deletions, which make an
> update necessary on all subsequent tokens, and this is expensive,
> e.g. after deleting:
> UPDATE Token SET position = position - 1 WHERE position > old_token_position

A "book" doesn't really sound like something that suffers from a lot of
insertions and deletions in the middle...
> I was hoping that CouchDB, with its support for documents containing arrays,
> would let me avoid an explicit position altogether and just rely on the
> implicit position each object has in the context of where it lies in the
> token array (this would also facilitate text revisions). In this approach,
> however, it is critical that the entire document not have to be re-loaded
> and re-saved when a change is made (as I imagine this would be even slower
> than SQL UPDATE); I was hoping that an insertion or deletion could be done
> in a patch manner so that it could be done efficiently. But from asking my
> question on Twitter, it appears that the existing approach I took with the
> relational database is also what would be required by CouchDB.

Yeah, CouchDB doesn't support patching documents, so you'd have to update
the whole document. My gut feeling says you don't want a large document
here.

> Is there a more elegant way to store my data set in CouchDB?

It sounds like you want to come up with a kind of index value that will
prevent you from having to update all the documents (you would still update
a subset), and then use that value as a sorting bucket. For instance, in
your book model, save the page number with the word, update all the other
words on that page that come after it, and sort by page, then word order.
But this type of idea could work in either relational databases or CouchDB.
If pages are still too large, you could add paragraphs (or add chapters
before pages).

Cheers,

Dirkjan
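
For illustration only, here is a minimal sketch of the page-bucket idea
described above, written in plain Python with no database involved. The
Token fields "page" and "pos" and the helpers insert_token and in_order are
hypothetical names chosen for this example, not part of CouchDB or of the
original schema. It assumes each token is stored as its own record and that
ordering is by (page, pos), which in CouchDB would roughly correspond to a
view keyed on [page, pos].

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int
    data: str
    page: int  # sorting bucket (hypothetical field name)
    pos: int   # order within the page only, not within the whole book


def insert_token(tokens, new):
    """Insert `new`, shifting only later tokens on the same page.

    Returns the existing tokens that were modified, i.e. the ones
    that would have to be re-saved.
    """
    touched = [t for t in tokens if t.page == new.page and t.pos >= new.pos]
    for t in touched:
        t.pos += 1
    tokens.append(new)
    return touched


def in_order(tokens):
    """Return tokens in document order: by page, then by position."""
    return sorted(tokens, key=lambda t: (t.page, t.pos))


tokens = [
    Token(1, "the", 1, 0), Token(2, "quick", 1, 1), Token(3, "fox", 1, 2),
    Token(4, "jumps", 2, 0), Token(5, "over", 2, 1), Token(6, "it", 2, 2),
]

# Insert "brown" before "fox" on page 1; page 2 is left untouched.
touched = insert_token(tokens, Token(7, "brown", 1, 2))
text = " ".join(t.data for t in in_order(tokens))
```

In this sketch only the one token after the insertion point on page 1 is
rewritten; everything on page 2 keeps its stored position. The cost of an
edit is bounded by the size of the bucket, which is why smaller buckets
(paragraphs instead of pages) reduce the number of updates further.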