From: Weston Ruter
Date: Wed, 3 Nov 2010 11:16:25 +0100
Subject: Using CouchDB to represent the tokenized text of a book
To: user@couchdb.apache.org

I am investigating alternative methods of storing the tokenized text of
a book in a database. The text is broken up into individual tokens (e.g.
word, punctuation mark, etc.), and each is assigned a separate ID and
exists as a separate object. There can be hundreds of thousands of token
objects.

The existing method I've employed in a relational database is to have a
table Token(id, data, position), where position is an integer used to
ensure the tokens are rendered in the proper document order. The obvious
problem with this use of "position" is with insertions and deletions,
which require an update on all subsequent tokens, and this is expensive.
For example, after deleting:

    UPDATE Token SET position = position - 1
    WHERE position > old_token_position

I was hoping that with CouchDB's support for documents containing
arrays, I could avoid an explicit position altogether and just rely on
the implicit position each object has within the token array (this would
also facilitate text revisions).
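
To make the two layouts concrete, here is a minimal Python sketch. It is
an illustration under stated assumptions, not something tested against
CouchDB: the standard-library sqlite3 module stands in for the
relational database, and the sample words, the delete_token helper, and
the {"_id": ..., "tokens": [...]} document shape are hypothetical.

```python
import sqlite3

# Relational layout from the message: Token(id, data, position).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Token (id INTEGER PRIMARY KEY, data TEXT, position INTEGER)"
)
words = ["It", "was", "a", "dark", "night", "."]  # sample text (assumption)
conn.executemany(
    "INSERT INTO Token (id, data, position) VALUES (?, ?, ?)",
    [(i + 1, w, i) for i, w in enumerate(words)],
)


def delete_token(conn, token_id):
    """Delete one token, close the gap, and return how many rows were renumbered."""
    (old_token_position,) = conn.execute(
        "SELECT position FROM Token WHERE id = ?", (token_id,)
    ).fetchone()
    conn.execute("DELETE FROM Token WHERE id = ?", (token_id,))
    cur = conn.execute(
        "UPDATE Token SET position = position - 1 WHERE position > ?",
        (old_token_position,),
    )
    return cur.rowcount


# Deleting the second token forces every later row to be rewritten.
shifted = delete_token(conn, 2)
sql_order = [
    row[0] for row in conn.execute("SELECT data FROM Token ORDER BY position")
]

# Array layout: one document (hypothetical shape) whose list order is the
# position, so no position field has to be stored or renumbered.
doc = {
    "_id": "book",
    "tokens": [{"id": i + 1, "data": w} for i, w in enumerate(words)],
}
doc["tokens"] = [t for t in doc["tokens"] if t["id"] != 2]
array_order = [t["data"] for t in doc["tokens"]]
```

Here `shifted` counts the rows touched by the renumbering UPDATE, which
grows with the number of tokens after the deleted one. The array version
needs no renumbering in memory, but the sketch says nothing about how
much of the document has to be written back to the database.
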
In this approach, however, it is critical that the entire document not
have to be re-loaded and re-saved when a change is made (as I imagine
this would be even slower than the SQL UPDATE); I was hoping that an
insertion or deletion could be applied as a patch so that it could be
done efficiently. But from asking my question on Twitter, it appears
that the approach I took with the relational database is also what
CouchDB would require.

Is there a more elegant way to store my data set in CouchDB?

Note that I am very new to CouchDB and am ignorant of a lot of its
features.

Thanks!
Weston

Previous conversation on Twitter:

So let's say I have a huge CouchDB document, like an object with
millions of properties. Are updates efficient i.e. can be patches?
http://twitter.com/#!/westonruter/status/29401798345

@westonruter you can use an _update function to update that saving
wire-transport. But a doc that large sounds like wrong architecture.
http://twitter.com/#!/CouchDB/status/29410715036

@CouchDB I'm looking to represent the text of a book, where each word
(token) is a discrete object in an ordered set. Best way to represent?
http://twitter.com/#!/westonruter/status/29415056023

@westonruter knee-jerk idea: import each word as a separate document and
store the position number with it. Use a view to sort.
http://twitter.com/#!/CouchDB/status/29415298940

@westonruter but there may be smarter ways to do that, best to ask on
the user@couchdb.apache.org mailing list: http://bit.ly/agv3ye
http://twitter.com/#!/CouchDB/status/29415412524

@CouchDB That's exactly what I was hoping to avoid: storing token
positions, and instead just using the objects' implicit array positions.
http://twitter.com/#!/westonruter/status/29415975571

@westonruter Documents are atomic, you PUT the whole document each
revision. (Can you use a view on a bunch of smaller documents instead?)
http://twitter.com/#!/natevw/status/29415990407

@westonruter got it. the megadoc may work out, but _update still reads
it from disk into memory fully, so it's not "ideal".
http://twitter.com/#!/CouchDB/status/29416178621

@westonruter I might be the wrong tool for the job. -- But do write the
mailing list to see what the others come up with :)
http://twitter.com/#!/CouchDB/status/29416215398

@westonruter OTOH if you don't care to make views on the data at all,
you could split it into doc attachments (which can be PUT separately).
http://twitter.com/#!/natevw/status/29416254893

-- 
Weston Ruter
http://weston.ruter.net/
@westonruter - Google Profile