From: Damien Katz
To: couchdb-user@incubator.apache.org
Subject: Re: PUT / POST tradeoffs
Date: Sun, 16 Mar 2008 20:42:01 -0400
Message-Id: <0080D71A-A0DC-42CB-A950-44E8308DDFAB@gmail.com>

On Mar 16, 2008, at 7:44 PM, Chris Anderson wrote:

> Couchers,
>
> I've been diving into CouchDB lately, and seeing how it's a great fit
> for my application. I've run into some questions about how to record
> information in an efficient way. Here's a description of what I'm
> trying to do, and a couple of methods I've come up with. Hopefully
> someone on the list can give me some insight into how to determine what
> the pros and cons of each approach are.
>
> Let's say I'm crawling the web, looking for embeds of YouTube videos
> on blogs and such. When I come across one, I'll be recording:
>
> the YouTube video URL.
> the URL the video was embedded on.
> a snippet of the context in which it was found.
>
> In the end I'd like to give people the ability to see where their
> videos are being embedded. E.g. start from a video and find the embeds
> of it from across the web.
>
> I'll be recrawling some blogs quite frequently, so I have this idea
> about how to avoid duplicate content:
>
> I calculate an MD5 hash from the information I want to store in a
> deterministic way, so processing the same page twice creates identical
> computed hash values. I use the hash values as document_ids, and PUT
> the data to CouchDB, with no _rev attribute. CouchDB will reject the
> PUTs of duplicate data with a conflict. In my application I just
> ignore the conflict, as all it means is that I've already put that
> data there (maybe in an earlier crawl).
>
> The alternative approach is to forgo the MD5 hash calculation, and
> POST the parsed data into CouchDB, creating a new record with an
> arbitrary id. I imagine that I would end up with a lot of identical
> data in this case, and it would become the job of the
> Map/Combine/Reduce process to filter duplicates while creating the
> lookup indexes.
>
> I suppose my question boils down to this: Are there unforeseen costs
> to building a high percentage of failing PUTs into my application
> design? It seems like the most elegant way to ensure unique data. But
> perhaps I am putting too high a premium on unique data - I suppose in
> the end it depends on the cost to compute a conflict, vs the ongoing
> cost of calculating reductions across redundant data sets.

I don't see any problems with this approach. For your purposes, using
MD5 hashes of data should work just fine.

>
>
> Thanks for any insights!
> Chris
>
> --
> Chris Anderson
> http://jchris.mfdz.com
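
For reference, the hash-as-document-id idea discussed in this thread could
be sketched roughly as follows in Python. This is only an illustration
under stated assumptions, not code from either poster: the function names
(embed_doc_id, put_once), the field names, and the database URL are all
hypothetical, and the sketch assumes the server reports a duplicate PUT
with HTTP 409 Conflict, as current CouchDB releases do (older releases may
have used a different status code).

```python
import hashlib
import json
import urllib.error
import urllib.request


def embed_doc_id(video_url, page_url, snippet):
    """Derive a deterministic document id from the fields being stored."""
    # Fixed key order and separators, so identical input always
    # serializes to identical bytes and therefore the same hash.
    canonical = json.dumps(
        {"video_url": video_url, "page_url": page_url, "snippet": snippet},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def put_once(db_url, doc, send=urllib.request.urlopen):
    """PUT doc under its hash id, with no _rev.

    Returns True if the document was created, False if the server
    reported a conflict (the same data was stored earlier).
    `send` is injectable so the logic can be exercised without a server.
    """
    doc_id = embed_doc_id(doc["video_url"], doc["page_url"], doc["snippet"])
    request = urllib.request.Request(
        f"{db_url}/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    try:
        send(request).close()
        return True
    except urllib.error.HTTPError as err:
        # Assumption: 409 means "document already exists"; ignore it.
        if err.code == 409:
            return False
        raise
```

A crawler would call put_once("http://localhost:5984/embeds", doc) for each
embed it finds (the URL is a placeholder) and treat a False result as
"already recorded in an earlier crawl".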