From: "Chris Anderson"
Sender: jchris@gmail.com
To: couchdb-user@incubator.apache.org
Date: Sun, 16 Mar 2008 16:44:18 -0700
Subject: PUT / POST tradeoffs

Couchers,

I've been diving into CouchDB lately and seeing how it's a great fit for my application, but I've run into some questions about how to record information efficiently. Here's a description of what I'm trying to do and a couple of methods I've come up with. Hopefully someone on the list can give me some insight into the pros and cons of each approach.

Let's say I'm crawling the web, looking for embeds of YouTube videos on blogs and such. When I come across one, I'll be recording:

- the YouTube video URL
- the URL the video was embedded on
- a snippet of the context in which it was found

In the end I'd like to give people the ability to see where their videos are being embedded, e.g. start from a video and find the embeds of it from across the web.

I'll be recrawling some blogs quite frequently, so I have an idea about how to avoid duplicate content: I calculate an MD5 hash over the information I want to store, serialized deterministically, so that processing the same page twice produces an identical hash. I use the hash as the document id and PUT the data to CouchDB with no _rev attribute. CouchDB will reject a PUT of duplicate data with a conflict; in my application I just ignore the conflict, since all it means is that I've already stored that data (maybe in an earlier crawl).
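To make that concrete, here's a rough sketch in Python of what I mean; the host, database name, and field names are just placeholders, not anything I'm committed to:

    import hashlib
    import json
    import urllib.error
    import urllib.request

    DB = "http://localhost:5984/embeds"  # placeholder database URL

    def record_embed(video_url, embed_url, snippet):
        doc = {"video_url": video_url,
               "embed_url": embed_url,
               "snippet": snippet}
        # Serialize with sorted keys so identical data always yields
        # identical bytes, and therefore an identical MD5 document id.
        body = json.dumps(doc, sort_keys=True).encode("utf-8")
        doc_id = hashlib.md5(body).hexdigest()
        req = urllib.request.Request(
            DB + "/" + doc_id,
            data=body,
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        try:
            urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code != 409:  # 409 Conflict: we already have this data
                raise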
The alternative approach is to forgo the MD5 hash calculation and simply POST the parsed data to CouchDB, creating a new document with an arbitrary id. I imagine I'd end up with a lot of identical data in that case, and it would become the job of the Map/Combine/Reduce process to filter out duplicates while building the lookup indexes (see the sketch in the P.S.).

I suppose my question boils down to this: are there unforeseen costs to building a high percentage of failing PUTs into my application design? It seems like the most elegant way to ensure unique data. But perhaps I'm putting too high a premium on uniqueness - I suppose in the end it depends on the cost of computing a conflict versus the ongoing cost of calculating reductions across redundant data sets.

Thanks for any insights!

Chris

--
Chris Anderson
http://jchris.mfdz.com
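P.S. To make the second approach concrete as well, here's the sort of view I imagine doing the de-duplication. Again, the design doc, view, and field names are placeholders, and pushing the view from Python is just for the sake of the sketch:

    import json
    import urllib.request

    DB = "http://localhost:5984/embeds"  # placeholder database URL

    # The map keys each row by (video URL, embed URL), so duplicate
    # crawls of the same page land on the same key; the reduce counts
    # how many copies were stored.
    design = {
        "_id": "_design/embeds",
        "views": {
            "by_video": {
                "map": ("function(doc) {"
                        "  emit([doc.video_url, doc.embed_url], 1);"
                        "}"),
                "reduce": ("function(keys, values, rereduce) {"
                           "  return sum(values);"
                           "}"),
            }
        },
    }

    req = urllib.request.Request(
        DB + "/_design/embeds",
        data=json.dumps(design).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

Querying by_video with group=true would then return one row per unique (video, embed) pair, no matter how many duplicate documents got POSTed.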