From: "Chris Anderson"
Sender: jchris@gmail.com
To: couchdb-user@incubator.apache.org
Date: Sun, 16 Mar 2008 16:44:18 -0700
Subject: PUT / POST tradeoffs

Couchers,

I've been diving into CouchDB lately and seeing how it's a great fit for my application, but I've run into some questions about how to record information efficiently. Here's a description of what I'm trying to do and a couple of methods I've come up with. Hopefully someone on the list can give me some insight into the pros and cons of each approach.

Let's say I'm crawling the web, looking for embeds of YouTube videos on blogs and such. When I come across one, I'll be recording:

- the YouTube video URL
- the URL the video was embedded on
- a snippet of the context in which it was found

In the end I'd like to give people the ability to see where their videos are being embedded, e.g. start from a video and find the embeds of it from across the web.

I'll be recrawling some blogs quite frequently, so I have an idea about how to avoid duplicate content: I calculate an MD5 hash over the information I want to store, serialized deterministically, so that processing the same page twice produces an identical hash. I use the hash as the document id and PUT the data to CouchDB with no _rev attribute. CouchDB will reject a PUT of duplicate data with a conflict; in my application I just ignore the conflict, since all it means is that I've already stored that data (maybe in an earlier crawl).
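To make that concrete, here's a rough sketch in Python of what I mean; the host, database name, and field names are just placeholders, not anything I'm committed to:

    import hashlib
    import json
    import urllib.error
    import urllib.request

    DB = "http://localhost:5984/embeds"  # placeholder database URL

    def record_embed(video_url, embed_url, snippet):
        doc = {"video_url": video_url,
               "embed_url": embed_url,
               "snippet": snippet}
        # Serialize with sorted keys so identical data always yields
        # identical bytes, and therefore an identical MD5 document id.
        body = json.dumps(doc, sort_keys=True).encode("utf-8")
        doc_id = hashlib.md5(body).hexdigest()
        req = urllib.request.Request(
            DB + "/" + doc_id,
            data=body,
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        try:
            urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            if err.code != 409:  # 409 Conflict: we already have this data
                raise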
The alternative approach is to forgo the MD5 hash calculation and simply POST the parsed data to CouchDB, creating a new document with an arbitrary id. I imagine I'd end up with a lot of identical data in that case, and it would become the job of the Map/Combine/Reduce process to filter out duplicates while building the lookup indexes (see the sketch in the P.S.).

I suppose my question boils down to this: are there unforeseen costs to building a high percentage of failing PUTs into my application design? It seems like the most elegant way to ensure unique data. But perhaps I'm putting too high a premium on uniqueness - I suppose in the end it depends on the cost of computing a conflict versus the ongoing cost of calculating reductions across redundant data sets.

Thanks for any insights!

Chris

--
Chris Anderson
http://jchris.mfdz.com
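P.S. To make the second approach concrete as well, here's the sort of view I imagine doing the de-duplication. Again, the design doc, view, and field names are placeholders, and pushing the view from Python is just for the sake of the sketch:

    import json
    import urllib.request

    DB = "http://localhost:5984/embeds"  # placeholder database URL

    # The map keys each row by (video URL, embed URL), so duplicate
    # crawls of the same page land on the same key; the reduce counts
    # how many copies were stored.
    design = {
        "_id": "_design/embeds",
        "views": {
            "by_video": {
                "map": ("function(doc) {"
                        "  emit([doc.video_url, doc.embed_url], 1);"
                        "}"),
                "reduce": ("function(keys, values, rereduce) {"
                           "  return sum(values);"
                           "}"),
            }
        },
    }

    req = urllib.request.Request(
        DB + "/_design/embeds",
        data=json.dumps(design).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

Querying by_video with group=true would then return one row per unique (video, embed) pair, no matter how many duplicate documents got POSTed.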