couchdb-user mailing list archives

From "Chris Anderson" <>
Subject PUT / POST tradeoffs
Date Sun, 16 Mar 2008 23:44:18 GMT

I've been diving into CouchDB lately, and seeing how it's a great fit
for my application. I've run into some questions about how to record
information in an efficient way. Here's a description of what I'm
trying to do, and a couple of methods I've come up with. Hopefully
someone on the list can give me some insight into how to weigh the
pros and cons of each approach.

Let's say I'm crawling the web, looking for embeds of YouTube videos
on blogs and such. When I come across one, I'll be recording:

the YouTube video URL.
the URL the video was embedded on.
a snippet of the context in which it was found.

In the end I'd like to give people the ability to see where their
videos are being embedded. E.g. start from a video and find the embeds
of it from across the web.

I'll be recrawling some blogs quite frequently, so I have this idea
about how to avoid duplicate content:

I calculate an MD5 hash from the information I want to store, in a
deterministic way, so processing the same page twice produces
identical hash values. I use the hash values as document ids, and PUT
the data to CouchDB with no _rev attribute. CouchDB will reject the
PUTs of duplicate data with a conflict. In my application I just
ignore the conflict, as all it means is that I've already put that
data there (maybe in an earlier crawl).
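For the sake of discussion, here's a rough Python sketch of that
first approach. The field names are illustrative, not anything
CouchDB prescribes; the key point is serializing the record
deterministically before hashing, so recrawls map to the same id:

```python
import hashlib
import json

def embed_doc_id(video_url, page_url, snippet):
    """Derive a deterministic document id from the embed data.

    Sorting keys and fixing separators makes the JSON serialization
    stable, so crawling the same page twice yields the same id.
    """
    payload = json.dumps(
        {"video_url": video_url, "page_url": page_url, "snippet": snippet},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# The crawler would then PUT the record to e.g.
#   http://localhost:5984/embeds/<doc_id>
# The first PUT succeeds; a PUT of the same content on a later crawl
# hits the existing id and comes back as a 409 Conflict, which the
# crawler can simply ignore.
```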

The alternative approach is to forgo the MD5 hash calculation, and
POST the parsed data into CouchDB, creating a new record with an
arbitrary id. I imagine that I would end up with a lot of identical
data in this case, and it would become the job of the
Map/Combine/Reduce process to filter duplicates while creating the
lookup indexes.
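CouchDB views are written in JavaScript, but to make the comparison
concrete, here's a Python sketch of the deduplication work that
view would have to do under the second approach: key each record by
a content hash in the map step, then collapse duplicate keys. Field
names are again just illustrative:

```python
import hashlib
import json

def map_fn(doc):
    """Map step: emit each record keyed by a hash of its content."""
    content = json.dumps(
        {k: doc[k] for k in ("video_url", "page_url", "snippet")},
        sort_keys=True,
        separators=(",", ":"),
    )
    key = hashlib.md5(content.encode("utf-8")).hexdigest()
    yield key, doc

def dedupe(docs):
    """Collapse duplicate records, keeping one per content key,
    the way a reduce over the map output would."""
    seen = {}
    for doc in docs:
        for key, value in map_fn(doc):
            seen.setdefault(key, value)
    return list(seen.values())
```

The difference from the PUT approach is where the cost lands: here
the duplicates are all stored, and this collapsing work is paid every
time the index is (re)built over the redundant rows.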

I suppose my question boils down to this: Are there unforeseen costs
to building a high percentage of failing PUTs into my application
design? It seems like the most elegant way to ensure unique data. But
perhaps I am putting too high a premium on unique data - I suppose in
the end it depends on the cost of computing a conflict vs. the ongoing
cost of calculating reductions across redundant data sets.

Thanks for any insights!

Chris Anderson
