incubator-couchdb-user mailing list archives

From Damien Katz <damienk...@gmail.com>
Subject Re: PUT / POST tradeoffs
Date Mon, 17 Mar 2008 00:42:01 GMT

On Mar 16, 2008, at 7:44 PM, Chris Anderson wrote:

> Couchers,
>
> I've been diving into CouchDB lately, and seeing how it's a great fit
> for my application. I've run into some questions about how to record
> information in an efficient way. Here's a description of what I'm
> trying to do, and a couple of methods I've come up with. Hopefully
> someone on the list can give me some insight into how to determine
> the pros and cons of each approach are.
>
> Let's say I'm crawling the web, looking for embeds of YouTube videos
> on blogs and such. When I come across one, I'll be recording:
>
> the YouTube video URL.
> the URL the video was embedded on.
> a snippet of the context in which it was found.
>
> In the end I'd like to give people the ability to see where their
> videos are being embedded. E.g. start from a video and find the embeds
> of it from across the web.
>
> I'll be recrawling some blogs quite frequently, so I have this idea
> about how to avoid duplicate content:
>
> I calculate an MD5 hash from the information I want to store in a
> deterministic way, so processing the same page twice creates identical
> computed hash values. I use the hash values as document_ids, and PUT
> the data to CouchDB, with no _rev attribute. CouchDB will reject the
> PUTs of duplicate data with a conflict. In my application I just
> ignore the conflict, as all it means is that I've already put that
> data there (maybe in an earlier crawl).
>
> The alternative approach is to forgo the MD5 hash calculation, and
> POST the parsed data into CouchDB, creating a new record with an
> arbitrary id. I imagine that I would end up with a lot of identical
> data in this case, and it would become the job of the
> Map/Combine/Reduce process to filter duplicates while creating the
> lookup indexes.
>
> I suppose my question boils down to this: Are there unforeseen costs
> to building a high percentage of failing PUTs into my application
> design? It seems like the most elegant way to ensure unique data. But
> perhaps I am putting too high a premium on unique data - I suppose in
> the end it depends on the cost of computing a conflict vs. the ongoing
> cost of calculating reductions across redundant data sets.

I don't see any problems with this approach. For your purposes, using  
MD5 hashes of the data should work just fine.
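
For reference, here is a rough sketch of that idempotent-PUT pattern in
Python. The names are made up for illustration: it assumes a CouchDB
instance on localhost:5984 and a database called "embeds".

import hashlib
import json
import http.client

def put_embed(video_url, page_url, snippet):
    # Serialize with a fixed key order so the same page always
    # produces the same bytes, and therefore the same MD5 / _id.
    doc = {"video_url": video_url, "page_url": page_url,
           "snippet": snippet}
    body = json.dumps(doc, sort_keys=True).encode("utf-8")
    doc_id = hashlib.md5(body).hexdigest()

    # PUT with no _rev: CouchDB answers 201 Created the first time
    # it sees this id, and 409 Conflict on every recrawl.
    conn = http.client.HTTPConnection("localhost", 5984)
    conn.request("PUT", "/embeds/" + doc_id, body,
                 {"Content-Type": "application/json"})
    status = conn.getresponse().status
    conn.close()
    return status in (201, 409)  # both mean "the data is there"

Treating 201 and 409 as the same outcome is exactly what makes the
recrawl idempotent.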
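
And if you did go the POST route instead, the dedupe could live in a
view: key each row by the video URL plus a content hash stored on the
document, then query with group=true so duplicates collapse to one row.
A sketch along the same hypothetical lines (the field name
"content_md5" and the view name are made up):

import json
import http.client

design = {
    "_id": "_design/embeds",
    "views": {
        "unique_by_video": {
            # One row per distinct (video_url, content_md5) pair
            # when the view is queried with group=true.
            "map": "function(doc) {"
                   "  emit([doc.video_url, doc.content_md5], 1);"
                   "}",
            "reduce": "function(keys, values, rereduce) {"
                      "  return sum(values);"
                      "}"
        }
    }
}

conn = http.client.HTTPConnection("localhost", 5984)
conn.request("PUT", "/embeds/_design/embeds",
             json.dumps(design).encode("utf-8"),
             {"Content-Type": "application/json"})
print(conn.getresponse().status)  # 201 on first install
conn.close()

Querying from startkey=[video_url] to endkey=[video_url, {}] with
group=true then gives you every unique embed of one video.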

>
>
> Thanks for any insights!
> Chris
>
> -- 
> Chris Anderson
> http://jchris.mfdz.com

