incubator-couchdb-user mailing list archives

From "Chris Anderson" <jch...@mfdz.com>
Subject Re: PUT / POST tradeoffs
Date Mon, 17 Mar 2008 01:11:56 GMT
Nuno,

I guess the relative cost vs. the assurance of no collisions makes the
MD5 worth it for me, especially as it is computed at the edges, so it
should parallelize well. Perhaps I am underestimating the power of CRC
for these purposes...
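
To put rough numbers on it, this is what I'd be computing either way
(just a sketch; the record fields are placeholders for my crawl data):

    import hashlib
    import zlib

    # Placeholder record; in practice this comes out of the crawler.
    record = {
        "video_url": "http://youtube.com/watch?v=example",
        "embed_url": "http://example-blog.com/post/1",
        "snippet": "...surrounding context...",
    }

    # Serialize deterministically so the same page always yields the same bytes.
    canonical = "\n".join(record[k] for k in sorted(record)).encode("utf-8")

    crc = zlib.crc32(canonical) & 0xffffffff   # 32 bits: collisions plausible at crawl scale
    md5 = hashlib.md5(canonical).hexdigest()   # 128 bits: collisions effectively impossible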


Thanks, Damien, for the assurances.
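
For the archives, here's roughly the PUT flow I described, sketched in
Python (the requests library, database URL, and function names are my
own placeholders, and error handling is trimmed):

    import hashlib
    import json
    import requests

    COUCH = "http://localhost:5984/embeds"  # placeholder database URL

    def save_embed(record):
        # Deterministic id: parsing the same page twice yields the same _id.
        canonical = "\n".join(record[k] for k in sorted(record)).encode("utf-8")
        doc_id = hashlib.md5(canonical).hexdigest()

        resp = requests.put(COUCH + "/" + doc_id, data=json.dumps(record))
        if resp.status_code == 201:
            return "created"
        if resp.status_code == 409:
            # Conflict: this exact embed was already stored on an earlier crawl.
            return "duplicate"
        resp.raise_for_status()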

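For contrast, if I went the POST route instead, the dedup work would
move into a view. Again just a sketch, with placeholder names (the map
and reduce functions are JavaScript strings inside a design document):

    import json
    import requests

    COUCH = "http://localhost:5984/embeds"  # placeholder database URL

    # Key the view on the content fields, so duplicate docs collapse onto
    # one key; querying with group=true shows each unique embed and a count.
    design = {
        "_id": "_design/dedup",
        "views": {
            "by_content": {
                "map": "function(doc) {"
                       "  emit([doc.video_url, doc.embed_url, doc.snippet], 1);"
                       "}",
                "reduce": "function(keys, values) { return sum(values); }",
            }
        },
    }

    requests.put(COUCH + "/_design/dedup", data=json.dumps(design))
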
Chris

On Sun, Mar 16, 2008 at 5:46 PM, Nuno Job <nunojobpinto@gmail.com> wrote:
> Totally off topic:
>
>  If you're using the MD5 as a CRC, why not simply use a CRC?
>
>  Just a thought :)
>
>  Have a nice weekend
>
>
>
>  On Mon, Mar 17, 2008 at 12:42 AM, Damien Katz <damienkatz@gmail.com> wrote:
>
>  >
>  > On Mar 16, 2008, at 7:44 PM, Chris Anderson wrote:
>  >
>  > > Couchers,
>  > >
>  > > I've been diving into CouchDB lately, and seeing how it's a great fit
>  > > for my application. I've run into some questions about how to record
>  > > information in an efficient way. Here's a description of what I'm
>  > > trying to do, and a couple of methods I've come up with. Hopefully
>  > > someone on the list can give me some insight into how to determine what
>  > > the pros and cons of each approach are.
>  > >
>  > > Let's say I'm crawling the web, looking for embeds of YouTube videos
>  > > on blogs and such. When I come across one, I'll be recording:
>  > >
>  > > the YouTube video URL.
>  > > the URL the video was embedded on.
>  > > a snippet of the context in which it was found.
>  > >
>  > > In the end I'd like to give people the ability to see where their
>  > > videos are being embedded. E.g. start from a video and find the embeds
>  > > of it from across the web.
>  > >
>  > > I'll be recrawling some blogs quite frequently, so I have this idea
>  > > about how to avoid duplicate content:
>  > >
>  > > I calculate an MD5 hash from the information I want to store in a
>  > > deterministic way, so processing the same page twice creates identical
>  > > computed hash values. I use the hash values as document_ids, and PUT
>  > > the data to CouchDB, with no _rev attribute. CouchDB will reject the
>  > > PUTs of duplicate data with a conflict. In my application I just
>  > > ignore the conflict, as all it means is that I've already put that
>  > > data there (maybe in an earlier crawl).
>  > >
>  > > The alternative approach is to forgo the MD5 hash calculation, and
>  > > POST the parsed data into CouchDB, creating a new record with an
>  > > arbitrary id. I imagine that I would end up with a lot of identical
>  > > data in this case, and it would become the job of the
>  > > Map/Combine/Reduce process to filter duplicates while creating the
>  > > lookup indexes.
>  > >
>  > > I suppose my question boils down to this: Are there unforeseen costs
>  > > to building a high percentage of failing PUTs into my application
>  > > design? It seems like the most elegant way to ensure unique data. But
>  > > perhaps I am putting too high a premium on unique data - I suppose in
>  > > the end it depends on the cost of computing a conflict vs. the ongoing
>  > > cost of calculating reductions across redundant data sets.
>  >
>  > I don't see any problems with this approach. For your purposes, using
>  > MD5 hashes of data should work just fine.
>  >
>  > >
>  > >
>  > > Thanks for any insights!
>  > > Chris
>  > >
>  > > --
>  > > Chris Anderson
>  > > http://jchris.mfdz.com
>  >
>  >
>
>
>  --
>  Nuno Job
>  IBM DB2 Student Ambassador [http://caos.di.uminho.pt/~db2]
>  Open Source Support Center Member [http://caos.di.uminho.pt]
>  Blog [http://nunojob.wordpress.com] LinkedIn [http://www.linkedin.com/in/njpinto]
>



-- 
Chris Anderson
http://jchris.mfdz.com
