couchdb-user mailing list archives

From "Nuno Job" <nunojobpi...@gmail.com>
Subject Re: PUT / POST tradeoffs
Date Mon, 17 Mar 2008 00:46:00 GMT
Totally off topic:

If you're using MD5 as a CRC, why not simply use a CRC?

Just a thought :)
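
To make the tradeoff concrete, here's a rough sketch (assuming Python's
hashlib and zlib; the sample key is made up):

    import hashlib
    import zlib

    key = b"http://youtube.com/watch?v=abc|http://example.com/post"

    # MD5 gives a 128-bit digest: collisions are effectively impossible
    # for dedup keys, at a higher compute cost.
    print(hashlib.md5(key).hexdigest())                  # 32 hex chars

    # CRC32 is only 32 bits: much cheaper to compute, but with millions
    # of documents accidental collisions become a real risk if it is
    # used as a document id.
    print(format(zlib.crc32(key) & 0xFFFFFFFF, "08x"))   # 8 hex chars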

Have a nice weekend

On Mon, Mar 17, 2008 at 12:42 AM, Damien Katz <damienkatz@gmail.com> wrote:

>
> On Mar 16, 2008, at 7:44 PM, Chris Anderson wrote:
>
> > Couchers,
> >
> > I've been diving into CouchDB lately, and seeing how it's a great fit
> > for my application. I've run into some questions about how to record
> > information in an efficient way. Here's a description of what I'm
> > trying to do, and a couple of methods I've come up with. Hopefully
> > someone on the list can give me some insight into how to determine what
> > the pros and cons of each approach are.
> >
> > Let's say I'm crawling the web, looking for embeds of YouTube videos
> > on blogs and such. When I come across one, I'll be recording:
> >
> > the YouTube video URL.
> > the URL the video was embedded on.
> > a snippet of the context in which it was found.
> >
> > In the end I'd like to give people the ability to see where their
> > videos are being embedded. E.g. start from a video and find the embeds
> > of it from across the web.
> >
> > I'll be recrawling some blogs quite frequently, so I have this idea
> > about how to avoid duplicate content:
> >
> > I calculate an MD5 hash from the information I want to store in a
> > deterministic way, so processing the same page twice creates identical
> > hash values. I use the hash values as document ids and PUT the data
> > to CouchDB with no _rev attribute. CouchDB will reject the PUTs of
> > duplicate data with a conflict. In my application I just ignore the
> > conflict, as all it means is that I've already put that data there
> > (maybe in an earlier crawl). (A sketch of this appears after the
> > quoted message below.)
> >
> > The alternative approach is to forgo the MD5 hash calculation and
> > POST the parsed data into CouchDB, creating a new record with an
> > arbitrary id. I imagine that I would end up with a lot of identical
> > data in this case, and it would become the job of the
> > Map/Combine/Reduce process to filter duplicates while creating the
> > lookup indexes. (This, too, is sketched after the quoted message.)
> >
> > I suppose my question boils down to this: Are there unforeseen costs
> > to building a high percentage of failing PUTs into my application
> > design? It seems like the most elegant way to ensure unique data. But
> > perhaps I am putting too high a premium on unique data - I suppose in
> > the end it depends on the cost of computing a conflict vs. the ongoing
> > cost of calculating reductions across redundant data sets.
>
> I don't see any problems with this approach. For your purposes, using
> MD5 hashes of data should work just fine.
>
> >
> >
> > Thanks for any insights!
> > Chris
> >
> > --
> > Chris Anderson
> > http://jchris.mfdz.com
>
>
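
For what it's worth, a minimal sketch of the hash-as-id PUT approach
Chris describes (assuming Python with the requests library and a CouchDB
database at http://localhost:5984/embeds; the database and field names
are illustrative, not anything from the thread):

    import hashlib
    import json

    import requests

    DB = "http://localhost:5984/embeds"  # illustrative database name

    def record_embed(video_url, page_url, snippet):
        # Serialize deterministically (sorted keys, fixed separators) so
        # the same page hashes identically on every crawl.
        doc = {"video_url": video_url, "page_url": page_url,
               "snippet": snippet}
        key = json.dumps(doc, sort_keys=True, separators=(",", ":"))
        doc_id = hashlib.md5(key.encode("utf-8")).hexdigest()

        # PUT with the hash as _id and no _rev: 201 means a new document
        # was created, 409 Conflict means this embed was already stored
        # on an earlier crawl, so we simply ignore it.
        resp = requests.put(DB + "/" + doc_id, json=doc)
        if resp.status_code == 409:
            return "duplicate"
        resp.raise_for_status()
        return "created"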
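
And the POST alternative, with a map/reduce view that collapses the
duplicates at query time (again only a sketch; the design document and
view names are my invention):

    import requests

    DB = "http://localhost:5984/embeds"

    # POST lets CouchDB assign an arbitrary _id, so recrawling the same
    # page simply creates another copy of the document.
    requests.post(DB, json={"video_url": "http://youtube.com/watch?v=abc",
                            "page_url": "http://example.com/post",
                            "snippet": "..."})

    # A view keyed on (video_url, page_url) with the built-in _sum
    # reduce; querying it with ?group=true yields one row per unique
    # embed, plus a count of how many times it was seen.
    design = {
        "_id": "_design/dedup",
        "views": {
            "by_embed": {
                "map": "function(doc) {"
                       " emit([doc.video_url, doc.page_url], 1); }",
                "reduce": "_sum",
            }
        },
    }
    requests.put(DB + "/_design/dedup", json=design)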


-- 
Nuno Job
IBM DB2 Student Ambassador [http://caos.di.uminho.pt/~db2]
Open Source Support Center Member [http://caos.di.uminho.pt]
Blog [http://nunojob.wordpress.com] LinkedIn [http://www.linkedin.com/in/njpinto]
