couchdb-user mailing list archives

From Chris Anderson <jch...@apache.org>
Subject Re: couchdb search engine
Date Mon, 10 Aug 2009 15:25:12 GMT
On Sun, Aug 9, 2009 at 8:20 AM, Julian Moritz<mailings@julianmoritz.de> wrote:
> Hi there,
>
> I am very new to couchdb but highly interested. I work for the
> nlp-department of my university and maybe couchdb would be good choice
> for a search engine/web crawler storage.
>
> Is there a project which implements such a thing on couchdb?
>

I first got into CouchDB by using it as part of a web spider. I used
Nutch / Hadoop to run the actual crawl (with depth=1, so it was merely
fetching all the URLs in a long list I'd give it).

Then I'd use Hadoop to run a Ruby job over all the fetched pages,
which parsed each one (HTML / XML / mp3, etc.), converted it into a
JSON document, and put it in CouchDB.
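
As a rough illustration of that parsing step, here is a minimal sketch in
Python rather than the Ruby I actually used. The document layout (a "links"
list, a "type" field, the page URL as "_id") and the names LinkExtractor and
page_to_doc are assumptions made up for this example, not what the original
job produced, and it only handles HTML.

```python
import json
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the page title and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_to_doc(url, html):
    """Turn one fetched HTML page into a JSON-ready dict (hypothetical layout)."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    # Resolve relative links against the page URL, drop duplicates, sort.
    links = sorted({urljoin(url, href) for href in parser.links})
    return {"_id": url, "type": "page", "title": parser.title.strip(), "links": links}


doc = page_to_doc(
    "http://example.com/a",
    "<html><head><title>Page A</title></head>"
    '<body><a href="/b">b</a> <a href="http://example.org/">x</a></body></html>',
)
print(json.dumps(doc, indent=2))
```

Storing the result would then be an HTTP PUT of this JSON to the database,
which is left out here to keep the sketch self-contained.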

Then I used CouchDB map reduce to find all the inlinks for each page,
and do various other kinds of analysis, as well as to find the list of
URLs that we learned about in the last crawl that we hadn't fetched
yet, for driving the next round of crawl.
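
To make the inlink and next-round logic concrete, here is a small in-memory
sketch in Python. Real CouchDB views are normally written as JavaScript map
and reduce functions, so this only mimics the idea: the map step emits one
(target, source) pair per link, and grouping by key plays the role of the
view index. The "links" field and the function names are assumptions for
this example, not the views I actually ran.

```python
from collections import defaultdict


def map_inlinks(doc):
    """Map step: emit (target_url, source_url) for each outgoing link."""
    for target in doc.get("links", []):
        yield target, doc["_id"]


def group_by_key(pairs):
    """Stand-in for the view index: collect all values under each key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sorted(values) for key, values in grouped.items()}


# Two fetched pages (hypothetical sample data).
docs = [
    {"_id": "http://example.com/a",
     "links": ["http://example.com/b", "http://example.com/c"]},
    {"_id": "http://example.com/b",
     "links": ["http://example.com/c"]},
]

# Inlinks per page: every key is a link target, every value a linking page.
inlinks = group_by_key(pair for doc in docs for pair in map_inlinks(doc))

# A reduce step could simply count the inlinks for each target.
inlink_counts = {url: len(sources) for url, sources in inlinks.items()}

# URLs we learned about but have not fetched yet drive the next crawl round.
fetched = {doc["_id"] for doc in docs}
frontier = sorted(url for url in inlinks if url not in fetched)

print(inlinks)
print(inlink_counts)
print(frontier)
```

The frontier list is what would be handed to the crawler as the URL list for
the next depth=1 run.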

You could do something like this a lot more simply with Disco and
CouchDB, I think, but you'd probably end up writing more of the code.

Chris




-- 
Chris Anderson
http://jchrisa.net
http://couch.io
