couchdb-user mailing list archives

From Julian Moritz <maili...@julianmoritz.de>
Subject Re: couchdb search engine
Date Mon, 10 Aug 2009 18:48:51 GMT
Hi,

Chris Anderson wrote:
> On Sun, Aug 9, 2009 at 8:20 AM, Julian Moritz<mailings@julianmoritz.de> wrote:
>> Hi there,
>>
>> I am very new to CouchDB but highly interested. I work for the
>> NLP department of my university, and maybe CouchDB would be a good
>> choice for a search engine / web crawler storage.
>>
>> Is there a project which implements such a thing with CouchDB?
>>
> 
> I first got into CouchDB using it as part of a web-spider. I used
> Nutch / Hadoop to run the actual crawl (with depth=1, so it was merely
> fetching all the URLs in a long list I'd give it).
> 
> Then I'd use Hadoop to run a Ruby job over all the fetched pages,
> which parsed the HTML / XML / mp3 etc, converting it into a JSON
> document and putting it in CouchDB.
> 
> Then I used CouchDB map reduce to find all the inlinks for each page,
> and do various other kinds of analysis, as well as to find the list of
> URLs that we learned about in the last crawl that we hadn't fetched
> yet, for driving the next round of crawl.
> 

Okay, as I wrote, I'm studying natural language processing. My
department has some experience with crawling the web. Let's break
it down:

1st: you need bandwidth. Crawling from a single machine is more or
less useless, but a distributed crawling application is not a problem
if you get enough people to run it.

2nd: you need even more storage. A highly (horizontally) scalable
database would be helpful.

Why CouchDB for the 1st point? The client software could be written in
any language, because data is sent to the storage as JSON over HTTP
(see the sketch below).
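
Just to illustrate what I mean, a minimal client sketch in Python (the
database URL http://localhost:5984/crawl and the field names are only
assumptions, not an existing project):

import json
import urllib.parse
import urllib.request

COUCH_DB = "http://localhost:5984/crawl"  # assumed CouchDB database URL

def store_page(url, html_text):
    # Store one fetched page as a JSON document, using the
    # (percent-encoded) page URL as the document id.
    doc = {"type": "page", "url": url, "body": html_text}
    req = urllib.request.Request(
        COUCH_DB + "/" + urllib.parse.quote(url, safe=""),
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # e.g. {"ok": True, "id": ..., "rev": ...}

Any language with an HTTP library could do the same, which is the
whole point.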

Why CouchDB for the 2nd point? Well, isn't CouchDB exactly what you
need there?

And for fast crawling you need a list of URLs in random order. You
could extract every URL from every document and sort them by a random
key (done with a view, sketched below). This is _very_ important: fast
crawling without random ordering would amount to a DoS attack on some
sites. Enforcing uniqueness on the list of URLs would make it too slow
once the list gets big.
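
Roughly like this (the design document name, view name and document
fields are made up; and since CouchDB map functions should be
deterministic, the sketch keys each unfetched URL by a simple hash of
the URL instead of Math.random(), which keeps the index stable but
still shuffles the order):

import json
import urllib.request

design_doc = {
    "_id": "_design/crawler",
    "views": {
        "random_urls": {
            # Ordinary CouchDB JavaScript, stored as a string: emit each
            # unfetched URL under a deterministic pseudo-random key.
            "map": """
function (doc) {
  if (doc.type === 'url' && !doc.fetched) {
    var h = 0;
    for (var i = 0; i < doc.url.length; i++) {
      h = (h * 31 + doc.url.charCodeAt(i)) % 1000000007;
    }
    emit(h, doc.url);
  }
}
"""
        }
    },
}

req = urllib.request.Request(
    "http://localhost:5984/crawl/_design/crawler",  # assumed database name
    data=json.dumps(design_doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(req)

Querying /crawl/_design/crawler/_view/random_urls?limit=100 would then
hand each crawler a batch of URLs in pseudo-random order.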

And for fast search you need a special data structure called a
wordlist (an inverted index).

In each line the first column is the word and the following columns
are the documents which contain the word, something like:

house	document_1 document_2
mouse	document_2 document_3

which could be done with a simple view (sketched below).
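
For example (again only a sketch, the field names are assumptions): a
map function that emits one row per (word, document) pair, stored
under e.g. _design/search as a view called wordlist:

# CouchDB JavaScript map function for a hypothetical "wordlist" view.
WORDLIST_MAP = """
function (doc) {
  if (doc.type === 'page' && doc.body) {
    var words = doc.body.toLowerCase().split(/[^a-z0-9]+/);
    var seen = {};
    for (var i = 0; i < words.length; i++) {
      var w = words[i];
      if (w && !seen[w]) {
        seen[w] = true;
        emit(w, doc._id);
      }
    }
  }
}
"""

The view then holds the wordlist above as many rows sharing the same
key, and ?key="house" pulls out exactly the documents containing
"house".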

So everyone could contribute some disk space for storing data and some
bandwidth for crawling, and everyone could write his/her own website
for searching the data, since it is exposed as JSON over HTTP (a query
example follows below).
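
A search frontend would then be nothing more than an HTTP GET, e.g. in
Python (assuming the wordlist view sketched above and a database
called crawl):

import json
import urllib.parse
import urllib.request

# Fetch the ids of all documents containing the word "house".
word = "house"
view_url = (
    "http://localhost:5984/crawl/_design/search/_view/wordlist?key="
    + urllib.parse.quote(json.dumps(word))  # view keys are JSON-encoded
)
with urllib.request.urlopen(view_url) as resp:
    rows = json.load(resp)["rows"]
print([row["value"] for row in rows])  # list of matching document ids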

Just my thoughts; maybe I'm totally wrong, in which case please
correct me.

Best regards
Julian

> You could do something like this a lot more simply with Disco and
> CouchDB, I think, but you'd probably end up writing more of the code.
> 
> Chris
> 
> 
> 
> 
