couchdb-user mailing list archives

From: Chris Anderson <jch...@apache.org>
Subject: Re: couchdb search engine
Date: Mon, 10 Aug 2009 19:23:29 GMT
On Mon, Aug 10, 2009 at 11:48 AM, Julian Moritz <mailings@julianmoritz.de> wrote:
> Hi,
>
> Chris Anderson schrieb:
>> On Sun, Aug 9, 2009 at 8:20 AM, Julian Moritz <mailings@julianmoritz.de> wrote:
>>> Hi there,
>>>
>>> I am very new to CouchDB but highly interested. I work for the
>>> NLP department of my university, and maybe CouchDB would be a good
>>> choice for a search engine / web crawler storage.
>>>
>>> Is there a project which implements such a thing on top of CouchDB?
>>>
>>
>> I first got into CouchDB using it as part of a web spider. I used
>> Nutch / Hadoop to run the actual crawl (with depth=1, so it was merely
>> fetching all the URLs in a long list I'd give it).
>>
>> Then I'd use Hadoop to run a Ruby job over all the fetched pages,
>> which parsed the HTML / XML / mp3 etc., converting each into a JSON
>> document and putting it in CouchDB.
>>
>> Then I used CouchDB map reduce to find all the inlinks for each page,
>> and do various other kinds of analysis, as well as to find the list of
>> URLs that we learned about in the last crawl that we hadn't fetched
>> yet, for driving the next round of crawl.
>>
>
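
To make that part a bit more concrete: the views involved are nothing
exotic. Assuming each fetched page ends up as a JSON doc with a "url"
and a "links" array of outgoing URLs (those field names are just an
assumption for the sketch), an inlinks view and a "what's left to
fetch" view are only a few lines each. Untested sketch:

  // inlinks: key = target URL, value = a page that links to it
  function(doc) {
    if (doc.url && doc.links) {
      for (var i = 0; i < doc.links.length; i++) {
        emit(doc.links[i], doc.url);
      }
    }
  }

  // crawl frontier: emit 1 for a URL we fetched, 0 for a URL we've
  // only seen as a link; with the built-in _sum reduce and group=true,
  // any URL whose total is 0 has never been fetched
  function(doc) {
    if (doc.url) {
      emit(doc.url, 1);
      for (var i = 0; doc.links && i < doc.links.length; i++) {
        emit(doc.links[i], 0);
      }
    }
  }

You still have to skim the grouped rows for the zeros on the client
side, but the index does the heavy lifting.
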
> Okay, as I wrote, I'm studying natural language processing. My
> department has some experience with crawling the web. Let's break
> it down:
>
> 1st: you need bandwidth. Crawling from a single point is more or less
> useless, but a distributed application for crawling is not a problem
> if you have enough people using it.
>
> 2nd: you need even more storage. A highly (horizontally) scalable
> database would be helpful.
>
> Why CouchDB for the 1st point? The client software could be written
> in any language, because data is sent to the storage over plain HTTP.
>
> Why CouchDB for the 2nd point? Well, isn't CouchDB exactly what you
> need there?
>
> And for fast crawling you need a list of URLs which is randomly
> sorted. So you extract every URL from every document and sort them by
> a random key (done with a view). This is _very_ important: fast
> crawling without random ordering would amount to a DoS attack on some
> sites. Enforcing uniqueness on the list of URLs would make it too
> slow with a big list.
>
> And for fast search you need a special data structure called a
> wordlist (an inverted index).
>
> In each line the first column is the word and the following columns
> are the documents which contain the word. Something like:
>
> house   document_1 document_2
> mouse   document_2 document_3
>
> which could be done with a simple view.
>
> So everyone could contribute some space for storing the data and some
> bandwidth for crawling, and everyone could write his/her own website
> for searching the data, since it is exposed as JSON over HTTP.
>
> Just my thoughts; maybe I'm totally wrong, in which case please
> correct me.
>

It sounds like you are on the right track. I think if I were going to
build a p2p web spider I'd just buckle down and write some Erlang to
handle the URL fetching, at least. I'm not sure how to handle the
random ordering. You could punt on that and just keep an in-memory
queue for each host with a per-host throttle.
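
That said, one way to get a usable random ordering straight out of a
view might be to key on a cheap deterministic hash of each URL rather
than a real random number. The order is then stable across index
rebuilds, and URLs from the same host get scattered through the index,
so walking it in key order spreads the load across sites. Untested
sketch (the "links" field is an assumption about the doc schema):

  // key = hash of the URL (djb2-style), value = the URL itself
  function(doc) {
    if (doc.links) {
      for (var i = 0; i < doc.links.length; i++) {
        var url = doc.links[i], h = 5381;
        for (var j = 0; j < url.length; j++) {
          h = (h * 33 + url.charCodeAt(j)) % 1000000007;
        }
        emit(h, url);
      }
    }
  }

A nice side effect: duplicates of the same URL hash to the same key,
so they land next to each other in the view and are trivial to skip
while streaming, without any global uniqueness pass.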

In Erlang you could spawn a supervisor per host to crawl, and it could
schedule the fetches.

As far as Couch goes, I'd put the couches local to the URL fetchers,
build the views locally too, and then merge at query time with
couchdb-lounge.
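
Your wordlist is basically an inverted index, and it also maps
directly onto a view: one row per (word, document) pair, queried with
?key="house" to get every document containing "house", with the lounge
merging those rows from each shard at query time. Naive sketch -- no
stemming, stop words or ranking, and the "text" field is an
assumption:

  // map: emit each distinct word in the extracted text once per doc
  function(doc) {
    if (doc.text) {
      var words = doc.text.toLowerCase().split(/[^a-z0-9]+/);
      var seen = {};
      for (var i = 0; i < words.length; i++) {
        var w = words[i];
        if (w && !seen[w]) {
          seen[w] = true;
          emit(w, doc._id);
        }
      }
    }
  }

Relevance ranking is a different story, but for the plain word ->
documents mapping this is all it takes.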

> Best regards
> Julian
>
>> You could do something like this a lot more simply with Disco and
>> CouchDB, I think, but you'd probably end up writing more of the code.
>>
>> Chris
>>
>>
>>
>>
>



-- 
Chris Anderson
http://jchrisa.net
http://couch.io
