lucene-java-user mailing list archives

From Tim Williams <william...@gmail.com>
Subject Re: crawler questions..
Date Thu, 05 Mar 2009 00:02:05 GMT
On Wed, Mar 4, 2009 at 4:41 PM, Grant Ingersoll <gsingers@apache.org> wrote:
> You might have a look at Droids (http://incubator.apache.org/droids/) or
> Nutch (http://lucene.apache.org/nutch) and their communities.  They are much
> more focused on crawling (not to say there aren't people here who crawl,
> just saying those projects are (mostly) about crawling)
>
>
> On Mar 4, 2009, at 4:30 PM, bruce wrote:
>
>> Hi...
>>
>> Sorry that this is a bit off track. Ok, maybe way off track!
>>
>> But I don't have anyone to bounce this off of..
>>
>> I'm working on a crawling project, crawling a college website, to extract
>> course/class information. I've built a quick test app in Python to crawl
>> the site. I crawl at the top level, and work my way down to getting the
>> required course/class schedule. The app works. I can consistently run it
>> and extract the information.
>>
>> My issue is that, now that I have a "basic" app that works, I need to
>> figure out how to guarantee that I'm crawling the site correctly. How do I
>> know when I've hit an error at a given node/branch, so that the app knows
>> it is not going to fetch the underlying branches/nodes of the tree?
>>
>> How do I know when I have a complete "tree"?
>>
>> I'm looking for someone, or some group/prof that I can talk to about these
>> issues. My goal is to eventually look at using nutch/lucene if at all
>> applicable.
>>
>> Any pointers, or people, or papers, etc... would be helpful.

The ir-book[1] also has a section on the subject, if you haven't already
seen it.
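
On the "how do I know the tree is complete" question, one possible approach
(a rough sketch, not taken from the book or from any particular crawler) is
to record an outcome for every URL the crawl touches, so a failed fetch is
kept as "subtree unknown" instead of silently looking like a leaf. The names
crawl, fetch, extract_links, SITE and fake_fetch below are all hypothetical,
and the in-memory SITE dict only stands in for real HTTP requests:

```python
from collections import deque


def crawl(root, fetch, extract_links):
    """Breadth-first crawl that records an outcome for every URL it sees.

    fetch(url) returns page content or raises an exception;
    extract_links(content) returns an iterable of child URLs.
    Both are hypothetical callables supplied by the caller.
    """
    status = {}            # url -> "ok" or "error: ..."
    parent = {root: None}  # url -> URL that first linked to it
    queue = deque([root])
    while queue:
        url = queue.popleft()
        try:
            content = fetch(url)
        except Exception as exc:
            # The subtree under this URL is unknown, not empty.
            status[url] = f"error: {exc}"
            continue
        status[url] = "ok"
        for child in extract_links(content):
            if child not in parent:  # skip URLs already queued or visited
                parent[child] = url
                queue.append(child)
    failed = [u for u, s in status.items() if s != "ok"]
    return status, parent, failed


# Toy in-memory "site" standing in for real HTTP fetches (an assumption).
SITE = {
    "/": ["/dept/math", "/dept/cs"],
    "/dept/math": ["/dept/math/101"],
    "/dept/math/101": [],
    "/dept/cs": None,  # simulates a page that fails to load
}


def fake_fetch(url):
    links = SITE.get(url)
    if links is None:
        raise IOError("fetch failed")
    return links


status, parent, failed = crawl("/", fake_fetch, lambda content: content)
complete = not failed  # True only if every discovered URL was fetched
```

With something like this, the tree is "complete" only relative to the links
the crawl actually discovered: an empty failed list means every discovered
node was fetched, and a non-empty one tells you which branches to retry.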

--tim

[1] - http://nlp.stanford.edu/IR-book/html/htmledition/web-crawling-and-indexes-1.html
