lucene-java-user mailing list archives

From "bruce" <>
Subject crawler questions..
Date Wed, 04 Mar 2009 21:30:01 GMT

Sorry that this is a bit off track. Ok, maybe way off track!

But I don't have anyone to bounce this off of..

I'm working on a crawling project, crawling a college website to extract
course/class information. I've built a quick test app in Python to crawl the
site. I start at the top level and work my way down to the required
course/class schedule. The app works. I can consistently run it and extract
the information.

My issue is that, now that I have a "basic" app that works, I need to figure
out how to guarantee that I'm crawling the site correctly. How do I know when
I've hit an error at a given node/branch, so that the app knows it's not
going to fetch the underlying branches/nodes of the tree?

How do I know when I have a complete "tree"?
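
To make the question concrete, here is a rough sketch of the kind of
bookkeeping I have in mind. It is only an illustration, not my actual app:
`fetch` is a hypothetical function that returns the child links of a page and
raises an exception when the page can't be fetched or parsed, and the names
`crawl`, `failed_nodes` and `is_complete` are made up for this example.

```python
from collections import deque


def crawl(root, fetch):
    """Breadth-first crawl that records a status for every node it reaches.

    `fetch(url)` is a hypothetical callable: it returns the list of child
    URLs for a page and raises an exception if the page cannot be fetched
    or parsed.
    """
    status = {}  # url -> "ok" or "error: <reason>"
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if url in status:
            continue  # already visited via another path
        try:
            children = fetch(url)
        except Exception as exc:
            # Remember the failure; the subtree under this node is unknown.
            status[url] = f"error: {exc}"
            continue
        status[url] = "ok"
        queue.extend(child for child in children if child not in status)
    return status


def failed_nodes(status):
    """Nodes whose subtrees were never explored."""
    return [url for url, state in status.items() if state != "ok"]


def is_complete(status):
    """True only if every node that was reached was also fetched cleanly."""
    return not failed_nodes(status)
```

With something like this, the failed nodes mark exactly where the tree is
unknown, and a later run could retry just those. It only catches errors that
`fetch` actually raises, though; a page that loads fine but is missing links
would still look "ok".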

I'm looking for someone, or some group/prof that I can talk to about these
issues. My goal is to eventually look at using Nutch/Lucene if at all possible.

Any pointers, or people, or papers, etc... would be helpful.

