lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "bruce" <bedoug...@earthlink.net>
Subject crawler questions..
Date Wed, 04 Mar 2009 21:30:01 GMT
Hi...

Sorry that this is a bit off track. Ok, maybe way off track!

But I don't have anyone to bounce this off of..

I'm working on a crawling project, crawling a college website, to extract
course/class information. I've built a quick test app in python to crawl the
site. I crawl at the top level, and work my way down to getting the required
course/class schedule. The app works. I can consistently run it and extract
the information.

My issue is now that I have a "basic" app that works, i need to figure out
how I guarantee that I'm correctly crawling the site. How do I know when
I've got an error at a given node/branch, so that the app knows that it's
not going to fetch the underlying branch/nodes of the tree..

How do I know when I have a complete "tree"!

I'm looking for someone, or some group/prof that I can talk to about these
issues. My goal is to eventually look at using nutch/lucene if at all
applicable.

Any pointers, or people, or papers, etc... would be helpful.

Thanks





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message