lucene-java-user mailing list archives

From Grant Ingersoll
Subject Re: crawler questions..
Date Wed, 04 Mar 2009 21:41:59 GMT
You might have a look at Droids or Nutch and their communities. They
are much more focused on crawling. That's not to say there aren't people
here who crawl, just that those projects are (mostly) about crawling.

On Mar 4, 2009, at 4:30 PM, bruce wrote:

> Hi...
> Sorry that this is a bit off track. Ok, maybe way off track!
> But I don't have anyone to bounce this off of..
> I'm working on a crawling project, crawling a college website to extract
> course/class information. I've built a quick test app in Python to crawl
> the site. I crawl at the top level and work my way down to the required
> course/class schedule. The app works: I can consistently run it and
> extract the information.
> My issue is that, now that I have a "basic" app that works, I need to
> figure out how to guarantee that I'm crawling the site correctly. How do
> I know when I've hit an error at a given node/branch, so that the app
> knows it's not going to fetch the underlying branches/nodes of the tree?
> How do I know when I have a complete "tree"?
> I'm looking for someone, or some group/prof, that I can talk to about
> these issues. My goal is to eventually look at using Nutch/Lucene, if at
> all applicable.
> Any pointers, or people, or papers, etc. would be helpful.
> Thanks
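
On the "how do I know when a branch failed" question, one common approach
is to record an outcome for every URL the crawler touches, so that a failed
node is remembered as "subtree unknown" rather than silently treated as
empty. A minimal sketch, not tied to Nutch or Droids: the names crawl,
fetch, status, parent and failed are all made up for illustration, and
fetch stands in for whatever page-fetching and link-extraction code the
existing app already has.

```python
from collections import deque


def crawl(root, fetch):
    """Breadth-first crawl that records an outcome for every URL it visits.

    `fetch(url)` is a hypothetical callable: it returns the list of child
    URLs found on the page, or raises an exception if the page cannot be
    fetched or parsed.
    """
    status = {}             # url -> "ok" or "error: <reason>"
    parent = {root: None}   # url -> the url that first linked to it
    queue = deque([root])

    while queue:
        url = queue.popleft()
        try:
            children = fetch(url)
        except Exception as exc:
            # Everything below this node is unknown, not empty.
            status[url] = f"error: {exc}"
            continue
        status[url] = "ok"
        for child in children:
            if child not in parent:   # skip URLs already discovered
                parent[child] = url
                queue.append(child)

    # Nodes whose subtrees were never explored.
    failed = [u for u, s in status.items() if s != "ok"]
    return status, parent, failed


# Usage with a fake three-page site; "/cs" is linked but cannot be fetched.
site = {"/": ["/math", "/cs"], "/math": ["/math/101"], "/math/101": []}


def fake_fetch(url):
    if url not in site:
        raise IOError("HTTP 500")
    return site[url]


status, parent, failed = crawl("/", fake_fetch)
print(failed)
```

Under these assumptions the crawl is complete, relative to the links it
discovered, exactly when failed is empty. Otherwise failed lists the nodes
to retry, and parent shows where each one hangs in the tree. This cannot
detect pages that the site never links to.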

Grant Ingersoll