From: Grant Ingersoll
To: java-user@lucene.apache.org
Subject: Re: crawler questions..
Date: Wed, 4 Mar 2009 16:41:59 -0500

You might have a look at Droids (http://incubator.apache.org/droids/) or Nutch (http://lucene.apache.org/nutch) and their communities. They are much more focused on crawling (not to say there aren't people here who crawl, just that those projects are (mostly) about crawling).

On Mar 4, 2009, at 4:30 PM, bruce wrote:

> Hi...
>
> Sorry that this is a bit off track. Ok, maybe way off track!
>
> But I don't have anyone to bounce this off of..
>
> I'm working on a crawling project, crawling a college website, to extract
> course/class information. I've built a quick test app in python to crawl the
> site. I crawl at the top level, and work my way down to getting the required
> course/class schedule. The app works. I can consistently run it and extract
> the information.
>
> My issue is, now that I have a "basic" app that works, I need to figure out
> how I guarantee that I'm correctly crawling the site. How do I know when
> I've got an error at a given node/branch, so that the app knows that it's
> not going to fetch the underlying branch/nodes of the tree..
>
> How do I know when I have a complete "tree"!
>
> I'm looking for someone, or some group/prof that I can talk to about these
> issues. My goal is to eventually look at using nutch/lucene if at all
> applicable.
>
> Any pointers, or people, or papers, etc... would be helpful.
>
> Thanks
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search