Date: Wed, 4 Mar 2009 19:02:05 -0500
Subject: Re: crawler questions..
From: Tim Williams
To: java-user@lucene.apache.org

On Wed, Mar 4, 2009 at 4:41 PM, Grant Ingersoll wrote:
> You might have a look at Droids (http://incubator.apache.org/droids/) or
> Nutch (http://lucene.apache.org/nutch) and their communities. They are much
> more focused on crawling (not to say there aren't people here who crawl,
> just saying those projects are (mostly) about crawling)
>
> On Mar 4, 2009, at 4:30 PM, bruce wrote:
>
>> Hi...
>>
>> Sorry that this is a bit off track. Ok, maybe way off track!
>>
>> But I don't have anyone to bounce this off of..
>>
>> I'm working on a crawling project, crawling a college website, to extract
>> course/class information. I've built a quick test app in python to crawl
>> the site. I crawl at the top level, and work my way down to getting the
>> required course/class schedule. The app works. I can consistently run it
>> and extract the information.
>>
>> My issue is now that I have a "basic" app that works, i need to figure out
>> how I guarantee that I'm correctly crawling the site.
>> How do I know when I've got an error at a given node/branch, so that the
>> app knows that it's not going to fetch the underlying branch/nodes of the
>> tree..
>>
>> How do I know when I have a complete "tree"!
>>
>> I'm looking for someone, or some group/prof that I can talk to about these
>> issues. My goal is to eventually look at using nutch/lucene if at all
>> applicable.
>>
>> Any pointers, or people, or papers, etc... would be helpful.

The IR-book [1] also has a section on the subject, if you haven't already seen it.

--tim

[1] - http://nlp.stanford.edu/IR-book/html/htmledition/web-crawling-and-indexes-1.html
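
On the quoted question of knowing when a node failed and whether the "tree" is complete: one possible approach is to record an outcome for every URL the crawler touches, so that a failed fetch is stored as an error instead of being silently treated as a leaf. Below is a minimal Python sketch of that idea, not anything taken from Nutch or Droids. The names crawl, is_complete, fake_fetch and the in-memory SITE dict are all hypothetical, and SITE stands in for real HTTP requests so the example runs without a network.

```python
from collections import deque


def crawl(root, fetch):
    """Breadth-first crawl that records an outcome for every node.

    `fetch(url)` is a hypothetical callable: it returns the list of child
    URLs found on a page, or raises when the page cannot be fetched/parsed.
    """
    status = {}                # url -> "ok" or "error: <reason>"
    queue = deque([root])
    while queue:
        url = queue.popleft()
        if url in status:      # already visited; also guards against loops
            continue
        try:
            children = fetch(url)
        except Exception as exc:
            # The subtree below this node is unknown, not empty.
            status[url] = f"error: {exc}"
            continue
        status[url] = "ok"
        queue.extend(children)
    return status


def is_complete(status):
    """Treat the tree as complete only if no visited node failed."""
    return all(s == "ok" for s in status.values())


# Hypothetical in-memory site used instead of real HTTP requests.
SITE = {
    "/": ["/dept/cs", "/dept/math"],
    "/dept/cs": ["/dept/cs/101"],
    "/dept/cs/101": [],
    # "/dept/math" is deliberately missing, so fetching it fails.
}


def fake_fetch(url):
    return SITE[url]           # raises KeyError for unknown pages


result = crawl("/", fake_fetch)
failed = sorted(u for u, s in result.items() if s != "ok")
print("complete:", is_complete(result))
print("failed nodes:", failed)
```

The failed list names the branches whose children were never fetched, which is where a retry or a manual check would start. A real crawler would also need to tell "page has no children" apart from "parser found nothing because the layout changed", which this sketch does not attempt.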