From: "bruce"
Reply-To: java-user@lucene.apache.org
Subject: lucene...
Date: Wed, 21 Jun 2006 09:21:28 -0700
Message-ID: <0d4b01c6954e$c19da680$0301a8c0@Mesa.com>

Hi,

After reading through the docs for Lucene and Nutch, I'm trying to straighten out how it all fits together. I want to crawl a portion of a web site in order to extract information from it, and it appears that this would work. However, I'm not sure whether I need Lucene, Nutch, or both. I don't need indexing, since I'm not going to be running search queries, at least not initially.

I'm also trying to understand what gets returned when I "crawl" a portion of a site. Do I get a series of HTML files? A database of information? Something else?

What I'd like is to take a given URL, say www.foo.com, and crawl through a portion of that site. Once I have the returned information, I'd like to extract certain pieces of it based on the DOM of each page. If the crawler's output is in a text-file format, I can easily write a parsing function to go through the files and pull out what I need.

Can someone give me some insight into whether Lucene/Nutch is the way to go for this project?

Thanks,
-bruce
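
P.S. To make "crawl through a portion of the site" concrete, here is a rough sketch of the behaviour being asked about. It is not Nutch or Lucene code and says nothing about what either tool returns; it is plain Python with only the standard library. The names `LinkCollector`, `links_in_scope` and `crawl`, the `/docs/` path under www.foo.com, and the caller-supplied `fetch` function are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag seen in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def links_in_scope(page_url, html, scope_prefix):
    """Return absolute, fragment-free links in html that start with scope_prefix."""
    collector = LinkCollector()
    collector.feed(html)
    in_scope = []
    for href in collector.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(scope_prefix) and absolute not in in_scope:
            in_scope.append(absolute)
    return in_scope


def crawl(start_url, scope_prefix, fetch, max_pages=50):
    """Breadth-first crawl restricted to scope_prefix.

    fetch(url) must return the page's HTML as a string; it is supplied by
    the caller (hypothetical here), so this sketch does no network I/O itself.
    Returns a dict mapping each visited URL to its raw HTML.
    """
    seen = {start_url}
    queue = [start_url]
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        pages[url] = html
        for link in links_in_scope(url, html, scope_prefix):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

In this sketch the result is simply a mapping from URL to raw HTML, which could then be handed to whatever DOM-based extraction step comes next. A real crawler would also need politeness delays, robots.txt handling and error handling, none of which are shown.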