Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (asf.osuosl.org: local policy)
Reply-To: <bedouglas@earthlink.net>
From: "bruce" <bedouglas@earthlink.net>
To: <java-user@lucene.apache.org>
Subject: lucene...
Date: Wed, 21 Jun 2006 09:21:28 -0700
Message-ID: <0d4b01c6954e$c19da680$0301a8c0@Mesa.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Importance: Normal

hi...

after reading through the docs for lucene/nutch, i'm trying to straighten
out how it all works...

if i want to crawl through a portion of a web site for the purpose of
extracting information, it appears that this would work. however, i'm not
sure if i need lucene/nutch or both.. i don't need to do indexing, as i'm
not going to be doing any query searching, at least not initially...

i'm also trying to understand just what gets returned when i 'crawl' a
portion of a site.. do i get information back in a series of html files.. do
i get a db of information, just what do i get..??

i'm looking at being able to take a given url www.foo.com, and to be able to
crawl through a portion of the site.. need to figure out how to accomplish
this... and once i have the returned information (if it's in a file/txt
format) i'd like to be able to extract certain information based upon the
DOM of the page... if the returned information from the 'crawler' is of a
textfile format, i can easily create a parsing function to go through the
files and generate the information...

can someone provide me with insight as to whether lucene/nutch is the way to
go with this project..

thanks

-bruce
=


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org