lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "bruce" <bedoug...@earthlink.net>
Subject lucene...
Date Wed, 21 Jun 2006 16:21:28 GMT
hi...

after reading through the docs for lucene/nutch, i'm trying to straighten
out how it all works...

if i want to crawl through a portion of a web site for the purpose of
extracting information, it appears that this would work. however, i'm not
sure if i need lucene/nutch or both.. i don't need to do indexing, as i'm
not going to be doing any query searching, at least not initially...

i'm also trying to understand just what gets returned when i 'crawl' a
portion of a site.. do i get information back in a series of html files.. do
i get a db of information, just what do i get..??

i'm looking at being able to take a given url www.foo.com, and to be able to
crawl through a portion of the site.. need to figure out how to accomplish
this... and once i have the returned information (if it's in a file/txt
format) i'd like to be able to extract certain information based upon the
DOM of the page... if the returned information from the 'crawler' is of a
textfile format, i can easily create a parsing function to go through the
files and generate the information...

can someone provide me with insight as to whether lucene/nutch is the way to
go with this project..

thanks

-bruce
=


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message