lucene-java-user mailing list archives

From "bruce" <bedoug...@earthlink.net>
Subject use of lucene app..
Date Wed, 21 Jun 2006 01:13:38 GMT
hi..

is there a way to configure lucene so that it only crawls/parses a given
portion of a website?

i have a college site. i'm looking at simply extracting all the information
for a given section of the site, e.g. the registrar section. if i know that
i want all the pages underneath a given url, can lucene be used as a
possible solution?
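
To make "all the pages underneath a given url" concrete, here is a minimal sketch of the scope check a crawler would need before following a link. It is plain Java and not part of Lucene's API (Lucene indexes and searches text; fetching pages is left to a crawler such as Nutch or Heritrix). The class name `SectionScope`, the method `inScope`, and the `www.example.edu` URLs are hypothetical names made up for illustration.

```java
import java.net.URI;

// Hypothetical helper, not a Lucene or Nutch class.
final class SectionScope {

    // True when candidate is on the same host as base and its path is
    // base's path or something underneath it. Throws
    // IllegalArgumentException if either string is not a valid URI.
    static boolean inScope(String base, String candidate) {
        URI b = URI.create(base).normalize();       // normalize() resolves "." and ".."
        URI c = URI.create(candidate).normalize();
        if (b.getHost() == null || c.getHost() == null) {
            return false;                           // relative or opaque URI: reject
        }
        if (!b.getHost().equalsIgnoreCase(c.getHost())) {
            return false;                           // different host
        }
        String basePath = b.getPath();
        if (basePath.endsWith("/")) {
            basePath = basePath.substring(0, basePath.length() - 1);
        }
        String path = c.getPath();
        // Compare on a path-segment boundary so "/registrarx" is not
        // mistaken for something under "/registrar".
        return path.equals(basePath) || path.startsWith(basePath + "/");
    }

    public static void main(String[] args) {
        String section = "http://www.example.edu/registrar/";   // assumed section root
        System.out.println(inScope(section, "http://www.example.edu/registrar/forms/transcript.html"));
        System.out.println(inScope(section, "http://www.example.edu/admissions/"));
    }
}
```

A crawler would call a check like this on every extracted link and drop the ones that fall outside the section.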

also, when lucene returns data, can it be written to a large file/db
structure so i can extract the requisite information from it?

since i already more or less know the DOM structure of the information i'm
looking for, i'm wondering whether i could simply crawl the given section of
the college site, provided i can figure out a way to limit the amount of
information that's returned. i could then do a regex kind of search across
the returned pages...
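
As a rough illustration of the "regex kind of search", the sketch below pulls every match of a pattern out of one fetched page using only `java.util.regex`. The class `FieldExtractor`, the `DEADLINE` pattern, and the `<td class="deadline">` markup are assumptions invented for the example; the real pattern would depend on the actual page structure, and a regex will be fragile if that markup changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for pulling known fields out of fetched HTML.
final class FieldExtractor {

    // Assumed markup: <td class="deadline">...</td>. Group 1 is the cell text
    // with surrounding whitespace excluded.
    static final Pattern DEADLINE = Pattern.compile(
            "<td class=\"deadline\">\\s*(.*?)\\s*</td>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns capture group 1 of every match, in document order.
    static List<String> extractAll(String html, Pattern pattern) {
        List<String> values = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String page = "<tr><td class=\"deadline\">Aug 15</td></tr>"
                    + "<tr><td class=\"deadline\"> Jan 10 </td></tr>";
        System.out.println(extractAll(page, DEADLINE));
    }
}
```

Running the same method over each saved page in turn would give one list of values per page.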

my overall goal is to extract certain pieces of information from the section
of the college site that i crawl...

btw, how does lucene/nutch compare to heritrix?

thanks...

-bruce




