lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mike Richmond" <richmondm...@gmail.com>
Subject Re: use of lucene app..
Date Wed, 21 Jun 2006 02:07:04 GMT
Hello Bruce,

> is there a way to set lucene so that it only parses/crawls through a given
> portion of a website...

Lucene does not crawl anything on its own.  It is simply a search
engine library.  All indexing and crawling must be done by an
application that you create.


> also, when lucene returns data, can it be slammed/put into a large file/db
> structure so i can extract the requisite information from it.

I know that it is possible to setup lucene to store its data in a SQL database

> btw, how does lucene/nutch compare to heritrix?

I'm not sure, I have never used heritrix.

Hope some of this was helpful.


--Mike




On 6/20/06, bruce <bedouglas@earthlink.net> wrote:
> hi..
>
> is there a way to set lucene so that it only parses/crawls through a given
> portion of a website...
>
> i have a college site. i'm looking at simply extracting all the information
> for a given section of the site, ie the registrar section... if i can
> determine that i want all the pages underneath a given url, can lucene be
> used as a possible solution...
>
> also, when lucene returns data, can it be slammed/put into a large file/db
> structure so i can extract the requisite information from it.
>
> i'm wondering if i already more or less know the DOM structure of the
> information i'm looking for, i could simply crawl the given section of the
> college site, if i could figure out a way to limit the amount of information
> that's returned.. i could then do a regex kind of search across the returned
> pages...
>
> my overall goal is to extract certain pieces of information from the section
> of the college site that i crawl...
>
> btw, how does lucene/nutch compare to heritrix?
>
> thanks...
>
> -bruce
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message