lucene-java-user mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: classic scenario
Date Thu, 27 May 2004 12:40:11 GMT
Hello,

Answers inlined.

--- Adrian Dumitru <ctrl@altonsys.com> wrote:

> I am (also) building a web crawler, a topic-specific one to be more
> precise, for a vortal. I recently learned about Lucene and I'd very much
> like to use it to handle keyword-specific searches on the info that I
> collect.
> I suspect this is a "classic" project, at least for Lucene; probably
> something like this has been addressed already on this discussion list.
> I'm interested to hear any experience anyone might have with this
> subject.

See http://www.nutch.org/
It may make sense to join Nutch, contribute patches that help you, etc.
instead of building your own crawler from scratch.

> My crawler goes on the internet, extracts/parses/ranks and saves
> websites. Most of the information is also categorized and stored in the
> database, but I also save about 10 top pages from each site in the
> filesystem.
> The first question is: should I care about indexing these files at the
> time I extract them from the internet? Or should I index them later,
> when I make them available for search?

Lucene does not care about files and is not limited to indexing files. 
It sounds like you tried the Lucene demo that indexes files in the file
system.

However, indexing in batches, rather than as you crawl, may be a
cleaner, more scalable, and more manageable approach.  Nutch uses that
approach for a reason. :)
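
To make the batch idea concrete, here is a minimal sketch in plain Java.
It is not Lucene code: the class name, method names, and the in-memory
map are all hypothetical stand-ins, and the map only plays the role that
a real Lucene index would play. The point is the separation: the crawl
step only records pages, and a separate pass indexes everything that has
accumulated.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: the maps below stand in for a real index.
final class BatchIndexSketch {

    // Pages fetched by the crawler but not yet indexed: page id -> text.
    private final Map<String, String> pending = new LinkedHashMap<>();

    // Stand-in inverted index: term -> ids of the pages that contain it.
    private final Map<String, Set<String>> index = new TreeMap<>();

    /** Crawl phase: only record the page, do no indexing work here. */
    void onPageFetched(String pageId, String text) {
        pending.put(pageId, text);
    }

    /** Index phase: index all pending pages in one pass, return how many. */
    int indexBatch() {
        int indexed = 0;
        for (Map.Entry<String, String> page : pending.entrySet()) {
            String[] terms = page.getValue().toLowerCase(Locale.ROOT).split("\\W+");
            for (String term : terms) {
                if (!term.isEmpty()) {
                    index.computeIfAbsent(term, t -> new TreeSet<>()).add(page.getKey());
                }
            }
            indexed++;
        }
        pending.clear();
        return indexed;
    }

    /** Search phase: look up a single term, ignoring case. */
    Set<String> search(String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

With this split, pages fetched after the last call to indexBatch() are
simply not searchable until the next batch runs, which is the trade-off
you accept in exchange for keeping the crawler and the indexer
independent.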

> If yes, then can I still name my files the way I want? (i.e. are there
> any constraints on the filenames from Lucene's perspective?)

No constraints.

> Is it an OK idea to have the same files repository (or index) where the
> crawler writes (indexes files) and the search function searches?

Not a good idea.  Keep your Lucene index directory clean, and use it
only as an index directory.  Write your files elsewhere, I would
suggest.

> I guess performance issues are important here.
> Can I still organize the files that I save the way I want? (I planned to
> write all the files from a given website to different folders... and the
> folders will have as name the id from my database)

That is up to you and your application.  I just suggest you keep those
files outside the index directory, so that the index directory stays
clean and well organized.
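
One way to enforce that separation in code is sketched below, using only
java.nio.file. The class name, the method names, and the folder-per-site
layout keyed by database id are assumptions taken from your description,
not anything Lucene requires. The sketch is stricter than my advice
above: it rejects nesting in either direction, so the page store cannot
sit inside the index directory and the index directory cannot sit inside
the page store.

```java
import java.nio.file.Path;

// Hypothetical helper: keeps saved pages and the index in separate trees.
final class StorageLayout {
    private final Path indexDir;
    private final Path pagesRoot;

    StorageLayout(Path indexDir, Path pagesRoot) {
        Path idx = indexDir.toAbsolutePath().normalize();
        Path pages = pagesRoot.toAbsolutePath().normalize();
        // Reject nesting in either direction so the two trees never mix.
        if (pages.startsWith(idx) || idx.startsWith(pages)) {
            throw new IllegalArgumentException(
                "index directory and page store must be separate trees");
        }
        this.indexDir = idx;
        this.pagesRoot = pages;
    }

    Path indexDir() {
        return indexDir;
    }

    /** Location of one saved page: <pagesRoot>/<siteId>/<fileName>. */
    Path pageFile(long siteId, String fileName) {
        Path siteDir = pagesRoot.resolve(Long.toString(siteId));
        Path file = siteDir.resolve(fileName).normalize();
        // Guard against names such as "../x" that escape the site folder.
        if (!file.startsWith(siteDir)) {
            throw new IllegalArgumentException("file name escapes site folder: " + fileName);
        }
        return file;
    }
}
```

The file names themselves stay entirely under your control; the index
only needs to store whatever identifier lets you find the saved page
again, for example the site id and file name.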

> I maintain a taxonomy (list of categories)... each website will fall
> into one or more of these categories, and each website will also have a
> rank. Does Lucene have something that I should be aware of related to
> what I said?

Lucene ranks search result items.  Look at the Similarity and
DefaultSimilarity classes.  It sounds like you may benefit from having
a custom Similarity that is aware of your categories.
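
As a rough illustration of what "aware of your categories" could mean,
the sketch below combines a text relevance score with a site rank and a
category match. It is plain Java and deliberately does not use or imitate
the Similarity API; the class, the parameters, and the formula itself are
assumptions made up for this example, so treat it as a starting point
for thinking about the weighting rather than as how Lucene scores.

```java
import java.util.Set;

// Hypothetical scoring helper; the formula is an illustrative assumption.
final class CategoryAwareScore {

    private CategoryAwareScore() {
    }

    /**
     * @param textScore      relevance of the page text to the query (>= 0)
     * @param siteRank       your own rank for the site (>= 0)
     * @param siteCategories categories the site belongs to in your taxonomy
     * @param wantedCategory category the searcher is browsing
     * @param categoryBoost  multiplier applied when the category matches
     */
    static double score(double textScore, double siteRank,
                        Set<String> siteCategories, String wantedCategory,
                        double categoryBoost) {
        // A site in the wanted category gets the boost, others are unchanged.
        double boost = siteCategories.contains(wantedCategory) ? categoryBoost : 1.0;
        // Adding 1 to the rank keeps unranked sites (rank 0) from scoring zero.
        return textScore * (1.0 + siteRank) * boost;
    }
}
```

Whatever formula you settle on, the category and rank have to be
available at search time, so they need to be stored with each indexed
document or looked up from your database by id.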

> I guess that's it for now... this is more like a pet project for me, a
> pet which keeps growing :) I wouldn't mind any help and opinions you can
> provide, source code samples, etc.

If this is really a pet project, perhaps joining Nutch will also be fun
for you.  Some recent Nutch contributors are also Lucene users.

Otis
