lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Adrian Dumitru" <>
Subject classic scenario
Date Wed, 26 May 2004 21:31:17 GMT
I salute the Lucene community!
it will be a great help for me if I get your valuable opinions on the
following issue; I know I could've find more answers to my questions from
reading the documentation but I did invest some time on this and still
have these questions:

I am (also) building a web crawler, a topic specific one to be more
precise, for a vortal. I recently learned about Lucene and I'd very much
like to use it in order to handle keyword specific searched on the info
that I collect.
I suspect this is a "classic" project, at least for Lucene, probably
something like this has been addressed already on this disussion list, I'm
interested to hear any experience anyone might have with this subject.
My crawler goes on the internet, extracts/parse/ranks and saves websites,
most of the information is also categoriezed and stored in the database
but I also save about 10 top pages from each site in the filesystem.
The first question is: should I care about indexing these files at the
time I extract them from internet? Or should I index them later, when I
make them available for search?
If yes, then can I still name my files the way I want?(i.e. are there any
constraints in the filenames from Lucene perspective?)
Is it an OK idea to have the same files repository (or index) where the
crawler writes (indexes files) and the search function searches? I guess
performance issues are important here.
Can I still organize the files that I save the way I want? (I planned to
write all the files from a given website on different folders...and the
folders will have as name the id from my database)
I maintain a taxonomy (list of categories)...each website will fall into
one or more of these categories, also each website will have a rank. Does
Lucene have something that I should be aware of related to what I said?

I guess that's it for now...this is more like a pet project for me, a pet
which keeps growing :) I wouldn't mind any help and opinions you can
provide, source code samples, etc.

Big thanks in advance and good luck on your work.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message