lucene-java-user mailing list archives

From "Gimmy Pegoraro" <kens...@mail.com>
Subject my lucene implementation
Date Sun, 27 Apr 2003 19:16:12 GMT
Good morning.

First of all, congratulations to all Lucene developers for their great work.
And thank you very much for the precious support offered by these mailing lists.

I used Lucene as the core of the application I developed for my graduation thesis.
I am now submitting my work to this list, and I hope it will be useful to some Lucene users.

You can download the whole application from this URL:
http://www.nsw2001.com/kenshir/lucy/lucy1.1.exe
(self-extracting RAR archive, about 18 MB)

or from this URL:
http://www.nsw2001.com/kenshir/lucy/lucy1.1_NO_JVM.exe
(the same archive, but without a bundled Java virtual machine; about 3 MB)

If you download the second one, you have to enter the path of your local Java virtual machine
in the file jvm.bat after installation.

I named my application "Lucy", since it is essentially an application built on Lucene,
or rather an integration of Lucene with several other good open source programs.

The latest version is 1.1.
This is its structure, in detail:

Lucy 1.1	-> Lucene 1.2
		-> HTMLParser 1.2
		-> PdfBox 0.5.6
		-> wvWare 0.7.2-3
		-> xlhtml 0.4.9
		-> antiword 0.33
		-> Xpdf 2.01 
		-> Snowball 0.1
		-> NGramJ 01.12.11

		-> it.corila.lucy	-> IndexAll.java
					-> SearchIndex.java
					-> HTMLDocument.java
					-> PDFDocument.java
					-> ExternalParser.java
					-> ItalianStemFilter.java
					-> EnglishStemFilter.java
					-> ApostropheFilter.java
					-> IndexAnalyzer.java
					-> SearchAnalyzer.java
					-> LanguageCategorizer
					-> NgramjCategorizer.java

		-> lucyweb.war		-> configuration.jsp
					-> header.jsp
					-> footer.jsp
					-> index.jsp
					-> results.jsp
					-> view.jsp
					-> pagina1.jsp
					-> pagina2.jsp
					-> help.jsp
					

Indexing, updating and searching are run through the following batch files:
- indicizza.bat (indexing)
- aggiorna.bat (updating)
- cerca.bat (searching)
The JSP module lucyweb.war provides searching through a web browser interface.

The main characteristics of Lucy are:
1) it can index the following file types, extracting their plain text:
  - Microsoft doc, ppt, xls
  - Adobe pdf
  - html and txt, of course, as the Lucene demo does.
2) it indexes and searches documents written in English and in Italian, with language-specific
stemming
3) it has a configuration file that the user can edit to specify how the application should
work
4) it produces a set of log files, so the user can check the results of the last indexing
process

The different file types are parsed both by Java libraries (such as PDFBox) and by non-Java
programs (such as wvWare). In the second case, the external program is launched through the
Runtime class and its output is written to a temporary file, stored in a directory that Lucy
creates for this purpose. In the configuration file properties.txt the user can choose to keep
this temporary directory, so that the application does not remove it automatically at the end
of the indexing process. I think this option can be useful when a parser produces errors.
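
As an illustration only (this is not the code shipped in Lucy), the external-parser step
described above could look roughly like the sketch below. It uses ProcessBuilder, a newer
alternative to calling Runtime.exec directly; the class name, method names and the keepTempDir
flag are hypothetical and do not come from Lucy's sources or from properties.txt.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

class ExternalParserSketch {

    /**
     * Runs an external command (for example a doc-to-text converter) and
     * captures everything it prints in a new file inside tempDir.
     */
    static Path runToTempFile(List<String> command, Path tempDir)
            throws IOException, InterruptedException {
        Files.createDirectories(tempDir);
        Path out = Files.createTempFile(tempDir, "parsed-", ".txt");
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true)      // merge stderr into stdout
                .redirectOutput(out.toFile())   // write the merged output to the temp file
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            // The output file is left in place so the failure can be inspected.
            throw new IOException("external parser failed (exit code " + exitCode
                    + "), see " + out);
        }
        return out;
    }

    /** Removes the temporary directory unless the user asked to keep it. */
    static void cleanUp(Path tempDir, boolean keepTempDir) throws IOException {
        if (keepTempDir || !Files.exists(tempDir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(tempDir)) {
            // Reverse order deletes files before the directories that contain them.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }
}
```

The caller would pass the parser's command line as the list, read the returned file to get the
text to index, and call cleanUp once at the end of the indexing run.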

In some cases (doc and pdf) the user can also choose, in Lucy's configuration file, which
application must be used for parsing.
By changing the configuration file between runs, the user can apply both available applications
in two successive runs of indexing and updating, and in this way can probably get better results
than with a single parser. I implemented this option because parsing doc and pdf files is really
difficult and often causes indexing errors, even though the open source applications I used
are really well made.
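
A minimal sketch of how such a parser choice could be read from a properties file follows. The
key names (parser.doc, parser.pdf) and the default values are assumptions made for illustration;
the real keys in Lucy's properties.txt are not shown in this message. The candidate parsers are
presumably wvWare or antiword for doc, and PdfBox or Xpdf for pdf, as listed in the structure
above.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Locale;
import java.util.Properties;

class ParserConfigSketch {

    /** Loads key=value settings, e.g. from a FileReader opened on properties.txt. */
    static Properties load(Reader in) throws IOException {
        Properties config = new Properties();
        config.load(in);
        return config;
    }

    /**
     * Returns the parser configured for a file type, in lower case.
     * Only "doc" and "pdf" are selectable; the defaults are hypothetical.
     */
    static String parserFor(Properties config, String fileType) {
        String defaultParser;
        switch (fileType) {
            case "doc":
                defaultParser = "antiword";
                break;
            case "pdf":
                defaultParser = "pdfbox";
                break;
            default:
                throw new IllegalArgumentException("no selectable parser for: " + fileType);
        }
        // Hypothetical key names: parser.doc and parser.pdf
        return config.getProperty("parser." + fileType, defaultParser)
                .trim()
                .toLowerCase(Locale.ROOT);
    }
}
```

Indexing once with one value and then updating with the other, as described above, would only
require editing that one line of the file between the two runs.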

Stemming is automatic: a language categorizer (NGramJ) identifies the language of the text, and
the matching stemming algorithm (Snowball) is applied. The application recognizes French and
German text too, but specific stemming for them is not yet implemented, so French text is
stemmed as Italian and German text as English. This is due to the limited time I had for
development, sorry! :)
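
The fallback rule above can be written down as a small lookup. This is only a sketch: the
two-letter language codes and the returned stemmer names are hypothetical placeholders, not the
values actually produced by NGramJ or expected by Snowball. The message does not say how other
languages are handled, so the sketch simply rejects them.

```java
import java.util.Locale;

class StemmerChoiceSketch {

    /**
     * Maps a detected language code to the stemmer that is applied.
     * French falls back to the Italian stemmer and German to the English
     * one, as described in the text above.
     */
    static String stemmerFor(String detectedLanguage) {
        switch (detectedLanguage.toLowerCase(Locale.ROOT)) {
            case "it":
            case "fr": // no French stemmer yet: stemmed as Italian
                return "italian";
            case "en":
            case "de": // no German stemmer yet: stemmed as English
                return "english";
            default:
                throw new IllegalArgumentException(
                        "unsupported language: " + detectedLanguage);
        }
    }
}
```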

About the log files: they are stored in a user-defined directory (specified in the properties.txt
file), and they are called:
- Indexlog.txt:  the general log file, containing the output of the indexing process
- DOClog.txt, XLSlog.txt, PPTlog.txt, PDFlog.txt:  each contains the output of the corresponding
external parser

That's all, I think.
You can find more detailed instructions in the file "lucy readme.txt", which is stored in the
main directory of the installation.
My thesis can be downloaded from this URL:
http://www.nsw2001.com/kenshir/lucy/Progettazione e realizzazione di un motore di ricerca per do.pdf

I'm sorry that all these files and all the comments in the source code are written in Italian,
as are some messages of the indexing process (I hope they are comprehensible in that context
anyway) and the help JSP page. My English is poor, as you can see, so I wrote everything in
Italian to save time! This is also the reason why this e-mail is so long.
If anyone were willing to translate all of that into English, I would be very grateful.

I'm sure the code I wrote could be greatly improved, because this was my first experience of
programming in Java... but well... it seems to work! ;-P Any modification or suggestion will
be appreciated.

In the near future, Lucy will become the search engine of the Co.Ri.La consortium
(http://www.corila.it), and of course the "powered by Lucene" logo will appear on the main
search page.
Thank you, bye bye

Gimmy Pegoraro


