Mailing-List: contact lucene-user-help@jakarta.apache.org; run by ezmlm
Precedence: bulk
Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Message-ID: <20030427191612.49787.qmail@mail.com>
Content-Type: text/plain; charset="iso-8859-1"
Content-Disposition: inline
Content-Transfer-Encoding: 7bit
MIME-Version: 1.0
From: "Gimmy Pegoraro" <kenshir@mail.com>
To: lucene-user@jakarta.apache.org
Date: Sun, 27 Apr 2003 14:16:12 -0500
Subject: my lucene implementation

Good morning.

First of all, congratulations to all Lucene developers on their great work.
And thank you very much for the valuable support offered by these mailing lists.

I used Lucene as the core of the application I developed for my graduation thesis.
Now I'm submitting my work to this list, and I hope it will be useful to some Lucene users.

You can download the whole application from this URL:
http://www.nsw2001.com/kenshir/lucy/lucy1.1.exe
self-extracting rar archive, about 18 MB

or from this URL:
http://www.nsw2001.com/kenshir/lucy/lucy1.1_NO_JVM.exe
the same, but without a bundled Java virtual machine. About 3 MB.

If you download the second one, after the installation you have to enter the path of your local Java virtual machine in the file jvm.bat.

I named my application "Lucy", as it's actually an implementation of Lucene, or rather an integration of Lucene with some other good open source programs.

The latest version is 1.1.
This is its structure, in detail:

Lucy 1.1	-> Lucene 1.2
		-> HTMLParser 1.2
		-> PdfBox 0.5.6
		-> wvWare 0.7.2-3
		-> xlhtml 0.4.9
		-> antiword 0.33
		-> Xpdf 2.01 
		-> Snowball 0.1
		-> NGramJ 01.12.11

		-> it.corila.lucy	-> IndexAll.java
					-> SearchIndex.java
					-> HTMLDocument.java
					-> PDFDocument.java
					-> ExternalParser.java
					-> ItalianStemFilter.java
					-> EnglishStemFilter.java
					-> ApostropheFilter.java
					-> IndexAnalyzer.java
					-> SearchAnalyzer.java
					-> LanguageCategorizer
					-> NgramjCategorizer.java

		-> lucyweb.war		-> configuration.jsp
					-> header.jsp
					-> footer.jsp
					-> index.jsp
					-> results.jsp
					-> view.jsp
					-> pagina1.jsp
					-> pagina2.jsp
					-> help.jsp
					

Indexing, updating and searching are run through the following batch files:
- indicizza.bat (indexing)
- aggiorna.bat (updating)
- cerca.bat (searching)
The JSP module lucyweb.war provides searching through a web browser interface.

The main characteristics of Lucy are:
1) it can index the following file types, extracting their plain text:
  - Microsoft doc, ppt, xls
  - Adobe pdf
  - html and txt, as the Lucene demo does.
2) it indexes and searches documents written in English and in Italian, with a specific stemming procedure for each language
3) it has a configuration file that the user can modify to specify how the application should work
4) it produces a set of log files, so the user can check the results of the last indexing process

The different file types are parsed both by Java libraries (such as PDFBox) and by non-Java applications (such as wvWare). In the second case, the external program is driven through the Runtime class, and its output is written to a temporary file, stored in a directory that the program creates for this specific purpose. In the configuration file properties.txt the user can choose not to have this temporary directory removed automatically at the end of the indexing process. I think this option can be useful when the parsing processes produce errors.
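
A minimal sketch of how an external parser can be launched with its output captured in a temporary file, as described above. This is not Lucy's actual code: it uses ProcessBuilder instead of calling Runtime.exec directly, because ProcessBuilder can redirect the output to files without extra threads. The class name, the method name and the "parsed-" file prefix are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ExternalParserSketch {

    /**
     * Runs an external command, writes its standard output to a new
     * temporary file inside tempDir and its standard error to logFile.
     * Returns the path of the temporary file holding the extracted text.
     */
    public static Path runToTempFile(List<String> command, Path tempDir, Path logFile)
            throws IOException, InterruptedException {
        // Create the working directory if it does not exist yet.
        Files.createDirectories(tempDir);
        Path output = Files.createTempFile(tempDir, "parsed-", ".txt");

        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectOutput(output.toFile()); // extracted text
        builder.redirectError(logFile.toFile()); // parser messages

        Process process = builder.start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            // The temporary file is left in place so it can be inspected.
            throw new IOException("External parser exited with code " + exitCode
                    + "; see " + logFile);
        }
        return output;
    }
}
```

The command list would hold the parser executable followed by its arguments and the document path. Sending standard error to its own file keeps parser messages out of the text that gets indexed.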

In some cases (doc and pdf) the user can also choose, in Lucy's configuration file, which application is used for parsing.
By modifying the configuration file, the user can run both available applications in two successive passes, one of indexing and one of updating. This will probably give better results than a single parser. I implemented this option because parsing doc and pdf files is really difficult and often causes indexing errors, even though the open source applications I used are really well made.
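
A minimal sketch of how such a choice could be read from a properties file. The key names (parser.doc, parser.pdf, temp.keep), the accepted values and the defaults are assumptions made for illustration; the real keys in properties.txt may differ.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Properties;

public class ParserConfigSketch {

    /** Loads key=value settings from any character source. */
    public static Properties load(Reader reader) throws IOException {
        Properties properties = new Properties();
        properties.load(reader);
        return properties;
    }

    /** Parser for doc files; "wvware" is an assumed default. */
    public static String docParser(Properties properties) {
        return choice(properties, "parser.doc", "wvware",
                Arrays.asList("wvware", "antiword"));
    }

    /** Parser for pdf files; "pdfbox" is an assumed default. */
    public static String pdfParser(Properties properties) {
        return choice(properties, "parser.pdf", "pdfbox",
                Arrays.asList("pdfbox", "xpdf"));
    }

    /** Whether the temporary directory survives the indexing run. */
    public static boolean keepTempDir(Properties properties) {
        return Boolean.parseBoolean(properties.getProperty("temp.keep", "false").trim());
    }

    // Reads one setting and rejects values outside the allowed list.
    private static String choice(Properties properties, String key,
                                 String defaultValue, List<String> allowed) {
        String value = properties.getProperty(key, defaultValue)
                .trim().toLowerCase(Locale.ROOT);
        if (!allowed.contains(value)) {
            throw new IllegalArgumentException(
                    "Unsupported value for " + key + ": " + value);
        }
        return value;
    }
}
```

With this layout, the two-pass approach would amount to indexing with one value of parser.doc, changing it, and then running the update.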

Automatic stemming is done with a language categorizer (NGramJ) and language-specific stemming algorithms (Snowball). The application also recognizes French and German text, but specific stemming for these languages is not yet implemented, so French text is stemmed as Italian and German text as English. This is due to the limited time I had for development, sorry! :)
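
The language-to-stemmer fallback described above can be sketched as a simple lookup. This is only an illustration: it does not call NGramJ or Snowball, the two-letter language codes are an assumed representation of the categorizer's result, and rejecting other languages is an assumption too, since Lucy's behaviour for them is not described here.

```java
import java.util.Locale;

public class StemmerChoiceSketch {

    /** The two stemmers that are available. */
    public enum Stemmer { ITALIAN, ENGLISH }

    /**
     * Maps a detected language code to the stemmer to apply.
     * French falls back to Italian and German to English.
     */
    public static Stemmer forLanguage(String languageCode) {
        switch (languageCode.toLowerCase(Locale.ROOT)) {
            case "it":
            case "fr": // no French stemmer yet: stemmed as Italian
                return Stemmer.ITALIAN;
            case "en":
            case "de": // no German stemmer yet: stemmed as English
                return Stemmer.ENGLISH;
            default:
                // Assumption: any other language is rejected.
                throw new IllegalArgumentException(
                        "No stemmer for language: " + languageCode);
        }
    }
}
```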

About the log files: they are stored in a user-defined directory (specified in the properties.txt file), and they are called:
- Indexlog.txt:  the general log file, which contains the output of the indexing process
- DOClog.txt, XLSlog.txt, PPTlog.txt, PDFlog.txt:  each contains the output of the corresponding external parser

That's all, I think.
You can find more specific instructions in the file "lucy readme.txt", which is stored in the main directory of the installation.
My thesis can be downloaded from this URL:
http://www.nsw2001.com/kenshir/lucy/Progettazione e realizzazione di un motore di ricerca per do.pdf

I'm sorry that all these files and all the comments in the source code are written in Italian, as are some messages of the indexing process (I hope they are comprehensible in that context anyway) and the help JSP page. My English is poor, as you can see, so I wrote everything in Italian to save time! This is also the reason why this e-mail is so long.
If anyone is willing to translate all that material into English, I would be very grateful.

I'm sure the code I wrote can be greatly improved, because this was my first experience of programming in Java... but well... it seems to work! ;-P Any modification or suggestion will be appreciated.

In the near future, Lucy will become the search engine of the Co.Ri.La consortium (http://www.corila.it), and of course the "powered by Lucene" logo will appear on the main search page.
Thank you, bye bye

Gimmy Pegoraro

