From: "Gimmy Pegoraro"
To: lucene-user@jakarta.apache.org
Date: Sun, 27 Apr 2003 14:16:12 -0500
Subject: my lucene implementation

Good morning.

First of all, congratulations to all the Lucene developers on their great work, and thank you very much for the valuable support offered through these mailing lists.

I used Lucene as the core of the application I developed for my graduation thesis. I am now submitting my work to this list, and I hope it will be useful to some Lucene users.

You can download the whole application from this URL:
http://www.nsw2001.com/kenshir/lucy/lucy1.1.exe
(self-extracting rar archive, about 18 MB)

or from this URL:
http://www.nsw2001.com/kenshir/lucy/lucy1.1_NO_JVM.exe
(the same, but without its own Java virtual machine; about 3 MB)

If you download the second one, you have to enter the path of your local Java virtual machine in the file jvm.bat after the installation.

I named my application "Lucy" because it is really an implementation of Lucene, or rather an integration of Lucene with some other good open source programs. The latest version I developed is 1.1.
This is its structure, in detail:

Lucy 1.1
  -> Lucene 1.2
  -> HTMLParser 1.2
  -> PdfBox 0.5.6
  -> wvWare 0.7.2-3
  -> xlhtml 0.4.9
  -> antiword 0.33
  -> Xpdf 2.01
  -> Snowball 0.1
  -> NGramJ 01.12.11
  -> it.corila.lucy
       -> IndexAll.java
       -> SearchIndex.java
       -> HTMLDocument.java
       -> PDFDocument.java
       -> ExternalParser.java
       -> ItalianStemFilter.java
       -> EnglishStemFilter.java
       -> ApostropheFilter.java
       -> IndexAnalyzer.java
       -> SearchAnalyzer.java
       -> LanguageCategorizer
       -> NgramjCategorizer.java
  -> lucyweb.war
       -> configuration.jsp
       -> header.jsp
       -> footer.jsp
       -> index.jsp
       -> results.jsp
       -> view.jsp
       -> pagina1.jsp
       -> pagina2.jsp
       -> help.jsp

Indexing, updating and searching are performed, respectively, by the following batch files:
- indicizza.bat
- aggiorna.bat
- cerca.bat

The JSP module lucyweb.war implements searching through a web browser interface.

The main characteristics of Lucy are:
1) it can index the following file types, extracting their plain text:
   - Microsoft doc, ppt, xls
   - Adobe pdf
   - html and txt, of course, as the Lucene demo does
2) it indexes and searches documents written in English and in Italian, with a language-specific stemming procedure
3) it has a configuration file that the user can modify to specify how the application should work
4) it produces a set of log files, so the user can check the results of the last indexing process

The different file types are parsed both by Java applications (such as PDFBox) and by non-Java applications (such as wvWare). In the second case, the external program is driven through the Runtime class, and its output is written to a temporary file, stored in a directory that the program creates for this specific purpose. In the configuration file properties.txt, the user can choose not to have this temporary directory removed automatically at the end of the indexing process. I think this option can be useful when the parsing processes produce errors.
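To illustrate the external-parser step described above, here is a minimal sketch and not Lucy's actual source. It assumes Java 8 or later and uses ProcessBuilder, the newer counterpart of the Runtime class mentioned above, because it can redirect the program's output straight to a file. The class name, method names and the keepTempDir flag are hypothetical, and the JVM's own launcher stands in for a real parser such as wvWare or antiword.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

class ExternalParserSketch {

    /** Runs an external command and stores everything it prints in a new file inside tempDir. */
    static Path runToTempFile(List<String> command, Path tempDir)
            throws IOException, InterruptedException {
        Path output = Files.createTempFile(tempDir, "parsed-", ".txt");
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);        // merge stderr into stdout
        builder.redirectOutput(output.toFile());  // send the merged stream to the temp file
        int exitCode = builder.start().waitFor();
        if (exitCode != 0) {
            // The temp file is left in place so the failure can be inspected.
            throw new IOException("External parser failed with exit code " + exitCode);
        }
        return output;
    }

    /** Deletes tempDir and its contents unless the caller asked to keep it. */
    static void cleanUp(Path tempDir, boolean keepTempDir) throws IOException {
        if (keepTempDir) {
            return;
        }
        try (Stream<Path> paths = Files.walk(tempDir)) {
            // Reverse order so files are deleted before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        }
    }

    public static void main(String[] args) throws Exception {
        Path tempDir = Files.createTempDirectory("lucy-tmp-");
        // Stand-in command (assumption): the running JVM's launcher printing its version.
        String java = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        Path output = runToTempFile(Arrays.asList(java, "-version"), tempDir);
        System.out.println("Captured " + Files.size(output) + " bytes in " + output);
        cleanUp(tempDir, false);
    }
}
```

Passing true as the second argument of cleanUp corresponds to the option of keeping the temporary directory for inspection after a failed parse.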
In some cases (doc and pdf) the user can also choose, in Lucy's configuration file, which application must be used for the parsing process. By modifying the configuration file, the user can run both available applications in two successive indexing and updating passes. This way he can probably get better results than with a single parser. I implemented this option because parsing doc and pdf files is really difficult and often causes indexing errors, even though the open source applications I used are really well made.

The automatic stemming procedure relies on a language categorizer (NGramJ) and language-specific stemming algorithms (Snowball). The application recognizes French and German text too, but a specific stemming procedure for them is not yet implemented, so French text is stemmed as Italian text and German text as English text. This is due to the limited time I had for development, sorry! :)

About the log files: they are stored in a user-defined directory (specified in the properties.txt file), and they are called:
- Indexlog.txt: the general log file, containing the output of the indexing process
- DOClog.txt, XLSlog.txt, PPTlog.txt, PDFlog.txt: these contain the output of the corresponding external parser

That's all, I think. You can find more specific instructions in the "lucy readme.txt" file, which is stored in the main directory of the installation.

My thesis can be downloaded from this URL:
http://www.nsw2001.com/kenshir/lucy/Progettazione e realizzazione di un motore di ricerca per do.pdf

I'm sorry that all these files and all the comments in the source code are written in Italian, as are some messages of the indexing process (I hope they are comprehensible in context anyway) and the help JSP page. My English is poor, as you can see, so I wrote everything in Italian to save time! This is also the reason why this e-mail is so long. If anyone were willing to translate all that material into English, I would be very grateful.
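The language-to-stemmer fallback described earlier in this message (French handled as Italian, German as English) can be sketched as a simple lookup. This is an illustration only, not the code shipped with Lucy: the class name, the enum and the two-letter language codes are assumptions, and the real categorizer may report languages differently. The message does not say what happens for other languages, so the sketch returns an empty result for them.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class StemmerChoiceSketch {

    /** The two stemming procedures the message says are implemented. */
    enum Stemmer { ITALIAN, ENGLISH }

    // Hypothetical language codes; the categorizer's real labels may differ.
    private static final Map<String, Stemmer> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("it", Stemmer.ITALIAN);
        BY_LANGUAGE.put("en", Stemmer.ENGLISH);
        BY_LANGUAGE.put("fr", Stemmer.ITALIAN); // French is stemmed as Italian
        BY_LANGUAGE.put("de", Stemmer.ENGLISH); // German is stemmed as English
    }

    /** Returns the stemmer for a detected language, or empty if the language is not covered. */
    static Optional<Stemmer> stemmerFor(String languageCode) {
        return Optional.ofNullable(BY_LANGUAGE.get(languageCode.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        for (String code : new String[] {"it", "en", "fr", "de", "es"}) {
            System.out.println(code + " -> " + stemmerFor(code));
        }
    }
}
```

Keeping the mapping in one table means that adding a real French or German stemmer later would only require a new enum value and a changed entry.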
I'm sure the code I wrote could be greatly improved, because this was my first experience of programming in Java... but well... it seems to work! ;-P

Any modification or suggestion will be appreciated.

In the near future, Lucy will become the search engine of the Co.Ri.La consortium (http://www.corila.it), and of course the "powered by Lucene" logo will appear on the main search page.

Thank you, bye bye

Gimmy Pegoraro