> I have been looking for PDF and Word document parsers. I have tried the
> contributions page on the Lucene site as suggested by a Lucene User. The
> PJEtymon does not have a Windows version. The XPDF does not do the parsing
> very well.
I've run Etymon with some degree of success in window boxes. To parse word
document you can have a look for OpenOffice. You can start OpenOffice to
receive a socket connection. From your Java app, you open a connection to
OpenOffice (using OpenOffice SDK), send the word document and it will convert
it to text.
You can also use OpenOffice various other parsing. The url: www.openoffice.org
Note: I've never tried OpenOffice under windows, so I'm not sure how it will
work, but we are using it here to index our word documents.
Regards,
--
Victor Hadianto
---------------
More are taken in by hope than by cunning. -- Vauvenargues
--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
|