lucene-java-user mailing list archives

From "Kelvin Tan" <>
Subject Re: PDF / Word document parsers
Date Fri, 19 Apr 2002 06:25:10 GMT

I've had moderate success using Etymon for PDF parsing. It does consume
quite a lot of memory for larger PDF documents, but otherwise it's OK.
What difficulties are you facing?

For MS Word parsing, the Jakarta POI project is working on something, but
in the meantime I've managed to search MS Word documents by reading the
file and stripping out the nonsense characters. It's a hack, I think, but
if I increase the IndexWriter's maxFieldLength to about a million, I can
search 13-15MB Word documents with ease.
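
A minimal sketch of what such a "strip the nonsense characters" hack might
look like, assuming the file is read as raw bytes and only runs of printable
ASCII are kept. The class name WordTextStripper, its methods, and the minRun
threshold are hypothetical and are not part of Lucene or POI; this is not
necessarily the exact approach described above. Text that Word stores as
16-bit characters would likely be missed by this byte-level filter.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of Lucene or POI.
final class WordTextStripper {

    /**
     * Keeps runs of at least minRun printable ASCII bytes (0x20..0x7E).
     * Shorter runs and all other bytes are dropped; kept runs are
     * joined with a single space.
     */
    static String extractPrintableRuns(byte[] data, int minRun) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF; // treat the byte as unsigned
            if (c >= 0x20 && c <= 0x7E) {
                run.append((char) c);
            } else {
                flush(run, out, minRun);
            }
        }
        flush(run, out, minRun); // handle a run that reaches end of data
        return out.toString();
    }

    // Appends the current run to the output if it is long enough, then resets it.
    private static void flush(StringBuilder run, StringBuilder out, int minRun) {
        if (run.length() >= minRun) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(run);
        }
        run.setLength(0);
    }

    /** Convenience wrapper: reads the whole file into memory, then filters it. */
    static String extractFromFile(String path, int minRun) throws IOException {
        return extractPrintableRuns(Files.readAllBytes(Paths.get(path)), minRun);
    }
}
```

The resulting string would then be indexed as an ordinary text field. A
13-15MB document can yield far more terms than a default field limit allows,
which is presumably why maxFieldLength needs to be raised as described.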

----- Original Message -----
From: "Anita Srinivas" <>
To: "Lucene Users List" <>
Sent: Friday, April 19, 2002 2:13 PM
Subject: PDF / Word document parsers


I have been looking for PDF and Word document parsers.  I have tried the
contributions page on the Lucene site as suggested by a Lucene User. The
PJEtymon does not have a Windows version.  The XPDF does not do the parsing
very well.

Can someone suggest better Word document or PDF parsers than the ones I
mentioned here?


Anita Srinivas
