lucene-java-user mailing list archives

From "Ryan Ackley" <>
Subject New Word Document text extractor released
Date Thu, 04 Mar 2004 06:12:09 GMT
Version 0.4 of the text extraction library has been released!

I have finally gotten around to releasing a new version of the text
extractor. This is a pure Java library for extracting text from
Word 6.0/97/2000/XP/2003 documents.

Some highlights from this release:

-I removed support for PDF documents. I was only wrapping the excellent
PDFBox library with a few lines of code.
-I added support for Word 6.0 documents.
-The extractor will no longer extract text that has been deleted but is
still in the document because of revision tracking
-I added two exceptions, PasswordProtectedException and FastSavedException,
for more graceful failures.
-Fixed bugs
-Updated the license to Apache 2.0

A special thanks to BeeText Inc. They are a
software company that is on the cutting edge of software development for
translation professionals. Besides that, they sponsored all of the above
changes. companies that support open source!

-Ryan Ackley
