lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ryan Ackley" <sack...@cfl.rr.com>
Subject New Word Document text extractor released
Date Thu, 04 Mar 2004 06:12:09 GMT
Version 0.4 of the TextMining.org text extraction library has been released!

I have finally gotten around to releasing a new version of the
textmining.org text extractor. This is a pure java library for extracting
text from Word 6.0/97/2000/XP/2003.

Some highlights from this release:

-I removed support for PDF documents. I was only wrapping the excellent
PDFBox (http://www.pdfbox.org) library with a few lines of code.
-I added support for Word 6.0 documents.
-The extractor will no longer extract text that has been deleted but is
still in the document because of revision tracking
-I added two exceptions, PasswordProtectedException and FastSavedException,
for more graceful failures.
-Fixed bugs
-Updated the license to Apache 2.0

A special thanks to BeeText Inc. (http://www.beetext.com)  They are a
software company that is on the cutting edge of software development for
translation professionals. Besides that, they sponsored all of the above
changes. Remember...support companies that support open source!

-Ryan Ackley


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message