jackrabbit-dev mailing list archives

From "Lansing, Carina S" <Carina.Lans...@pnl.gov>
Subject RE: MsWordTextFilter Problem
Date Wed, 07 Jun 2006 23:21:25 GMT
Hi Thomas,

We encountered the exact same problem.  I did some unit tests, and the
org.textmining.text.extraction.WordExtractor does not work very well.
As you described, it omits whole sections of documents (apparently
triggered by certain formatting fields present in the document).

I noticed in the latest 3.0 alpha1 build of POI (checked out from svn),
that it contains a new WordExtractor class under the scratchpad area:
org.apache.poi.hwpf.extractor.WordExtractor.  This class has an almost
identical API to the org.textmining equivalent.  I did some preliminary
testing, and this new class works much better at text extraction.  All
my Word documents are getting indexed now.  I created my own
MsWordTextFilter using this alternate class, and it is working well, but
I need to do more testing (especially on the other POI-based filters, to
make sure they didn't break from the new POI jarfiles).  Hope this
information is helpful.
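
For illustration only, here is a minimal sketch of how a filter could keep
the extractor swappable, so that moving from one WordExtractor to the other
is a one-line change. This is not the actual filter described above: the
names WordTextSource and SwappableWordFilter are hypothetical, and the class
does not implement Jackrabbit's real TextFilter interface. The POI call shown
in the comment assumes the scratchpad jar is on the classpath.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical seam: anything that turns a .doc byte stream into plain text. */
interface WordTextSource {
    String extract(InputStream in) throws IOException;
}

/** Hypothetical stand-in for a custom Word filter; not Jackrabbit's TextFilter. */
final class SwappableWordFilter {
    private final WordTextSource source;

    SwappableWordFilter(WordTextSource source) {
        this.source = source;
    }

    /** Extracts all text from the stream and closes the stream afterwards. */
    String extractText(InputStream in) {
        try (InputStream stream = in) {
            return source.extract(stream);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Wraps the extracted text in a Reader, the form an indexer would consume. */
    Reader toReader(InputStream in) {
        return new StringReader(extractText(in));
    }
}

// Assumed wiring with the POI scratchpad extractor (requires the POI jars):
//   new SwappableWordFilter(
//       in -> new org.apache.poi.hwpf.extractor.WordExtractor(in).getText());
```

Keeping the extractor behind a small interface also makes it easy to run
the same set of test documents through both implementations and compare
the output.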


-----Original Message-----
From: thomasg [mailto:thomasgascoigne@hotmail.com] 
Sent: Tuesday, May 16, 2006 2:52 AM
To: dev@jackrabbit.apache.org
Subject: MsWordTextFilter Problem

Has anyone encountered problems with this text filter? I am testing the
text extraction of quite a large document (6MB worth of Thinking In Java
by Bruce Eckel). Searching was not producing the expected results. I
took the Reader object generated by the MsWordTextFilter, converted it
into a String, and wrote it to a file. Inspection shows that most of
the document has been omitted. The missing part is in the middle of the
file, and there are no particularly unusual contents that mark the start
of the missing section. I've tested larger docs that work fine, so it's
a bit of a mystery.
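
The check described above (drain the filter's Reader into a String and
write it to a file for inspection) can be sketched roughly as follows. The
class and method names are hypothetical, and UTF-8 is an assumed output
encoding; the Reader would come from whichever text filter is under test.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical debugging helper for inspecting a text filter's output. */
final class ReaderDump {

    /** Reads the Reader to the end, closes it, and returns the contents. */
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[8192];
        try (Reader r = reader) {
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Writes the extracted text to target as UTF-8; returns the character count. */
    static int dump(Reader reader, Path target) {
        String text = readAll(reader);
        try {
            Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.length();
    }
}
```

Comparing the returned character count against the length of the text
copied out of Word gives a quick signal of how much was dropped before
reading the dump by eye.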

Cheers, Thomas
