jackrabbit-users mailing list archives

From Justin Grunau <jjm...@yahoo.com>
Subject MalformedInputException on Linux with MsPowerPointTextExtractor
Date Sat, 01 Nov 2008 21:08:40 GMT
Jackrabbit text extractors return Readers from their extractText methods.

In the case of PowerPoint files, I am finding that on Linux alone, I get the following exception
stack trace when I attempt to read anything from the Reader that the
MsPowerPointTextExtractor.extractText method returns:

        at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:262)
        at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java:314)
        at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java:345)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:250)
        at sun.nio.cs.StreamDecoder.read0(StreamDecoder.java:199)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:185)
        at java.io.InputStreamReader.read(InputStreamReader.java:196)

Of course I have no control over what encoding any PowerPoint documents happen to be in (nor
can I determine the encoding without using some sort of parser to read the file).  I also
know of no way to tell an InputStreamReader what encoding to convert into.  It simply appears
that whatever the default encoding of the operating system is (in this case, UTF8) will be used.
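
For reference, java.io.InputStreamReader does have constructors that take an explicit charset, so the platform default is only used when none is given. The sketch below is a minimal illustration, assuming you control the code that wraps the InputStream; the class name, the readAll helper and the sample bytes are hypothetical, and it uses Java 7+ features (StandardCharsets, try-with-resources). Whether this helps with MsPowerPointTextExtractor depends on how the extractor builds its Reader internally, which is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetReader {

    // Hypothetical helper: reads the whole stream, decoding with the given
    // charset instead of the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample data (an assumption for illustration): "caf" followed by
        // 0xE9, which is e-acute in ISO-8859-1 but not a valid UTF-8
        // sequence on its own.
        byte[] bytes = {0x63, 0x61, 0x66, (byte) 0xE9};

        // Decoding with the charset the bytes were actually written in.
        String text = readAll(new ByteArrayInputStream(bytes),
                              StandardCharsets.ISO_8859_1);
        System.out.println("Decoded length: " + text.length());

        // What a charset-less InputStreamReader would fall back to.
        System.out.println("Platform default: " + Charset.defaultCharset());
    }
}
```

The point of the sketch is only that the decoding charset can be chosen by the caller; picking the right charset for a given PowerPoint file is a separate problem.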

As of now, I have no way to reliably use the Jackrabbit MsPowerPointTextExtractor on Linux
at all -- it works fine for me on Windows.  Any suggestions?

