From: Justin Grunau
To: users@jackrabbit.apache.org
Date: Sat, 1 Nov 2008 14:08:40 -0700 (PDT)
Subject: MalformedInputException on Linux with MsPowerPointTextExtractor

Jackrabbit text extractors return Readers from their extractText methods. With PowerPoint files, and only on Linux, I get the following exception as soon as I attempt to read anything from the Reader returned by MsPowerPointTextExtractor.extractText:

sun.io.MalformedInputException
    at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:262)
    at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java:314)
    at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java:345)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:250)
    at sun.nio.cs.StreamDecoder.read0(StreamDecoder.java:199)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:185)
    at java.io.InputStreamReader.read(InputStreamReader.java:196)

Of course I have no control over what encoding any given PowerPoint document happens to be in, nor can I determine the encoding without using some sort of parser to read the file. I also know of no way to tell the returned InputStreamReader which encoding to decode from. It simply appears to use whatever the platform's default encoding is (in this case, UTF-8).

As of now, I have no way to reliably use the Jackrabbit MsPowerPointTextExtractor on Linux at all -- it works fine for me on Windows. Any suggestions?
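
For reference, the sketch below illustrates the behaviour the stack trace suggests: an InputStreamReader built without a charset decodes with the platform default, while one built with an explicit charset decodes the same bytes identically on every OS. The class name ExplicitCharsetReader and the helper readAll are hypothetical names made up for this illustration, and the sample bytes are not real PowerPoint content. Whether the extractor actually constructs its Reader without a charset is an assumption here, not something confirmed from the Jackrabbit source. The code also uses StandardCharsets and try-with-resources, which need Java 7 or later, newer than the JVM in the trace above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetReader {

    /** Reads the whole stream as text, decoding with the given charset (hypothetical helper). */
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // Passing the charset explicitly avoids any dependence on the platform default.
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Illustrative bytes only: a Latin-1 encoded string containing a non-ASCII character.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // The charset that new InputStreamReader(in) would fall back to on this machine.
        System.out.println("Platform default: " + Charset.defaultCharset());

        // Decoding with the charset the bytes were written in round-trips the text.
        System.out.println(readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1));

        // Decoding the same bytes as UTF-8 does not: the lone 0xE9 byte is malformed UTF-8.
        System.out.println(readAll(new ByteArrayInputStream(latin1), StandardCharsets.UTF_8));
    }
}
```

If the extractor does rely on the platform default, a similar explicit charset would have to be supplied where the Reader is created, since a caller cannot change the encoding of an InputStreamReader after construction.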