From: Fisheye
To: java-user@lucene.apache.org
Subject: Lucene - FileFormat
Date: Fri, 21 Apr 2006 04:23:51 -0700 (PDT)
Message-ID: <4024568.post@talk.nabble.com>

I'm trying to build a plain-text parser for different file formats such as MS Word, Excel, PowerPoint, Rich Text Format, plain text, HTML, PDF, etc. I use the well-known libraries PDFBox and POI, plus some parts of AtLeap. Now I also need to support the OpenOffice formats and, more importantly, the msg format (the MS Outlook message format).

Does anyone know how I can extract plain text from msg files as simply as with POI? Is there perhaps an open source library for this, like the ones for PDF or MS Office files?

I need the plain text because the only approach that seems workable for me is to extract all the plain text from every single document and then add it to my Lucene index. This is necessary to get the best excerpt from the highlighter.

Thanks,
Simon Dietschi

--
View this message in context: http://www.nabble.com/Lucene---FileFormat-t1485959.html#a4024568
Sent from the Lucene - Java Users forum at Nabble.com.
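
For illustration, the extract-then-index approach described in the message could be organised around a small registry that picks an extractor by file extension. This is only a minimal sketch under assumptions: the names ExtractorRegistry, TextExtractor, stripTags and withDefaults are hypothetical and do not come from PDFBox, POI, AtLeap or Lucene, and the HTML handling is a crude stand-in for a real parser.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical registry: maps a file extension to a plain-text extractor. */
final class ExtractorRegistry {

    /** Turns the raw bytes of one document into plain text. */
    public interface TextExtractor {
        String extract(byte[] content);
    }

    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    /** Registers an extractor for one extension (case-insensitive, without the dot). */
    public void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Returns the plain text, or null when no extractor is registered for the file type. */
    public String extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(extension);
        return extractor == null ? null : extractor.extract(content);
    }

    /** Crude tag stripper, for illustration only; a real HTML parser is more robust. */
    public static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    /** Registry with the two formats that need no third-party library (assumes UTF-8 input). */
    public static ExtractorRegistry withDefaults() {
        ExtractorRegistry registry = new ExtractorRegistry();
        registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        registry.register("html", bytes -> stripTags(new String(bytes, StandardCharsets.UTF_8)));
        // Extractors backed by PDFBox, POI or an msg library would be registered
        // here in the same way; they are left out because their APIs are not shown.
        return registry;
    }

    public static void main(String[] args) {
        ExtractorRegistry registry = withDefaults();
        byte[] page = "<p>Hello <b>world</b></p>".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("page.html", page));
    }
}
```

The string returned by extract would then be the text added to the index and handed to the highlighter. A null result marks a format, such as msg in this sketch, for which no extractor has been plugged in yet.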