From: "Ryan Ackley"
To: "POI Developers List"
Subject: Re: Interested pure Word text extraction patch ?
Date: Tue, 20 May 2003 09:15:17 -0400

You can submit it and I will try to find a place for it, but I am not
promising anything.

FYI, I wrote a little library to do text extraction from Word documents
with POI. I am using it for my thesis research. You can get it at
http://www.textmining.org .

You may have had a problem with some of your documents because they were
fast-saved. I didn't even attempt to support that because it didn't seem
worth the effort for the very few documents that are fast-saved.

Ryan Ackley

----- Original Message -----
From: "Serge Huber"
To:
Sent: Tuesday, May 20, 2003 7:49 AM
Subject: Interested pure Word text extraction patch ?

> Hi all,
>
> First of all, thanks for all the work that went into POI.
> I've only started working with the code recently, and I must say there's
> a kind of "magic" to finally reading file formats that seem to have been
> deliberately obfuscated :)
>
> Anyway, I have been working on integrating POI with Lucene, mostly to
> get Word file indexing working well enough to fit my needs. Although I
> still have problems with some "complex" files, the result is acceptable
> for now.
>
> I must admit that my modifications are quite "hacky", and I'm not sure
> they are fit for a real patch. Should I submit my modifications as they
> are to Bugzilla, or should I host them somewhere else so that people can
> try them out?
>
> The modifications I've made are:
> - deactivated formatting parsing. I didn't need it, so I commented out
>   the "findFormatting" call in the WordDocument class
> - small patches here and there to remove exceptions
> - modifications to fall back to the main document text stream if parsing
>   the piece tables seemed to yield nothing (there seem to be a lot of
>   problems with some files here, but I don't know the format well enough
>   to know what I'm doing). It also seems the binary file format document
>   is not telling us everything that is really going on here :(
> - modifications to the writeAllText method of the WordDocument class
>
> The results I got:
> - I tested on the 384 Word files I found on my computer
> - 1 couldn't be parsed at all because of a signature problem (POIFS
>   problem ?)
> - 3 were actually RTF files, so they were ignored
> - 5 files seemed to have problems with piece tables. If I "Save As..."
>   the files to turn them into "simple" files, the text extraction works
>   fine. The piece table always seemed to point me to text after the
>   value of fib.fcMax.
>   Here I made a patch that reverts to the main document text stream in
>   this case
> - 4 files had piece tables that covered some of the main document stream
>   and some parts outside it, which means I only got part of the text in
>   my extractions.
> - the rest of the files worked very well!
>
> I'm sorry to say that most of these files are not test cases I could
> simply send off, as some of the data is personal and/or not for public
> eyes. I also seemed to have problems with the test case files that are
> included in POI, which don't even open in the real MS Word!
>
> Basically, what I can offer now is a method that looks like this:
>
> public String HDFExtractor.getHDFContent(File f);
>
> It gives me a String containing all the text of an HDF-encoded file. I
> then feed this String into Lucene for text indexing. It doesn't work
> with every Word file I've encountered, but it's better than nothing for
> me.
>
> Let me know if you still want me to contribute my "hacks" (or patches if
> you prefer)...
>
> Regards,
> Serge Huber.
>
> - -- --- -----=[ serge.huber at jahia dot com ]=---- --- -- -
> Jahia : A collaborative source CMS and Portal Server
> www.jahia.org Community and product web site
> www.jahia.com Commercial services company
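
For anyone following the thread, the fallback rule Serge describes (use the
piece table unless it is empty or points outside the main document text
stream) might look roughly like the sketch below. This is not code from POI
or from Serge's patch. The Piece class, the method names and the fcMin
parameter are hypothetical, and the sketch assumes the main text occupies the
byte range [fcMin, fcMax) of the file. If the affected documents are
fast-saved, as the reply suggests, text stored outside that range may be
legitimate, so a fallback like this could well drop part of the content.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: none of these names come from POI.
final class PieceTableFallback {

    /** Hypothetical piece-table entry: where a run of text starts in the file and how long it is. */
    static final class Piece {
        final int fileOffset;
        final int length;

        Piece(int fileOffset, int length) {
            this.fileOffset = fileOffset;
            this.length = length;
        }
    }

    /** True if the piece lies entirely inside the assumed main text range [fcMin, fcMax). */
    static boolean insideMainStream(Piece p, int fcMin, int fcMax) {
        long end = (long) p.fileOffset + p.length; // long avoids int overflow on bad input
        return p.length > 0 && p.fileOffset >= fcMin && end <= fcMax;
    }

    /**
     * Trust the piece table only if it has entries and every entry stays
     * inside the main text range; otherwise the caller would read the main
     * document text stream directly.
     */
    static boolean shouldUsePieceTable(List<Piece> pieces, int fcMin, int fcMax) {
        if (pieces.isEmpty()) {
            return false; // piece-table parsing yielded nothing
        }
        for (Piece p : pieces) {
            if (!insideMainStream(p, fcMin, fcMax)) {
                return false; // piece points past fcMax (or before fcMin)
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Piece> inside = Arrays.asList(new Piece(1536, 100), new Piece(1636, 364));
        List<Piece> outside = Arrays.asList(new Piece(1536, 100), new Piece(4000, 50));
        List<Piece> none = Collections.emptyList();

        System.out.println(shouldUsePieceTable(inside, 1536, 2000));
        System.out.println(shouldUsePieceTable(outside, 1536, 2000));
        System.out.println(shouldUsePieceTable(none, 1536, 2000));
    }
}
```

Rejecting the whole table when a single piece falls outside the range
corresponds to the "5 files" case in the results above. The "4 files" case,
where only some pieces fall outside, would need those pieces resolved rather
than skipped to recover the full text.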