Mailing-List: contact poi-dev-help@jakarta.apache.org; run by ezmlm
Message-ID: <002301c31ed1$dbf6f960$0d8fcda3@tdryan>
Reply-To: "Ryan Ackley" <sackley@apache.org>
From: "Ryan Ackley" <sackley@cfl.rr.com>
To: "POI Developers List" <poi-dev@jakarta.apache.org>
References: <5.1.0.14.0.20030520134853.02a63000@mail.jahia.com>
Subject: Re: Interested pure Word text extraction patch ?
Date: Tue, 20 May 2003 09:15:17 -0400
MIME-Version: 1.0
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

You can submit it and I will try to find a place for it, but I am not
promising anything.

FYI, I wrote a little library to do text extraction from Word documents with
POI. I am using it for my thesis research. You can get it at
http://www.textmining.org .

You may have had a problem with some of your documents because they were
fast-saved. I didn't even attempt to support that because it didn't seem
worth the effort to support the very few documents that are fast-saved.

Ryan Ackley

----- Original Message ----- 
From: "Serge Huber" <shuber2@jahia.com>
To: <poi-dev@jakarta.apache.org>
Sent: Tuesday, May 20, 2003 7:49 AM
Subject: Interested pure Word text extraction patch ?


>
> Hi all,
>
> First of all, thanks for all the work that went into POI. I've only started
> working with the code recently, and I must say there's a kind of "magic" to
> finally reading file formats that seem to have been deliberately
> obfuscated :)
>
> Anyway, I have been working on integration of POI with Lucene, mostly to
> get Word file indexing working well enough to fit my needs. Despite the
> fact that I still have some problems with some "complex" files, the result
> is acceptable for now.
>
> I must admit that my modifications are quite "hacky", and I'm not sure
> they are fit for a real patch. Should I submit my modifications as they
> are to Bugzilla, or should I host them somewhere else so that people can
> try them out?
>
> The modifications I've done are :
> - deactivate formatting parsing. I didn't need it so I commented out the
> "findFormatting" in the WordDocument class
> - small patches here and there to remove exceptions
> - modifications to fall back to the main document text stream if parsing
> the piece tables seemed to give nothing (there seem to be a lot of
> problems with some files here, but I'm not knowledgeable enough about the
> format to know what I'm doing). And it seems the binary file format
> document is not telling us everything that is really going on here :(
> - modifications in the writeAllText method of the WordDocument
>
> The result I got :
> - I tested on the 384 Word files I found on my computer
> - 1 couldn't be parsed at all because of a signature problem (POIFS
> problem ?)
> - 3 were actually RTF files so they are ignored
> - 5 files seemed to have problems with piece tables. If I "Save As..." the
> files to transform them into "simple" files, the text extraction works
> fine. The piece table seemed to always point me to text after the value of
> fib.fcMax. Here I made a patch that reverts to the main document text
> stream in this case
> - 4 files had piece tables that covered some of the main document stream
> and some parts outside, which means I only got part of the text in my
> extractions.
> - the rest of the files worked very well !
>
> I'm sorry to say that most of these files are not test cases I could send
> off just like this as some of the data is personal and/or not for public
> eyes. I also seemed to have problems with the test case files that were
> included in POI, that don't even work on the real MS Word !
>
> Basically, what I have now is a method that looks like this :
>
>          public String HDFExtractor.getHDFContent(File f);
>
> That gives me a String containing all the text of an HDF encoded file. I
> then index this into Lucene to do the text indexing. It doesn't work with
> every Word file I've encountered but it's better than nothing for me.
>
> Let me know if you still want me to contribute my "hacks" (or patches if
> you prefer)...
>
> Regards,
>    Serge Huber.
>
>
> - -- --- -----=[ serge.huber at jahia dot com ]=---- --- -- -
> Jahia : A collaborative source CMS and Portal Server
> www.jahia.org Community and product web site
> www.jahia.com Commercial services company
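
As a rough illustration of the fallback described in the quoted message
(reverting to the main document text stream when the piece table only points
past fib.fcMax), here is a minimal sketch. Every name in it
(PieceTableFallback, Piece, rangesToRead) is hypothetical and is not part of
POI; the fcMin and fcMax parameters simply follow the naming used in the
message. It is a heuristic under those assumptions, not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from POI. */
final class PieceTableFallback {

    /** One piece-table entry: a byte range [start, end) in the file. */
    static final class Piece {
        final int start;
        final int end;

        Piece(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Decides which byte ranges to read text from.
     * If the piece table is empty, or every piece starts at or beyond fcMax,
     * the piece table is ignored and the whole main text range
     * [fcMin, fcMax) is returned instead. Otherwise the pieces are
     * returned unchanged.
     */
    static List<Piece> rangesToRead(List<Piece> pieces, int fcMin, int fcMax) {
        boolean anyInsideMainRange = false;
        for (Piece p : pieces) {
            if (p.start < fcMax) {
                anyInsideMainRange = true;
                break;
            }
        }
        List<Piece> result = new ArrayList<>();
        if (anyInsideMainRange) {
            result.addAll(pieces);
        } else {
            // Fallback: treat the main text stream as a single piece.
            result.add(new Piece(fcMin, fcMax));
        }
        return result;
    }
}
```

Note that this only covers the case where the whole piece table looks
unusable. Files whose piece tables point partly inside and partly outside
the main stream would still need separate handling.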