lucene-java-user mailing list archives

From Christiaan Fluit <christiaan.fl...@aduna.biz>
Subject Re: Word files & Build vs. Buy?
Date Thu, 09 Feb 2006 12:09:57 GMT
Hello all,

I'm replying to two threads at once as what I have to say relates to both.

My company recently started an open source project called Aperture 
(http://sourceforge.net/projects/aperture), together with the German 
DFKI institute. The project is still very much in alpha stage, but I do 
believe we already have some code parts that could help people here.

Basically, it's a framework for crawling information sources (file 
systems, mail folders, websites, ...) and extracting as much information 
from them as possible. Besides full-text extraction, we also put a lot of 
effort into extracting and modeling the metadata occurring in these 
sources and document formats. Both parties have some proprietary code 
lying on the shelf that is being open sourced and ported to the Aperture 
architecture.

Now on to the raised questions:

arnaudbuffet@free.fr wrote:
> WordDocument wd = new WordDocument(is);

jwang@dicarta.com wrote:
> MS Word - I know that POI exists, but development on the Word portion
> seems to have stopped, and there are a lot of nasty looking bugs in
> their DB.  Since we're involved in dealing with contracts, many of our
> Word files are large and complicated.  How has everyone's experience
> with POI's Word parsing been?

My experience is that the WordDocument class fails on about 25% of the 
documents, i.e. it throws some sort of Exception. I've tested POI 
2.5.1-final as well as the current code in CVS, and both produce this 
result. I even suspect that the two produce identical output, but I 
haven't verified this.

Another reason I don't like this class is that it operates on an 
InputStream and internally creates a POIFSFileSystem which you cannot 
access. That makes it hard to also extract document metadata (for which 
you need the POIFSFileSystem) without buffering the entire InputStream. 
The same applies to TextMining's WordExtractor, which also operates on 
top of lower level POI components.
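
To illustrate the buffering workaround mentioned above, here is a minimal 
sketch using only the JDK. The class name ReplayableSource and its methods 
are made up for this example and are not part of POI or Aperture. The idea 
is simply to read the stream into memory once, so that a text extractor and 
a metadata extractor can each get their own fresh stream. The obvious cost 
is that the whole document sits in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: buffers a one-shot InputStream so it can be read
// several times (e.g. once for text, once for metadata).
class ReplayableSource {

    private final byte[] bytes;

    ReplayableSource(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // Copy until end of stream; the caller remains responsible for closing 'in'.
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        this.bytes = out.toByteArray();
    }

    // Each call returns an independent stream positioned at the start.
    InputStream open() {
        return new ByteArrayInputStream(bytes);
    }

    int size() {
        return bytes.length;
    }
}
```

Each consumer then calls open() and reads from the beginning, independent 
of what the other consumer has already read.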

I've recently committed a WordExtractor to Aperture that uses its own 
code operating on these lower level POI data structures, which works a 
lot better, failing on only 5% of my 300 test docs. I don't pretend to 
understand all the internals of the POI APIs, but it Works For Me.

When POI throws an exception, the WordExtractor falls back to a 
heuristic string extraction algorithm that pulls as much 
human-readable text as possible from the binary stream. This works 
quite well on MS Office files, i.e. the output is good enough for 
indexing purposes.
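
For those wondering what such a fallback could look like, here is a rough 
sketch. It is not the Aperture code. It assumes a much simpler heuristic: 
keep every run of printable ASCII bytes of at least four characters and 
drop everything else. The names HeuristicStringExtractor, TextParser and 
MIN_RUN_LENGTH are invented for this example. Real Word files also store 
text as UTF-16, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;

class HeuristicStringExtractor {

    // Hypothetical stand-in for a structured parser such as a POI-based one.
    interface TextParser {
        String parse(byte[] data) throws Exception;
    }

    // Assumed threshold: shorter runs are treated as binary noise.
    static final int MIN_RUN_LENGTH = 4;

    // Collects runs of printable ASCII characters from a raw byte buffer.
    static List<String> extractRuns(byte[] data) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c < 0x7F) {
                current.append((char) c);
            } else {
                flush(current, runs);
            }
        }
        flush(current, runs);
        return runs;
    }

    private static void flush(StringBuilder current, List<String> runs) {
        if (current.length() >= MIN_RUN_LENGTH) {
            runs.add(current.toString());
        }
        current.setLength(0);
    }

    // Tries the structured parser first; on any exception, falls back to
    // the heuristic and joins the recovered runs with spaces.
    static String extractWithFallback(byte[] data, TextParser primary) {
        try {
            return primary.parse(data);
        } catch (Exception e) {
            return String.join(" ", extractRuns(data));
        }
    }
}
```

The output of something like this is noisy, but for indexing that matters 
much less than it would for display.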

Be sure to check out Aperture from CVS as this code isn't part of the 
alpha 1 release. The next official release is expected in a month.

jwang@dicarta.com wrote:
> RTF - javax.swing looks fine, we use those classes already.

Swing's RTFEditorKit does indeed work surprisingly well. "Surprisingly" 
because in the past I had many issues with it, typically throwing 
exceptions on 25-50% of my test documents. Recently I haven't seen a 
single one (using Java 1.5.0), so either I am now feeding it a more 
favorable document set or the Swing people have worked on the 
implementation. In the latter case people using Java 1.4.x may see 
different results.
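
For reference, extracting plain text with RTFEditorKit takes only a few 
lines. This is a minimal sketch, not the Aperture code. The class name 
RtfTextExtractor is made up for the example. The Swing calls themselves 
are the standard ones.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

class RtfTextExtractor {

    // Parses the RTF stream into a Swing Document and returns its plain text.
    static String extractText(InputStream in) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        kit.read(in, doc, 0);
        return doc.getText(0, doc.getLength());
    }
}
```

It is worth catching RuntimeException around the call as well when 
processing documents in bulk, given the history of failures described above.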

> Word Perfect - There doesn't seem to be any converters for this format?

I'm actively working on this :) We have some proprietary code that will 
become part of Aperture. Right now I cannot say how well it performs in 
practice, but we've never had complaints with our proprietary apps.

The code uses a heuristic string extraction algorithm tuned for 
WordPerfect documents. Being heuristic, it may be an issue in some 
settings, e.g. when you also want to display the extraction results to 
end users.

If you're interested: one way you can help me get the most out of it is 
by sending me some example WordPerfect documents because I hardly have 
those on my hard drive. Fake documents made with very new or old 
WordPerfect versions are also most welcome.


Regards,

Chris
http://aduna.biz