lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dmitry Goldenberg" <dmitry.goldenb...@weblayers.com>
Subject RE: Word files & Build vs. Buy?
Date Thu, 09 Feb 2006 16:06:27 GMT
Chris,
 
Awesome stuff. A few questions: is your Excel extractor somehow better than POI's? and, what
do you see as the timeframe for adding WordPerfect support? Are you considering supporting
any other sources such as MS Project, Framemaker, etc?
 
Thanx,
- Dmitry

________________________________

From: Christiaan Fluit [mailto:christiaan.fluit@aduna.biz]
Sent: Thu 2/9/2006 4:09 AM
To: java-user@lucene.apache.org
Subject: Re: Word files & Build vs. Buy?



Hello all,

I'm replying to two threads at once as what I have to say relates to both.

My company recently started an open source project called Aperture
(http://sourceforge.net/projects/aperture), together with the German
DFKI institute. The project is still very much in alpha stage, but I do
believe we already have some code parts that could help people here.

Basically, it's a framework for crawling information sources (file
systems, mail folders, websites, ...) and extracting as much information
from it as possible. Besides full-text extraction, we also put a lot of
effort in extraction and modeling of the metadata occurring in these
sources and document formats. Both parties have some proprietary code
lying on the shelf that is being open sourced and ported to the Aperture
architecture.

Now on to the raised questions:

arnaudbuffet@free.fr wrote:
> WordDocument wd = new WordDocument(is);

jwang@dicarta.com wrote:
> MS Word - I know that POI exists, but development on the Word portion
> seems to have stopped, and there are a lot of nasty looking bugs in
> their DB.  Since we're involved in dealing with contracts, many of our
> Word files are large and complicated.  How has everyone's experience
> with POI's Word parsing been?

My experience is that the WordDocument class crashes on about 25% of the
documents, i.e. it throws some sort of Exception. I've tested POI
2.5.1-final as well as the current code in CVS, but both produce this
result. I even suspect the output to be 100% the same, but I haven't
verified this.

Another reason I don't like this class is that it operates on an
InputStream and internally creates a POIFSFileSystem which you cannot
access, so that it becomes hard to extract document metadata as well
(for which you need the PFSFS) without buffering the entire InputStream.
The same applies to TextMining's WordExtractor, which also operates on
top of lower level POI components.

I've recently committed a WordExtractor to Aperture that uses its own
code operating on these lower level POI datastructures, which works a
lot better, failing only 5% of my 300 test docs. I don't pretend to
understand all the internals of the POI APIs, but it Works For Me.

When POI throws an exception, the WordExtractor will revert to applying
a heuristic string extraction algorithm to extract as much
human-readable text as possible from the binary stream, which works
quite well on MS Office files, i.e. the output is reasonably well for
indexing purposes.

Be sure to checkout Aperture from CVS as this code isn't part of the
alpha 1 release. A next official release is expected in a month.

jwang@dicarta.com wrote:
> RTF - javax.swing looks fine, we use those classes already.

Swing's RTFEditorKit does indeed work surpringly well. "Surprisingly"
because in the past I had many issues with it, typically throwing
exceptions on 25-50% of my test documents. Recently I haven't seen a
single one (using Java 1.5.0), so either I am now feeding it a more
optimal document set or the Swing people have worked on the
implementation. In that case people using Java 1.4.x may see different
results.

> Word Perfect - There doesn't seem to be any converters for this format?

I'm actively working on this :) We have some proprietary code that will
become part of Aperture. Right now I cannot say how well it performs in
practice though, although we've never had complaints with our
proprietary apps.

The code uses a heuristic string extraction algorithm tuned for
WordPerfect documents. This may be an issue, e.g. when you also want to
display the extraction results to end users.

If you're interested: one way you can help me get the most out of it is
by sending me some example WordPerfect documents because I hardly have
those on my hard drive. Fake documents made with very new or old
WordPerfect versions are also most welcome.


Regards,

Chris
http://aduna.biz
--

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





Mime
View raw message