lucene-java-user mailing list archives

From Christiaan Fluit <>
Subject Re: Word files & Build vs. Buy?
Date Fri, 10 Feb 2006 15:04:47 GMT
Dmitry Goldenberg wrote:
> Awesome stuff. A few questions: is your Excel extractor somehow
> better than POI's? and, what do you see as the timeframe for adding
> WordPerfect support? Are you considering supporting any other sources
> such as MS Project, Framemaker, etc?

I just committed a WordPerfectExtractor ;)

It's based on code developed in-house at Aduna and it seems to work 
quite well on my test collection of WordPerfect documents. Occasionally 
a word is split in the middle; I'm still looking into that.

The test set is biased toward older WordPerfect documents, though. I'm 
trying to get my hands on a recent copy of WordPerfect to see whether 
the latest format is also supported and to create unit tests for it.

To interactively test the extractor(s) yourselves:

- check out Aperture from CVS (see
- do "ant release"
- go to build\release\bin and execute fileinspector.bat
- drag any file (WordPerfect or any other format) to see what MIME type 
Aperture thinks it is and to execute the corresponding Extractor, if 
available. The two tabs show the extracted full-text and an RDF dump of 
the metadata. For WordPerfect, only full-text extraction is currently 
supported.

Our ExcelExtractor is basically nothing more than glue code between POI 
and the rest of our framework, meaning that an application using the 
framework can request an Extractor implementation for 
"application/", feed it an InputStream and get the text and 
metadata back.

The only advantage of our ExcelExtractor over direct use of POI is that, 
when POI throws an Exception on a particular document, it falls back to 
a heuristic string extraction algorithm, which is often able to extract 
full-text of reasonable quality, i.e. good enough for indexing.
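
To make the fallback idea concrete, here is a minimal sketch of the 
pattern. It is not Aperture's actual code: the PrimaryParser interface, 
the class name and the "runs of printable ASCII" heuristic are all 
assumptions made up for illustration (the heuristic is similar in 
spirit to the Unix "strings" tool), and a real extractor would also 
have to deal with character encodings and metadata.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class FallbackExtractorSketch {

    /** Hypothetical stand-in for a format-specific parser such as POI. */
    interface PrimaryParser {
        String parse(byte[] document) throws Exception;
    }

    /**
     * Assumed heuristic: keep every run of at least minRun printable
     * ASCII bytes and join the runs with single spaces.
     */
    static String heuristicStrings(byte[] data, int minRun) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if (c >= 0x20 && c < 0x7F) {
                run.append((char) c);
            } else {
                flush(run, out, minRun);
            }
        }
        flush(run, out, minRun);
        return out.toString();
    }

    /** Appends the current run to out if it is long enough, then resets it. */
    private static void flush(StringBuilder run, StringBuilder out, int minRun) {
        if (run.length() >= minRun) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(run);
        }
        run.setLength(0);
    }

    /** Buffers the whole stream so it can be read a second time on failure. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    /** Tries the primary parser first; on any Exception, uses the heuristic. */
    static String extract(InputStream in, PrimaryParser parser) throws IOException {
        byte[] data = readAll(in);
        try {
            return parser.parse(data);
        } catch (Exception e) {
            return heuristicStrings(data, 4);
        }
    }

    public static void main(String[] args) throws IOException {
        // Binary junk around two words; "ab" is too short to be kept.
        byte[] doc = {0x00, 0x01, 'H', 'e', 'l', 'l', 'o', 0x00,
                      'a', 'b', 0x02, 'w', 'o', 'r', 'l', 'd', (byte) 0xFF};
        PrimaryParser failing = d -> {
            throw new IllegalStateException("unparseable document");
        };
        // prints "Hello world"
        System.out.println(extract(new ByteArrayInputStream(doc), failing));
    }
}
```

The stream is buffered up front because an InputStream generally cannot 
be rewound once the primary parser has consumed part of it.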

We are certainly considering supporting more formats. Which ones we will 
work on depends on a number of factors, e.g. availability of open source 
libs for that format, complexity of the file format (we did WordPerfect 
by ourselves), customer demand, code contributions from others, etc. In 
any case, if you need support for format XYZ, you can always send me 
some example files and I'll take a look at how hard it is to add support 
for it.

