lucene-java-user mailing list archives

From "Bowesman Antony" <...@teamware.com>
Subject Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?
Date Tue, 13 May 2008 07:49:00 GMT
We are using POI 3.0.2 FINAL.  Like you, we find it unreliable for many Word
files.  It does not support Word 2 files, fast-saved files, or files which are
not padded to 256 bytes.  PPT and Excel are quite bad; a large percentage of our
PPT files throw exceptions.  I have not tried 3.1, as it has only just gone to
BETA 1, but I expect the Word parsing is unchanged, since the changelog doesn't
show any Word changes.

TextMining.org (http://www.textmining.org/) is quite good, but the 0.4 version
did not handle Word 2 or fast-saved files.  Version 1.0 should fix that, but
I've not yet tried it.  The licence for 1.0 is LGPL, whereas 0.4 was Apache 2.

AbiWord (http://www.abisource.com/) is pretty good, but it's a complete GUI
application, so it is quite slow if you want to use it for a lot of parsing.
It can do text extraction via the command line, and the Linux version supports
pipes.  It's based on wvWare (http://wvware.sourceforge.net/).

Catdoc (http://ftp.wagner.pp.ru/~vitus/software/catdoc/) is quite effective and
fast.  It also has catppt.  I'm not sure whether the extracted text always
follows the order of the original, though.

The last two (AbiWord and catdoc) are not licence-friendly for distribution.

I've extracted the Nutch parsing framework and am using it in our product.
Having tested all of the above, I use this priority order for Word parsing:
TextMining v0.4 first, then POI, and then the other two, which I plugged in via
the parse-ext parser.
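
A minimal sketch of that kind of priority chain, in case it helps.  It assumes
each parser is wrapped behind a small interface; the names FallbackExtractor and
Extractor are hypothetical and are not part of Nutch, POI or TextMining.  Each
backend is tried in the order it was added, and the first one that returns
non-empty text wins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Tries text extractors in priority order until one succeeds (illustrative only). */
final class FallbackExtractor {

    /** Hypothetical wrapper around one backend (TextMining, POI, an external command...). */
    interface Extractor {
        String extract(byte[] document) throws Exception;
    }

    // LinkedHashMap keeps insertion order, which is the priority order here.
    private final Map<String, Extractor> chain = new LinkedHashMap<>();

    FallbackExtractor add(String name, Extractor extractor) {
        chain.put(name, extractor);
        return this;
    }

    /** Returns the text from the first extractor that yields a non-empty result. */
    String extract(byte[] document) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, Extractor> entry : chain.entrySet()) {
            try {
                String text = entry.getValue().extract(document);
                if (text != null && !text.trim().isEmpty()) {
                    return text;
                }
                failures.add(entry.getKey() + ": empty result");
            } catch (Exception e) {
                // A parser that throws on a file simply hands over to the next one.
                failures.add(entry.getKey() + ": " + e);
            }
        }
        throw new IllegalStateException("All extractors failed: " + failures);
    }
}
```

The real extractors would be registered in the order given above, with each
add() call wrapping whatever API or command line the backend actually exposes.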

HTH
Antony





Lukas Vlcek wrote:
> Hi,
> 
> I need to find a reliable way to extract content from Word, Excel and
> PowerPoint formats prior to indexing, and I am not sure if POI is the best
> way to go. Can anybody share experience with POI and/or any other [commercial]
> Java library for text extraction from MS formats?
> 
> My experience with POI is that sometimes it can be a pain to get the
> content out of the MS files properly. I also know that the Nutch plugin uses
> POI for MS formats, but as far as I remember it is not 100% reliable (my
> experience, more than a year old, is that about 1-2% of files were not parsed).
> 
> My requirements are that the text extraction software must run on Linux and
> should be written in Java; it can be an open source or a commercial library.
> 
> Regards,
> Lukas
> 

