We are using Aspose: www.aspose.com. We are still in pre-release, it
works fine for all of the MS products. It's commercial, but is a good
deal as long as you don't have too many developers working on it, since
the licensing is per seat. We had a little trouble with thier PDF
product. The other thing is that their main product line is .NET but the
Java line has kept up pretty well. For text extraction the APIs are
straight forward.
mark harwood <markharw00d@yahoo.co.uk>
05/13/2008 07:44 AM
Please respond to
java-user@lucene.apache.org
To
java-user@lucene.apache.org
cc
Subject
Re: Can POI provide reliable text extraction results for productionsearch
engine for Word, Excel and PowerPoint formats?
On the commercial front, Oracle's "Outside In" (previously Stellent) is
the one that gets used in a lot of search engines.
Being a C-based product though, integration isn't quite as nice/easy as
pure Java solutions.
----- Original Message ----
From: Bowesman Antony <adb@teamware.com>
To: java-user@lucene.apache.org
Sent: Tuesday, 13 May, 2008 8:49:00 AM
Subject: Re: Can POI provide reliable text extraction results for
productionsearch engine for Word, Excel and PowerPoint formats?
We are using POI 3.0.2 FINAL. Like you, it is not very reliable for many
Word
files. It does not support Word 2, Fast saved files, files which are not
padded
to 256 bytes. PPT and Excel are quite bad, a large % of our PPT files
throw
Exceptions. Not tried 3.1 as it's just gone BETA 1, but I expect that the
Word
parsing is unchanged and the changelog doesn't show any Word changes.
TestMining.org http://www.textmining.org/ is quite good, but the 0.4
version did
not do Word 2 or Fast Saved files. 1.0 version should fix that, but I've
not
yet tried it. Licene for 1.0 is LGPL, whereas 0.4 was Apache 2.
AbiWord http://www.abisource.com/ is pretty good, but it's a complete GUI
so is
quite slow if you want to use it for a lot of parsing. It can do text
extraction via the command line. The Linux versions suports pipes. It's
based on WvWare http://wvware.sourceforge.net/
Catdoc (http://ftp.wagner.pp.ru/~vitus/software/catdoc/) is quite
effective,
fast. It also has catppt. I'm not sure if the text order is 100%
according to
the original though.
The last two are not licence friendly for distribution.
I've extracted the Nutch parsing framework and am using it in our product
and
have tested all of the above and the priority for Word parsing is
TextMining
v0.4, before POI and then the other two which I plugged in via the
parse-ext parser.
HTH
Antony
Lukas Vlcek wrote:
> Hi,
>
> I need to find a reliable way how to extract content out of Word, Excel
and
> PowerPoint formats prior to indexing and I am not sure if POI is the
best
> way to go. Can anybody share experience with POI and/or other
[commercial]
> Java library for text extraction from MS formats?
>
> My experience with POI is such that sometimes it can be a pain to get
the
> content out of the MS files properly. I also know that Nutch plugin uses
POI
> for MS formats but as far as I remember it is not 100% reliable (my more
> then one year old experience is that about 1-2% of files were not
parsed).
>
> My requirements are that the text extraction software must run on Linux
and
> should be written in Java, it can be open source or commercial library.
>
> Regards,
> Lukas
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
__________________________________________________________
Sent from Yahoo! Mail.
A Smarter Email http://uk.docs.yahoo.com/nowyoucan.html
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
|