lucene-java-user mailing list archives

From Robert.Hasti...@ancept.com
Subject Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?
Date Tue, 13 May 2008 13:22:03 GMT
We are using Aspose: www.aspose.com.  We are still in pre-release; it 
works fine for all of the MS products.  It's commercial, but it is a good 
deal as long as you don't have too many developers working on it, since 
the licensing is per seat.  We had a little trouble with their PDF 
product.  The other thing is that their main product line is .NET, but the 
Java line has kept up pretty well.  For text extraction the APIs are 
straightforward.





mark harwood <markharw00d@yahoo.co.uk> 
05/13/2008 07:44 AM
Please respond to
java-user@lucene.apache.org


To
java-user@lucene.apache.org
cc

Subject
Re: Can POI provide reliable text extraction results for production search 
engine for Word, Excel and PowerPoint formats?






On the commercial front, Oracle's "Outside In" (previously Stellent) is 
the one that gets used in a lot of search engines.

Being a C-based product though, integration isn't quite as nice/easy as 
pure Java solutions.


----- Original Message ----
From: Bowesman Antony <adb@teamware.com>
To: java-user@lucene.apache.org
Sent: Tuesday, 13 May, 2008 8:49:00 AM
Subject: Re: Can POI provide reliable text extraction results for 
production search engine for Word, Excel and PowerPoint formats?

We are using POI 3.0.2 FINAL.  Like you, we find it is not very reliable for 
many Word files.  It does not support Word 2 files, fast-saved files, or files 
which are not padded to 256 bytes.  PPT and Excel are quite bad; a large 
percentage of our PPT files throw exceptions.  I have not tried 3.1, as it has 
just gone BETA 1, but I expect the Word parsing is unchanged, as the 
changelog doesn't show any Word changes.

TextMining.org http://www.textmining.org/ is quite good, but the 0.4 version 
did not handle Word 2 or fast-saved files.  The 1.0 version should fix that, 
but I've not yet tried it.  The licence for 1.0 is LGPL, whereas 0.4 was 
Apache 2.

AbiWord http://www.abisource.com/ is pretty good, but it's a complete GUI 
application, so it is quite slow if you want to use it for a lot of parsing.  
It can do text extraction via the command line.  The Linux version supports 
pipes.  It's based on wvWare http://wvware.sourceforge.net/

Catdoc (http://ftp.wagner.pp.ru/~vitus/software/catdoc/) is quite effective 
and fast.  It also has catppt.  I'm not sure if the text order matches the 
original 100%, though.
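
For anyone wiring a command-line extractor such as catdoc into a Java 
indexer, here is a minimal sketch of one way to do it.  It assumes the tool 
takes the file path as its last argument and writes plain text to stdout, 
and that the output is UTF-8 (that depends on how the tool is configured).  
The class name ExternalTextExtractor and the timeout parameter are made up 
for illustration; they are not part of catdoc, AbiWord, Nutch or Lucene.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs an external tool and returns what it printed to stdout. */
final class ExternalTextExtractor {
    private final List<String> baseCommand;
    private final long timeoutSeconds;

    ExternalTextExtractor(List<String> baseCommand, long timeoutSeconds) {
        this.baseCommand = new ArrayList<>(baseCommand);
        this.timeoutSeconds = timeoutSeconds;
    }

    String extract(Path document) throws IOException, InterruptedException {
        // Assumption: the tool accepts the document path as its final argument.
        List<String> command = new ArrayList<>(baseCommand);
        command.add(document.toString());

        // Send stdout to a temp file so a large document cannot fill the pipe
        // buffer and stall the child process while we wait for it.
        Path output = Files.createTempFile("extracted-", ".txt");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(output.toFile())
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();

            // Kill the tool if it hangs on a damaged file.
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("Timed out: " + command);
            }
            if (process.exitValue() != 0) {
                throw new IOException("Exit code " + process.exitValue() + ": " + command);
            }
            // Assumption: the tool emits UTF-8 text.
            return new String(Files.readAllBytes(output), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(output);
        }
    }
}
```

With catdoc on the PATH, usage would look something like 
new ExternalTextExtractor(List.of("catdoc"), 30).extract(Paths.get("report.doc")).  
Starting one process per document is slow for bulk indexing, so this 
probably fits best as a fallback behind an in-process parser.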

The last two are not licence friendly for distribution.

I've extracted the Nutch parsing framework and am using it in our product.  
I have tested all of the above, and the priority for Word parsing is 
TextMining v0.4, then POI, and then the other two, which I plugged in via 
the parse-ext parser.
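
The priority ordering described above amounts to a fallback chain: try each 
parser in turn and take the first one that returns usable text.  Below is a 
rough sketch of that idea.  The TextExtractor interface and FallbackExtractor 
class are hypothetical names for illustration only; this is not the Nutch 
plugin API, and the real TextMining and POI calls would sit behind the 
interface.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical abstraction over one parser (TextMining, POI, an external tool, ...). */
interface TextExtractor {
    String extract(Path file) throws Exception;
}

/** Tries each registered extractor in insertion order and returns the first non-empty result. */
final class FallbackExtractor implements TextExtractor {
    // LinkedHashMap keeps registration order, which is the priority order.
    private final Map<String, TextExtractor> delegates = new LinkedHashMap<>();

    FallbackExtractor add(String name, TextExtractor extractor) {
        delegates.put(name, extractor);
        return this;
    }

    @Override
    public String extract(Path file) throws IOException {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, TextExtractor> entry : delegates.entrySet()) {
            try {
                String text = entry.getValue().extract(file);
                if (text != null && !text.trim().isEmpty()) {
                    return text;
                }
                failures.add(entry.getKey() + ": empty result");
            } catch (Exception e) {
                // Parsers tend to throw unchecked exceptions on damaged files,
                // so catch broadly and move on to the next extractor.
                if (e instanceof InterruptedException) {
                    Thread.currentThread().interrupt();
                }
                failures.add(entry.getKey() + ": " + e);
            }
        }
        throw new IOException("No extractor could parse " + file + " " + failures);
    }
}
```

Collecting the per-parser failures in the final exception makes it easier to 
see afterwards which documents defeated every parser and why.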

HTH
Antony





Lukas Vlcek wrote:
> Hi,
> 
> I need to find a reliable way to extract content out of Word, Excel and
> PowerPoint formats prior to indexing and I am not sure if POI is the best
> way to go. Can anybody share experience with POI and/or other [commercial]
> Java libraries for text extraction from MS formats?
> 
> My experience with POI is that sometimes it can be a pain to get the
> content out of the MS files properly. I also know that the Nutch plugin
> uses POI for MS formats, but as far as I remember it is not 100% reliable
> (my more than one year old experience is that about 1-2% of files were not
> parsed).
> 
> My requirements are that the text extraction software must run on Linux and
> should be written in Java; it can be an open source or commercial library.
> 
> Regards,
> Lukas
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





