lucene-java-user mailing list archives

From David Spencer <David.Spen...@micromuse.com>
Subject my experiences - Re: Parsing Word Docs
Date Wed, 05 Mar 2003 23:24:55 GMT
FYI, I tried the textmining.org/POI combo on a collection of 350 Word
docs people have developed here over the years, and it failed on 33% of them,
throwing exceptions about the formats being invalid.

I tried "antiword" (http://www.winfield.demon.nl/), a free native
executable, and it worked great (well, it seemed to process all the files
fine).

I've had similar experiences with PDF: I tried the 3 or so freeware Java
PDF text extractors, and they were not as good as the native executable,
pdftotext, from foolabs (http://www.foolabs.com/xpdf/).

Not satisfying to a Java developer, but these work better than anything
else I can find.

You get the source, and I use them on Windows & Linux, no problem.
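
For anyone who wants to call these tools from Java, here is a minimal
sketch of one way to do it. It is not code from my setup. It assumes
"antiword" and "pdftotext" are on the PATH, that "antiword <file>" and
"pdftotext <file> -" write the extracted text to standard output, and that
the output is UTF-8 (adjust the charset if your build differs). The class
and method names (ExternalTextExtractor, commandFor, extract) are
hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: picks a native extractor by file extension and
// captures whatever the tool prints on standard output.
final class ExternalTextExtractor {

    // Build the command line for a file. Assumes both tools are on the PATH.
    static List<String> commandFor(File file) {
        String name = file.getName().toLowerCase(Locale.ROOT);
        if (name.endsWith(".doc")) {
            // antiword prints the document text to stdout
            return Arrays.asList("antiword", file.getPath());
        }
        if (name.endsWith(".pdf")) {
            // "-" as the output file tells pdftotext to write to stdout
            return Arrays.asList("pdftotext", file.getPath(), "-");
        }
        throw new IllegalArgumentException("No extractor configured for " + file);
    }

    // Run the extractor and return its output as a string.
    static String extract(File file) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(commandFor(file));
        // Pass the tool's error messages through so stderr cannot fill up and block it
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();
        process.getOutputStream().close(); // the tools read nothing from stdin

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("Extractor exited with code " + exit + " for " + file);
        }
        // Assumption: the tool emits UTF-8
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

The string returned by extract() can then be put into whatever field you
index as the document body. Failing on a non-zero exit code means a file
the tool cannot parse is reported instead of being indexed as empty text.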



Eric Anderson wrote:

>I'm interested in using the textmining/textextraction utilities using Apache 
>POI, that Ryan was discussing. However, I'm having some difficulty determining 
>what the insertion point would be to replace the default parser with the word 
>parser. 
>
>Any assistance would be appreciated.
>
>
>
>
>
>LanRx Network Solutions, Inc.
>Providing Enterprise Level Solutions...On A Small Business Budget
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>  
>




