FYI, I tried the textmining.org/POI combo on a collection of 350 Word
docs that people have developed here over the years, and it failed on 33%
of them, throwing exceptions about the formats being invalid.
I tried "antiword" ( http://www.winfield.demon.nl/ ), a free native
executable, and it worked great (well, it seemed to process all the
files fine).
I've had similar experiences with PDF: I tried the 3 or so freeware/Java
PDF text extractors, and they were not as good as the native executable,
pdftotext, from foolabs ( http://www.foolabs.com/xpdf/ ).
Not satisfying to a Java developer, but these work better than anything
else I can find.
You get source, and I use them on Windows & Linux with no problems.
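
For anyone who wants to call these tools from a Java indexer, here is a
minimal sketch of one way to do it, not a tested integration. It assumes
"antiword" and "pdftotext" are on the PATH, that the tool's output is in
the platform default encoding, and that Java 9 or later is available (for
readAllBytes). The class and method names (ExternalTextExtractor,
commandFor, extract) are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: shells out to antiword / pdftotext and returns the text.
public class ExternalTextExtractor {

    /** Picks the external command for a file, based only on its extension. */
    static List<String> commandFor(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        if (name.endsWith(".doc")) {
            // antiword writes the extracted text to standard output.
            return Arrays.asList("antiword", file.toString());
        }
        if (name.endsWith(".pdf")) {
            // A trailing "-" asks pdftotext to write to standard output.
            return Arrays.asList("pdftotext", file.toString(), "-");
        }
        throw new IllegalArgumentException("No extractor for " + file);
    }

    /** Runs the external tool and returns whatever it printed on stdout. */
    static String extract(Path file) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(commandFor(file));
        // Let the tool's error messages go to our stderr so that the child
        // cannot block on a full, unread stderr pipe.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        byte[] output;
        try (InputStream in = process.getInputStream()) {
            output = in.readAllBytes();
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Extractor exited with code " + exitCode
                    + " for " + file);
        }
        // Assumption: the tool emits text in the platform default encoding.
        return new String(output, Charset.defaultCharset());
    }

    public static void main(String[] args) throws Exception {
        for (String arg : args) {
            System.out.println(extract(Paths.get(arg)));
        }
    }
}
```

Run it as "java ExternalTextExtractor report.doc manual.pdf". The string
returned by extract() can then be put into whatever field you index. A
non-zero exit code is treated as a failure, so bad files show up as
exceptions rather than as empty documents.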
Eric Anderson wrote:
>I'm interested in using the textmining/textextraction utilities using Apache
>POI, that Ryan was discussing. However, I'm having some difficulty determining
>what the insertion point would be to replace the default parser with the word
>parser.
>
>Any assistance would be appreciated.
>
>LanRx Network Solutions, Inc.
>Providing Enterprise Level Solutions...On A Small Business Budget
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>