lucene-java-user mailing list archives

From Eric Anderson <Eric.Ander...@LanRx.com>
Subject Re: my experiences - Re: Parsing Word Docs
Date Thu, 06 Mar 2003 00:14:51 GMT
Ok. Thanks for the tip.

I downloaded and compiled Antiword and would now like to add it to my indexing
class. However, I'm not sure how the application should be called, or from
where.

How do I have the class run the document through Antiword to build the
keyword index while leaving the DOC intact, as Mr. Litchfield did with PDFBox?

Your assistance is greatly appreciated.

Eric Anderson
815-505-6132
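
A rough sketch of one way the call asked about above could look, offered as an
illustration under stated assumptions rather than a tested recipe: the indexing
class starts Antiword as an external process, reads the plain text it prints to
standard output, and indexes that text, so the DOC itself is only read and
never modified. The class and method names (ExternalTextExtractor,
runAndCapture, extractWordText) are made up for this example. It assumes an
"antiword" executable on the PATH that prints the extracted text to stdout when
given a file name, that the output can be decoded with the JVM default charset,
and Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of Lucene, Antiword, or PDFBox.
public class ExternalTextExtractor {

    /** Runs the given command and returns everything it wrote to stdout. */
    public static String runAndCapture(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Discard stderr so warnings from the tool do not end up in the indexed text.
        pb.redirectError(ProcessBuilder.Redirect.DISCARD);
        try {
            Process process = pb.start();
            process.getOutputStream().close(); // the tool gets no input on stdin
            String text;
            try (InputStream in = process.getInputStream()) {
                // Assumption: the tool's output uses the JVM default charset.
                text = new String(in.readAllBytes(), Charset.defaultCharset());
            }
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IllegalStateException(
                        "Command " + command + " exited with code " + exitCode);
            }
            return text;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for " + command, e);
        }
    }

    /** Extracts text from a Word file; assumes "antiword" is on the PATH. */
    public static String extractWordText(String docPath) {
        List<String> command = new ArrayList<>();
        command.add("antiword");
        command.add(docPath);
        return runAndCapture(command);
    }
}
```

In the indexing class, the string returned by extractWordText(path) would go
into the indexed contents field of the Lucene document, with the file path kept
in a separate stored field so a search hit can point back to the original DOC.
The same helper should work for pdftotext, provided the command is set up to
write its text to stdout.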


Quoting David Spencer <David.Spencer@micromuse.com>:

> FYI, I tried the textmining.org/POI combo on a collection of 350 Word docs
> that people here have developed over the years, and it failed on 33% of
> them, throwing exceptions about the formats being invalid.
> 
> I tried "antiword" ( http://www.winfield.demon.nl/ ), a free native *.exe,
> and it worked great (well, it seemed to process all the files fine).
> 
> I've had similar experiences with PDF: I tried the 3 or so freeware/Java PDF
> text extractors and they were not as good as the exe, pdftotext, from
> foolabs ( http://www.foolabs.com/xpdf/ ).
> 
> Not satisfying to a Java developer, but these work better than anything
> else I can find.
> 
> You get source, and I use them on Windows & Linux, no prob.
> 
> Eric Anderson wrote:
> 
> >I'm interested in using the textmining/textextraction utilities built on
> >Apache POI that Ryan was discussing. However, I'm having some difficulty
> >determining what the insertion point would be to replace the default parser
> >with the Word parser.
> >
> >Any assistance would be appreciated.
> >
> >LanRx Network Solutions, Inc.
> >Providing Enterprise Level Solutions...On A Small Business Budget
> >
> >---------------------------------------------------------------------
> >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> >For additional commands, e-mail: lucene-user-help@jakarta.apache.org

LanRx Network Solutions, Inc.
Providing Enterprise Level Solutions...On A Small Business Budget


