lucene-java-user mailing list archives

From "Ryan Ackley" <sack...@cfl.rr.com>
Subject Re: my experiences - Re: Parsing Word Docs
Date Thu, 06 Mar 2003 12:20:22 GMT
Eric,

The problem with antiword is that it is a native application. To use it from
Java you must either launch it as an external process or write a class that
uses JNI to access the native code. Either way, once you tie your Java code
to native code you have lost one of the biggest benefits of Java: platform
independence. I would suggest you use the library at http://textmining.org.
Contrary to what David Spencer says, it should work on all documents created
with Word 97 or above. I have indexed hundreds of thousands of unique
documents using my library.
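
If you do decide to use antiword anyway, one way to call it without JNI is
to launch it as a separate process and read the text it prints on standard
output. The sketch below is only an illustration under some assumptions: the
class and method names (ExternalTextExtractor, run, extractWordText) are made
up for this example, antiword is assumed to be on the PATH, and its output is
assumed to be UTF-8. The .doc file is only read, so it stays intact; the
returned string is what you would hand to your indexing code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Lucene, antiword, or textmining.org.
public class ExternalTextExtractor {

    /** Runs a command and returns everything it wrote to standard output. */
    public static String run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Let the tool's error messages go to our own stderr so they
        // do not get mixed into the extracted text.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        StringBuilder text = new StringBuilder();
        // Assumption: the tool emits UTF-8; adjust the charset if it does not.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                process.getInputStream(), StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(command + " exited with code " + exitCode);
        }
        return text.toString();
    }

    /** Assumes an "antiword" executable is installed and on the PATH. */
    public static String extractWordText(String docPath)
            throws IOException, InterruptedException {
        return run(Arrays.asList("antiword", docPath));
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ExternalTextExtractor <file.doc>");
            return;
        }
        System.out.println(extractWordText(args[0]));
    }
}
```

Note that this still depends on a native binary being installed on every
machine that runs the indexer, which is the portability cost I mentioned.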

Ryan Ackley

----- Original Message -----
From: "Eric Anderson" <Eric.Anderson@LanRx.com>
To: "Lucene Users List" <lucene-user@jakarta.apache.org>
Sent: Wednesday, March 05, 2003 7:14 PM
Subject: Re: my experiences - Re: Parsing Word Docs


> Ok. Thanks for the tip.
>
> I downloaded and compiled Antiword, and would like to now add it to my
> indexing class. However, I'm not sure how the application would be called,
> and from where it would be called.
>
> How will I have the class parse the document through Antiword to create
> the keyword index, but leaving the DOC intact, as Mr. Litchfield did with
> PDFBox?
>
> Your assistance is greatly appreciated.
>
> Eric Anderson
> 815-505-6132
>
>
> Quoting David Spencer <David.Spencer@micromuse.com>:
>
> > FYI, I tried the textmining.org/POI combo on a collection of 350 Word
> > docs people have developed here over the years, and it failed on 33% of
> > them with exceptions being thrown about the formats being invalid.
> >
> > I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free
> > *.exe, and it worked great (well, it seemed to process all the files
> > fine).
> >
> > I've had similar experiences with PDF - I tried the 3 or so freeware/java
> > PDF text extractors and they were not as good as the exe, pdftotext,
> > from foolabs (http://www.foolabs.com/xpdf/).
> >
> > Not satisfying to a java developer but these work better than anything
> > else I can find.
> >
> > You get source and I use them on windows & linux, no prob.
> >
> >
> >
> > Eric Anderson wrote:
> >
> > >I'm interested in using the textmining/textextraction utilities using
> > >Apache POI, that Ryan was discussing. However, I'm having some difficulty
> > >determining what the insertion point would be to replace the default
> > >parser with the word parser.
> > >
> > >Any assistance would be appreciated.
> > >
> > >
> > >
> > >
> > >
> > >LanRx Network Solutions, Inc.
> > >Providing Enterprise Level Solutions...On A Small Business Budget
> > >
> > >---------------------------------------------------------------------
> > >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > >For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > >
> > >
> > >
> >
> >
> >
> >
>
> LanRx Network Solutions, Inc.
> Providing Enterprise Level Solutions...On A Small Business Budget
>
>



