lucene-java-user mailing list archives

From Eric Anderson <Eric.Ander...@LanRx.com>
Subject Re: my experiences - Re: Parsing Word Docs
Date Thu, 06 Mar 2003 12:03:58 GMT
I'll go either way, but I still don't know how to implement the Word parser, as
opposed to the PDF parser or HTML parser.
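
One way to picture choosing the Word parser over the PDF or HTML parser is a lookup keyed on file extension. The sketch below is only an illustration under stated assumptions: `ParserDispatch`, `TextParser`, and the registered lambdas are hypothetical names, not part of Lucene, PDFBox, or the textmining.org library, and the lambda bodies are placeholders standing in for whatever real extractor is eventually used.

```java
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: picks a text extractor by file extension.
class ParserDispatch {

    /** Turns a file into plain text for indexing; the source file is only read. */
    interface TextParser {
        String parse(File file) throws IOException;
    }

    private final Map<String, TextParser> parsers = new HashMap<>();

    /** Registers a parser for an extension such as "doc" (case-insensitive). */
    void register(String extension, TextParser parser) {
        parsers.put(extension.toLowerCase(Locale.ROOT), parser);
    }

    /** Returns the lower-cased extension, or "" if the name has no dot. */
    static String extensionOf(File file) {
        String name = file.getName();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Looks up the parser for the file's extension and runs it. */
    String parse(File file) throws IOException {
        TextParser parser = parsers.get(extensionOf(file));
        if (parser == null) {
            throw new IOException("No parser registered for " + file.getName());
        }
        return parser.parse(file);
    }

    public static void main(String[] args) throws IOException {
        ParserDispatch dispatch = new ParserDispatch();
        // Placeholder parsers (assumption): real code would call an actual extractor here.
        dispatch.register("doc", f -> "word text from " + f.getName());
        dispatch.register("pdf", f -> "pdf text from " + f.getName());
        dispatch.register("htm", f -> "html text from " + f.getName());
        System.out.println(dispatch.parse(new File("report.DOC")));
    }
}
```

With this shape, the indexing class calls `parse` once per file and does not need to know which extractor ran; adding Word support becomes one more `register` call.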

Eric Anderson
LanRx Network Solutions


Quoting Ryan Ackley <sackley@cfl.rr.com>:

> Eric,
> 
> The problem with antiword is that it is a native application. You must
> write a class that uses JNI to access the native code. If you link your
> Java code with native code, you have lost one of the biggest benefits of
> Java, platform independence. I would suggest you use the library at
> http://textmining.org. Contrary to what David Spencer says, it should work
> on all documents created with Word 97 or above. I have literally indexed
> 100,000s of unique documents using my library.
> 
> Ryan Ackley
> 
> ----- Original Message -----
> From: "Eric Anderson" <Eric.Anderson@LanRx.com>
> To: "Lucene Users List" <lucene-user@jakarta.apache.org>
> Sent: Wednesday, March 05, 2003 7:14 PM
> Subject: Re: my experiences - Re: Parsing Word Docs
> 
> 
> > Ok. Thanks for the tip.
> >
> > I downloaded and compiled Antiword, and would now like to add it to my
> > indexing class. However, I'm not sure how the application would be
> > called, or from where it would be called.
> >
> > How will I have the class parse the document through Antiword to create
> > the keyword index, while leaving the DOC intact, as Mr. Litchfield did
> > with PDFBox?
> >
> > Your assistance is greatly appreciated.
> >
> > Eric Anderson
> > 815-505-6132
> >
> >
> > Quoting David Spencer <David.Spencer@micromuse.com>:
> >
> > > FYI, I tried the textmining.org/POI combo on a collection of 350 Word
> > > docs people have developed here over the years, and it failed on 33% of
> > > them with exceptions being thrown about the formats being invalid.
> > >
> > > I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free
> > > *.exe, and it worked great (well, it seemed to process all the files fine).
> > >
> > > I've had similar experiences with PDF - I tried the 3 or so
> > > freeware/Java PDF text extractors and they were not as good as the
> > > exe, pdftotext, from foolabs (http://www.foolabs.com/xpdf/).
> > >
> > > Not satisfying to a java developer but these work better than anything
> > > else I can find.
> > >
> > > You get source and I use them on windows & linux, no prob.
> > >
> > >
> > >
> > > Eric Anderson wrote:
> > >
> > > >I'm interested in using the textmining/textextraction utilities using
> > > >Apache POI, that Ryan was discussing. However, I'm having some difficulty
> > > >determining what the insertion point would be to replace the default
> > > >parser with the Word parser.
> > > >
> > > >Any assistance would be appreciated.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >LanRx Network Solutions, Inc.
> > > >Providing Enterprise Level Solutions...On A Small Business Budget
> > > >
> > > >---------------------------------------------------------------------
> > > >To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > >For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > > >
> > > >
> > > >
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> > >
> >
> 
> 
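
On the quoted question of how Antiword would be called from the indexing class: one option that does not involve JNI is to launch the compiled program as a child process and read the text it prints. The following is a minimal sketch under assumptions, not a tested integration: `ExternalTextExtractor` and `run` are hypothetical names, the `antiword` executable is assumed to be on the PATH and to write the extracted text to standard output, and UTF-8 is assumed for that output.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: runs a command-line extractor and returns what it prints.
class ExternalTextExtractor {

    /** Runs the command, captures its output, and fails on a non-zero exit code. */
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Fold stderr into stdout so the child cannot block on an unread stream.
        builder.redirectErrorStream(true);
        Process process = builder.start();

        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                captured.write(buffer, 0, n);
            }
        }

        int exitCode = process.waitFor();
        // Assumption: the tool's output is UTF-8; adjust if it is configured otherwise.
        String output = new String(captured.toByteArray(), StandardCharsets.UTF_8);
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode + ": " + output);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // Assumption: "antiword" is on the PATH; args[0] is the path to a .doc file.
        String text = run(Arrays.asList("antiword", args[0]));
        System.out.println(text.length() + " characters extracted");
    }
}
```

The returned string is what would be handed to the indexer as the document's contents, alongside the file path. The .doc file is only read by the external program, so it stays intact on disk. The trade-off raised earlier in the thread still applies: the index build then depends on a native binary being installed on each platform.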

LanRx Network Solutions, Inc.
Providing Enterprise Level Solutions...On A Small Business Budget
