lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gregor Heinrich" <gregor.heinr...@igd.fraunhofer.de>
Subject RE: Word Documents
Date Mon, 15 Dec 2003 14:48:08 GMT
Hi,

that's great info. In fact, I didn't check for fast-saving, yet. So I'll
probably go ahead an have a try later...

Good luck for POI,

Gregor



-----Original Message-----
From: Ryan Ackley [mailto:sackley@cfl.rr.com]
Sent: Monday, December 15, 2003 3:35 PM
To: Lucene Users List; heinrich@igd.fhg.de
Subject: Re: Word Documents


I have written a library located at http://textmining.org that will extract
text from Word documents. I am the author of the Word library in POI btw.
This is just a lightweight version because I got sick of everyone asking how
to extract text from a Word document. If it doesn't work its because the
document is *not* from Word 97 or later or the file was fast-saved.
Everytime somebody has problems they send me their files and they turn out
to be RTF or Word 95 documents. You can check the format by opening the file
in Word then going to Save As. The format of the document will be in the
"Save as Type" dropdown. At least in my version of Word it does.

-Ryan

----- Original Message -----
From: "Gregor Heinrich" <gregor.heinrich@igd.fraunhofer.de>
To: "'Lucene Users List'" <lucene-user@jakarta.apache.org>
Sent: Monday, December 15, 2003 9:19 AM
Subject: RE: Word Documents


> Hi,
>
> we had some problems using the POI Word filter. In one document set,
> everything would work fine, in another more than 50% documents refused to
> work with it (does not index). I am not an OLE2 pro and cannot see any
> apparent difference in the documents between the different sets. The
version
> used was Word 97 in almost all the docs. For the moment, I switched to a
> native converter (that does not process metadata and must be run using
> Runtime.exec(), though) until I have time to revisit the problem.
>
> I do not want to disrecommend the POI-filters, it's a very cool idea.
Please
> do try your particular document set with it. For a quick test, you can use
> the Docco personal search tool by Peter Becker and colleagues (available
> from SourceForge). It has a current version of POI included as a plugin
and
> Lucene running as indexing backend. So you don't have to write code to get
> answers...
>
> Cheers, gregor
>
> -----Original Message-----
> From: Pleasant, Tracy [mailto:tracy.pleasant@lmco.com]
> Sent: Monday, December 15, 2003 2:58 PM
> To: Lucene Users List
> Subject: Word Documents
>
>
> As a spinoff, I was wondering if anyone has been happy with indexing and
> searching Word docs. What about reading the contents? Any problems?
>
>
> -----Original Message-----
> From: Ryan Ackley [mailto:sackley@cfl.rr.com]
> Sent: Friday, December 12, 2003 5:59 PM
> To: Zhou, Oliver; Lucene Users List
> Subject: Re: textmining: document title
>
>
> Check out jakarta POI (http://jakarta.apache.org/poi ) particularly the
HPSF
> API. It allows you to extract metadata like Title, Author, etc. from OLE
> documents.
>
> -Ryan
>
> ----- Original Message -----
> From: "Zhou, Oliver" <Oliver.Zhou@cignabehavioral.com>
> To: <sackley@cfl.rr.com>
> Sent: Friday, December 12, 2003 5:26 PM
> Subject: textmining: document title
>
>
> > Ryan,
> >
> > I'm using textmining and lucene to index word documents but don't know
how
> > to get word document title.  Your advice on this matter is appreciated.
> >
> > Thanks,
> > Oliver Zhou
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>


----------------------------------------------------------------------------
----


> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message