lucene-java-user mailing list archives

From "Gregor Heinrich" <gregor.heinr...@igd.fraunhofer.de>
Subject RE: Word Documents
Date Mon, 15 Dec 2003 14:19:39 GMT
Hi,

We had some problems using the POI Word filter. On one document set
everything worked fine; on another, more than 50% of the documents failed
to index. I am not an OLE2 expert and cannot see any apparent difference
between the documents in the two sets. Word 97 was the version used in
almost all of the documents. For the moment, I have switched to a native
converter (which does not process metadata and must be run using
Runtime.exec(), though) until I have time to revisit the problem.
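
For anyone taking the same route, here is a minimal sketch of calling an
external converter from Java and capturing its text output. It uses
ProcessBuilder (the later replacement for a bare Runtime.exec() call) and
is not the code used here. The class name ConverterRunner, the command
line you pass in, and the assumption that the converter writes UTF-8 text
to standard output are all hypothetical; adjust them to your converter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs an external command and returns what it wrote to stdout.
final class ConverterRunner {

    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Send stdout to a temp file so a large document cannot fill the pipe
        // buffer and stall the child while we wait for it to finish.
        Path out = Files.createTempFile("converter-", ".txt");
        try {
            ProcessBuilder pb = new ProcessBuilder(command);
            pb.redirectOutput(out.toFile());
            // Keep converter diagnostics out of the extracted text.
            pb.redirectError(ProcessBuilder.Redirect.INHERIT);

            Process p = pb.start();
            p.getOutputStream().close(); // the converter gets no stdin

            // Do not let one broken document hang the whole indexing run.
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();
                throw new IOException("Converter timed out: " + command);
            }
            if (p.exitValue() != 0) {
                throw new IOException("Converter exited with code " + p.exitValue());
            }
            // Assumption: the converter emits UTF-8 text.
            return new String(Files.readAllBytes(out), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(out);
        }
    }
}
```

The returned string can then be added to a Lucene document as the body
field. Treating a timeout or a non-zero exit code as an exception lets the
indexer log the failing file and move on to the next one.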

I do not want to discourage anyone from using the POI filters; it's a very
cool idea. Please do try them on your particular document set. For a quick
test, you can use the Docco personal search tool by Peter Becker and
colleagues (available from SourceForge). It includes a current version of
POI as a plugin and uses Lucene as its indexing backend, so you don't have
to write code to get answers...

Cheers, gregor

-----Original Message-----
From: Pleasant, Tracy [mailto:tracy.pleasant@lmco.com]
Sent: Monday, December 15, 2003 2:58 PM
To: Lucene Users List
Subject: Word Documents


As a spinoff, I was wondering if anyone has been happy with indexing and
searching Word docs. What about reading the contents? Any problems?


-----Original Message-----
From: Ryan Ackley [mailto:sackley@cfl.rr.com]
Sent: Friday, December 12, 2003 5:59 PM
To: Zhou, Oliver; Lucene Users List
Subject: Re: textmining: document title


Check out Jakarta POI (http://jakarta.apache.org/poi), particularly the HPSF
API. It allows you to extract metadata like Title, Author, etc. from OLE
documents.

-Ryan

----- Original Message ----- 
From: "Zhou, Oliver" <Oliver.Zhou@cignabehavioral.com>
To: <sackley@cfl.rr.com>
Sent: Friday, December 12, 2003 5:26 PM
Subject: textmining: document title


> Ryan,
>
> I'm using textmining and Lucene to index Word documents but don't know how
> to get the Word document title.  Your advice on this matter is appreciated.
>
> Thanks,
> Oliver Zhou
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org



