lucene-java-user mailing list archives

From David Spencer <David.Spen...@micromuse.com>
Subject Re: my experiences - Re: Parsing Word Docs
Date Thu, 06 Mar 2003 18:32:54 GMT
Ryan Ackley wrote:

>David,
>
>The textmining.org stuff only works on Word97 and above. It should work with
>
Could be that we had pre-Word 97 docs, as some date from 1996, when we
(Lumos at least) were founded.

>no exceptions on any Word 97 doc. If you have any problems then it is from
>an earlier version (most likely Word 6.0) or it's not a Word document. If
>this isn't the case you need to email me so I can fix it and make it better
>for the benefit of everyone. I plan on adding support for Word 6 in the
>future.
>
>Ryan Ackley
>
>----- Original Message -----
>From: "David Spencer" <David.Spencer@micromuse.com>
>To: "Lucene Users List" <lucene-user@jakarta.apache.org>
>Sent: Wednesday, March 05, 2003 6:24 PM
>Subject: my experiences - Re: Parsing Word Docs
>
>>FYI, I tried the textmining.org/poi combo on a collection of 350 Word
>>docs people have developed here over the years, and it failed on 33% of them
>>with exceptions being thrown about the formats being invalid.
>>
>>I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free
>>*.exe, and it worked great (well, it seemed to process all the files fine).
>>
>>I've had similar experiences with PDF - I tried the 3 or so freeware/java
>>PDF text extractors and they were not as good as the exe, pdftotext,
>>from foolabs (http://www.foolabs.com/xpdf/).
>>
>>Not satisfying to a java developer but these work better than anything
>>else I can find.
>>
>>You get source and I use them on windows & linux, no prob.
>>
>>
>>
>>Eric Anderson wrote:
>>
>>>I'm interested in using the textmining/textextraction utilities using Apache
>>>POI, that Ryan was discussing. However, I'm having some difficulty determining
>>>what the insertion point would be to replace the default parser with the word
>>>parser.
>>>
>>>Any assistance would be appreciated.
>>>
>>>
>>>
>>>
>>>
>>>LanRx Network Solutions, Inc.
>>>Providing Enterprise Level Solutions...On A Small Business Budget
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>
>
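
For anyone wanting to try the external-exe route described in the quoted message, here is a minimal sketch of how it might be wired into Java. It assumes "antiword" and "pdftotext" are installed and on the PATH, that both write the extracted text to stdout ("antiword file.doc" and "pdftotext file.pdf -"), and that the output is UTF-8. The class and method names (ExternalTextExtractor, commandFor, extract) are made up for illustration and are not part of Lucene, POI, or textmining.org.

```java
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: shells out to a native extractor and returns its stdout.
class ExternalTextExtractor {

    // Pick the command line from the file extension.
    // Assumption: "antiword" and "pdftotext" are on the PATH.
    static List<String> commandFor(File file) {
        String name = file.getName().toLowerCase(Locale.ROOT);
        if (name.endsWith(".doc")) {
            return Arrays.asList("antiword", file.getPath());
        }
        if (name.endsWith(".pdf")) {
            // A trailing "-" asks pdftotext to write the text to stdout.
            return Arrays.asList("pdftotext", file.getPath(), "-");
        }
        throw new IllegalArgumentException("No extractor configured for " + file);
    }

    // Run the extractor and collect everything it prints on stdout.
    static String extract(File file) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(commandFor(file));
        // Let the tool's warnings go to our stderr instead of mixing into the text.
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }

        int exit = process.waitFor();
        if (exit != 0) {
            throw new IOException("Extractor exited with code " + exit + " for " + file);
        }
        // Assumption: the tool emits UTF-8; adjust to match the local setup.
        return out.toString("UTF-8");
    }
}
```

The string returned by extract(...) could then be handed to whatever code builds the Lucene Document. That step is left out here because it depends on the Lucene version in use.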

