lucene-java-user mailing list archives

From David Spencer <David.Spen...@micromuse.com>
Subject Re: my experiences - Re: Parsing Word Docs
Date Thu, 06 Mar 2003 18:31:49 GMT
Ryan Ackley wrote:

>Eric,
>
>The problem with antiword is that it is a native application. You must write
>a class that uses JNI to access the native code. 
>
No you don't. Just use Runtime.exec - no JNI :)
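
A minimal sketch of what I mean, not a definitive implementation. It uses
ProcessBuilder, the newer equivalent of Runtime.exec. It assumes an
"antiword" binary is on the PATH, that it prints the extracted text to stdout
when given a .doc path, and that the output is UTF-8. The class and method
names (ExternalTextExtractor, run) are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: runs an external text extractor and captures its output.
class ExternalTextExtractor {

    /** Runs the command and returns what it printed (stderr merged into stdout). */
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        // Merge stderr into stdout so the child cannot block on an unread pipe.
        pb.redirectErrorStream(true);
        Process p = pb.start();
        // We send no input, so close the child's stdin right away.
        p.getOutputStream().close();

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = p.getInputStream()) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
        }

        int exit = p.waitFor();
        if (exit != 0) {
            throw new IOException("exit code " + exit + " from " + command);
        }
        // Assumption: the tool emits UTF-8; adjust if your setup differs.
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: ExternalTextExtractor <file.doc>");
            return;
        }
        // Assumption: "antiword" is installed and reachable via the PATH.
        String text = run(Arrays.asList("antiword", args[0]));
        System.out.println(text);
    }
}
```

The returned string is what you would hand to your indexing class as the
document body. The .doc file itself is only read, so it stays intact.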

>If you link your java code
>with native code you have lost one of the biggest benefits of Java, platform
>
Yeah, but the source for antiword is available, it runs on all the platforms
I use (Windows/Linux/Sun), and it works better than anything else (it seems
to accept older formats than POI/textmining), so it gets the job done better.

>independence. I would suggest you use the library at http://textmining.org.
>Contrary to what David Spencer says, it should work on all documents created
>with Word 97 or above. I have literally indexed 100,000s of unique documents
>using my library.
>
>Ryan Ackley
>
>----- Original Message -----
>From: "Eric Anderson" <Eric.Anderson@LanRx.com>
>To: "Lucene Users List" <lucene-user@jakarta.apache.org>
>Sent: Wednesday, March 05, 2003 7:14 PM
>Subject: Re: my experiences - Re: Parsing Word Docs
>
>
>>Ok. Thanks for the tip.
>>
>>I downloaded and compiled Antiword, and would like to now add it to my
>>indexing class. However, I'm not sure how the application would be called,
>>and from where it would be called.
>>
>>How will I have the class parse the document through Antiword to create the
>>keyword index, but leaving the DOC intact, as Mr. Litchfield did with PDFBox?
>>
>>Your assistance is greatly appreciated.
>>
>>Eric Anderson
>>815-505-6132
>>
>>
>>Quoting David Spencer <David.Spencer@micromuse.com>:
>>
>>>FYI I tried the textmining.org/poi combo on a collection of 350 Word
>>>docs people have developed here over the years, and it failed on 33% of
>>>them with exceptions being thrown about the formats being invalid.
>>>
>>>I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free
>>>*.exe, and it worked great (well, it seemed to process all the files fine).
>>>
>>>I've had similar experiences with PDF - I tried the 3 or so freeware/java
>>>PDF text extractors and they were not as good as the exe, pdftotext,
>>>from foolabs (http://www.foolabs.com/xpdf/).
>>>
>>>Not satisfying to a java developer but these work better than anything
>>>else I can find.
>>>
>>>You get source and I use them on windows & linux, no prob.
>>>
>>>
>>>
>>>Eric Anderson wrote:
>>>
>>>>I'm interested in using the textmining/textextraction utilities using
>>>>Apache POI, that Ryan was discussing. However, I'm having some difficulty
>>>>determining what the insertion point would be to replace the default
>>>>parser with the word parser.
>>>>
>>>>Any assistance would be appreciated.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>LanRx Network Solutions, Inc.
>>>>Providing Enterprise Level Solutions...On A Small Business Budget
>>>>
>>>>---------------------------------------------------------------------
>>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>>