lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <David.Spen...@micromuse.com>
Subject Re: my experiences - Re: Parsing Word Docs
Date Thu, 06 Mar 2003 18:29:07 GMT
Eric Anderson wrote:

>Ok. Thanks for the tip.
>
>I downloaded and compiled Antiword, and would like to now add it to my indexing 
>class. However, I'm not sure how the application would be called, 
>

How? You exec passing the file name and it prints the ascii text to stdout.
This method takes the file name (e.g. "c:/dir1/dir2/foo.doc") and returns
the output from antitext as one big string:

    public static String getAntiText( String fn)
        throws Throwable
    {
        Process p = null;
        InputStream is = null;
        DataInputStream dis = null;
        try
        {
            p = rt.exec( new String[] { anti, fn});
            is = p.getInputStream();
            dis = new DataInputStream( is);
            String line;
            StringBuffer sb = new StringBuffer();
            while ( ( line = dis.readLine()) != null)
            {
                //o.println( "READ: " +line);
                sb.append( line);
                sb.append( " ");
            }
            return sb.toString();
        }
        finally
        {
            try { dis.close(); } catch( Throwable t) { }
            try { is.close(); } catch( Throwable t) { }           
            try { p.waitFor(); } catch( Throwable t) { }           
            try { p.destroy(); } catch( Throwable t) { }
        }
    }
    private static String anti = "c:/antiword/antiword.exe";

>and from 
>where it would be called.
>
 From where? If the file is a word doc e.g. name ends with ".doc".

>
>How will I have the class parse the document through Antiword to create the 
>keyword index, but leaving the DOC intact, as Mr. Litchfield did with PDFBox?
>
Hmmm not sure what the exact issue is but is this the answer:

doc.add( Field.Text( "contents", new StringReader( getAntiText( 
file_name_of_word_file))));

>
>Your assistance is greatly appreciated.
>
>Eric Anderson
>815-505-6132
>
>
>Quoting David Spencer <David.Spencer@micromuse.com>:
>
>  
>
>>FYI I tried the textmining.org/poi combo and on a collection of 350 word
>>docs people have developed here over the years, and it failed on 33% of
>>them
>>with exceptions being thrown about the formats being invalid.
>>
>>I tried "antiword" ( http://www.winfield.demon.nl/ ), a native & free 
>>*.exe, and
>>it worked great ( well it seemed to process all the files fine).
>>
>>I've had similar experiences with PDF - I tried the 3 or so 
>>freeware/java PDF
>>text extractors and they were not as good as the exe, pdftotext,
>>from foolabs (http://www.foolabs.com/xpdf/).
>>
>>Not satisfying to a java developer but these work better than anything 
>>else I can find.
>>
>>You get source and I use them on windows & linux, no prob.
>>
>>
>>
>>Eric Anderson wrote:
>>
>>    
>>
>>>I'm interested in using the textmining/textextraction utilities using Apache
>>>      
>>>
>>>POI, that Ryan was discussing. However, I'm having some difficulty
>>>      
>>>
>>determining 
>>    
>>
>>>what the insertion point would be to replace the default parser with the
>>>      
>>>
>>word 
>>    
>>
>>>parser. 
>>>
>>>Any assistance would be appreciated.
>>>
>>>
>>>
>>>
>>>
>>>LanRx Network Solutions, Inc.
>>>Providing Enterprise Level Solutions...On A Small Business Budget
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>>
>>> 
>>>
>>>      
>>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>    
>>
>
>LanRx Network Solutions, Inc.
>Providing Enterprise Level Solutions...On A Small Business Budget
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>  
>


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message