lucene-java-user mailing list archives

From Wayne Graham <wsg...@wm.edu>
Subject Re: Indexing MSword Documents
Date Fri, 08 Jun 2007 19:10:30 GMT
Jim,

There are a few things you can do to make extracting text easier on
yourself. Several libraries can help: POI and TextMining.org both have
excellent text extractors for Word.

As Mathieu suggests, you need to take a look at Document. Essentially,
you do everything you're already doing, and when it comes time to add your
content, you do something along these lines (this uses
TextMining.org's extractor):

String content = new WordExtractor().extractText(new FileInputStream(file));

doc.add(new Field("content", content, Field.Store.NO,
        Field.Index.TOKENIZED));

You should be able to edit the example code you're working with fairly
easily with the above.
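
On the point in the original message about the "content" field needing a reader: text that is already in memory can be wrapped in a java.io.StringReader, so no temporary file is needed even when an API insists on a Reader. The sketch below uses only the JDK. The class and method names (InMemoryText, asReader, readAll) are made up for illustration, and the hard-coded string stands in for whatever your extractor returns.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper, not part of Lucene, POI, or TextMining.org.
class InMemoryText {

    // Wrap already-extracted text in a Reader for APIs that require one.
    static Reader asReader(String extractedText) {
        return new StringReader(extractedText);
    }

    // Drain a Reader back into a String, to show nothing is lost on the way.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the String an extractor would hand back.
        String content = "Text pulled out of a Word file";
        try (Reader reader = asReader(content)) {
            // The round trip through the Reader preserves the text.
            System.out.println(readAll(reader).equals(content));
        }
    }
}
```

If you already hold the String, passing it straight to the Field constructor as shown above is the simpler route; the Reader wrapper only matters when the receiving code will not accept a String.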

HTH,
Wayne

jim shirreffs wrote:
> I looked at Nutch's code, but it is too complicated for me to follow.
> 
> I do not understand the guts of Lucene and how analyzers, parsers,
> readers, etc. all fit together. I suppose I will be forced to learn it
> all someday, but at the moment I am adhering to KISS: Keep It Simple, Stupid.
> 
> thanks for taking the time to reply
> 
> 
> jim s
> 
> ----- Original Message ----- From: "Mathieu Lecarme"
> <mathieu@garambrogne.net>
> To: <java-user@lucene.apache.org>
> Sent: Friday, June 08, 2007 12:48 PM
> Subject: Re: Indexing MSword Documents
> 
> 
> Why not use Document?
> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Document.html
> 
> HTMLDocument handles HTML-specific things such as encoding, headers, and
> other particulars.
> 
> Nutch uses dedicated Word tools
> (http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/msword/package-summary.html),
> but, IMHO, that's not the hardest part.
> 
> M.
> 
> On 8 June 07 at 19:23, jim shirreffs wrote:
> 
>> Hi,
>> I am trying to index msword documents. I've got things working, but I
>> do not think I am doing things properly.
>>
>> To index msword docs I use an extractor to extract the text. Then I 
>> write the text to a .txt file and index that using an HTMLDocument 
>> object. Seems to me that since I have the text I should be able to 
>> just do a
>>
>>        Doc.add("content", the_text_from_the_word_doc, ???, ???);
>>
>> But looking at Document.java, it seems the field "content" requires a
>> reader. So I write a temporary file to satisfy that requirement.
>>
>> What I would like to have is an MSWORDDocument class that would take
>> the extracted text as an argument to the constructor and create a
>> Document object that I could get.
>>
>> If anyone has any ideas, please let me know.
>>
>> Here is my code segment. Notice the msword hack,
>>
>>
>> /*
>> * make a document
>> */
>>
>> try
>> {
>>   if (ftype.startsWith("text"))
>>   {
>>      doc = HTMLDocument.Document(f);
>>   }
>>   else if (ftype.equals("application/pdf"))
>>   {
>>      doc = LucenePDFDocument.getDocument(f);
>>   }
>>   else if (ftype.equals("application/msword"))
>>   {
>>      FileInputStream fin = new FileInputStream(f.getAbsolutePath());
>>      WordExtractor extractor = new WordExtractor(fin);
>>      String content = extractor.getText();
>>      if(debug) System.out.println(content);
>>      String tempFileName=f.getAbsolutePath() + ".txt";
>>      BufferedWriter bw =
>>          new BufferedWriter(new FileWriter(tempFileName, false));
>>      bw.write(content);
>>      bw.close();
>>      File df = new File(tempFileName);
>>      doc = HTMLDocument.Document(df);
>>      df.delete();
>>   }
>>   else if (ftype.equals("binary"))
>>   {
>>      return null;
>>   }
>>   else
>>   {
>>      if(debug) System.out.println("Unknown file type not ascii or pdf.");
>>      doc = HTMLDocument.Document(f);
>>   }
>> }
>> catch(java.lang.InterruptedException ie)
>> {
>>   throw ie;
>> }
>> catch(java.io.IOException ioe)
>> {
>>   throw ioe;
>> }
>>
>>
>>
>>
>>
>> Thanks in advance
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> 

-- 
/**
 * Wayne Graham
 * Earl Gregg Swem Library
 * PO Box 8794
 * Williamsburg, VA 23188
 * 757.221.3112
 * http://swem.wm.edu/blogs/waynegraham/
 */

