lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "jim shirreffs" <j...@verizon.net>
Subject Re: IndexWriter.Optimize() is too slow and IOException! How Can I do?
Date Fri, 08 Jun 2007 17:18:42 GMT

I am trying to index msword documents. I’ve got things working but I do not 
think I am doing things properly.

To index msword docs I use an extractor to extract the text. Then I write 
the text to a .txt file and index that using an HTLMDocument object. Seems 
to me that since I have the text I should be able to just do a

        Doc.add("content", the_text_from_the_word_doc, ???, ???);

But looking at Document.java it seems the field "content" requires a reader. 
So I write a temporary file to satified that requirement.

What I would like to have is an MSWORDDocument class that would take the 
extracted text as a argument to the constructor and create a Ducument object 
that I could get.

If any one has any idea, please let me know.

Here is a code segment. Notice the msword hack,


/*

* make a document

*/

try

{

   if (ftype.startsWith("text"))

   {

      doc = HTMLDocument.Document(f);

   }

   else if (ftype.equals("application/pdf"))

   {

      doc = LucenePDFDocument.getDocument(f);

   }

   else if (ftype.equals("application/msword"))

   {

      FileInputStream fin = new FileInputStream(f.getAbsolutePath());

      WordExtractor extractor = new WordExtractor(fin);

      String content = extractor.getText();

      if(debug) System.out.println(content);

      String tempFileName=f.getAbsolutePath() + ".txt";

      BufferedWriter bw = new BufferedWriter(new FileWriter(tempFileName, 
false));

      bw.write((String) content.toString());

      bw.close();

      File df = new File(tempFileName);

      doc = HTMLDocument.Document(df);

      df.delete();

   }

   else if (ftype.equals("binary"))

   {

      return null;

   }

   else

   {

      if(debug) System.out.println("Unknown file type not ascii or pdf.");

      doc = HTMLDocument.Document(f);

   }

}

catch(java.lang.InterruptedException ie)

{

   throw ie;

}

catch(java.io.IOException ioe)

{

   throw ioe;

}





Thanks in advance


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message