lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "jim shirreffs" <j...@verizon.net>
Subject Indexing MSword Documents
Date Fri, 08 Jun 2007 17:23:27 GMT
Hi,
I am trying to index msword documents. I've got things working but I do not 
think I am doing things properly.

To index msword docs I use an extractor to extract the text. Then I write 
the text to a .txt file and index that using an HTMLDocument object. Seems 
to me that since I have the text I should be able to just do a

        Doc.add("content", the_text_from_the_word_doc, ???, ???);

But looking at Document.java it seems the field "content" requires a reader. 
So I write a temporary file to satified that requirement.

What I would like to have is an MSWORDDocument class that would take the 
extracted text as a argument to the constructor and create a Ducument object 
that I could get.

If any one has any idea, please let me know.

Here is my code segment. Notice the msword hack,


/*
* make a document
*/

try
{
   if (ftype.startsWith("text"))
   {
      doc = HTMLDocument.Document(f);
   }
   else if (ftype.equals("application/pdf"))
   {
      doc = LucenePDFDocument.getDocument(f);
   }
   else if (ftype.equals("application/msword"))
   {
      FileInputStream fin = new FileInputStream(f.getAbsolutePath());
      WordExtractor extractor = new WordExtractor(fin);
      String content = extractor.getText();
      if(debug) System.out.println(content);
      String tempFileName=f.getAbsolutePath() + ".txt";
      BufferedWriter bw = new BufferedWriter(new FileWriter(tempFileName, 
false));
      bw.write((String) content.toString());
      bw.close();
      File df = new File(tempFileName);
      doc = HTMLDocument.Document(df);
      df.delete();
   }
   else if (ftype.equals("binary"))
   {
      return null;
   }
   else
   {
      if(debug) System.out.println("Unknown file type not ascii or pdf.");
      doc = HTMLDocument.Document(f);
   }
}
catch(java.lang.InterruptedException ie)
{
   throw ie;
}
catch(java.io.IOException ioe)
{
   throw ioe;
}





Thanks in advance


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message