lucene-java-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: IndexWriter.Optimize() is too slow and IOException! How Can I do?
Date Fri, 08 Jun 2007 19:32:38 GMT
First, when asking a new question, it's best to start a new subject.
Your question has nothing to do with the rest of the thread....

That said, you want to create a Reader to pass along. I'd think about
doing this by subclassing your MSWord class from the Reader class
and providing the necessary implementations of its abstract read and
close methods.
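
As a rough sketch of that idea, assuming the extracted text is already
in memory as a String: the class name ExtractedTextReader below is
hypothetical and not part of Lucene or of the extractor library. It only
shows what a Reader subclass has to supply so that no temporary file is
needed.

```java
import java.io.IOException;
import java.io.Reader;

/**
 * Hypothetical Reader over text that has already been extracted
 * (for example from a Word document), so it can be handed to any
 * API that expects a java.io.Reader.
 */
class ExtractedTextReader extends Reader {
    private final String text;   // the extracted text
    private int pos = 0;         // index of the next character to return
    private boolean closed = false;

    ExtractedTextReader(String text) {
        if (text == null) {
            throw new IllegalArgumentException("text must not be null");
        }
        this.text = text;
    }

    /** Copies up to len characters into cbuf; returns -1 at end of text. */
    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (closed) {
            throw new IOException("reader is closed");
        }
        if (off < 0 || len < 0 || len > cbuf.length - off) {
            throw new IndexOutOfBoundsException();
        }
        if (len == 0) {
            return 0;
        }
        if (pos >= text.length()) {
            return -1;
        }
        int n = Math.min(len, text.length() - pos);
        text.getChars(pos, pos + n, cbuf, off);
        pos += n;
        return n;
    }

    /** Marks the reader closed; later reads throw IOException. */
    @Override
    public void close() {
        closed = true;
    }
}
```

If nothing beyond wrapping a String is needed, java.io.StringReader from
the standard library already behaves this way, so new StringReader(text)
may be all that is required. Whether a Reader can be passed directly when
building the field depends on the Lucene version in use, so check its
Field constructors.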

Best
Erick

On 6/8/07, jim shirreffs <jpsb@verizon.net> wrote:
>
>
> I am trying to index msword documents. I've got things working but I do
> not think I am doing things properly.
>
> To index msword docs I use an extractor to extract the text. Then I write
> the text to a .txt file and index that using an HTMLDocument object. Seems
> to me that since I have the text I should be able to just do a
>
>         Doc.add("content", the_text_from_the_word_doc, ???, ???);
>
> But looking at Document.java it seems the field "content" requires a
> reader. So I write a temporary file to satisfy that requirement.
>
> What I would like to have is an MSWORDDocument class that would take the
> extracted text as an argument to the constructor and create a Document
> object that I could get.
>
> If anyone has any ideas, please let me know.
>
> Here is a code segment. Notice the msword hack,
>
>
> /*
>  * make a document
>  */
> try
> {
>    if (ftype.startsWith("text"))
>    {
>       doc = HTMLDocument.Document(f);
>    }
>    else if (ftype.equals("application/pdf"))
>    {
>       doc = LucenePDFDocument.getDocument(f);
>    }
>    else if (ftype.equals("application/msword"))
>    {
>       // msword hack: extract the text, write it to a temporary .txt file,
>       // index that file with HTMLDocument, then delete the file.
>       FileInputStream fin = new FileInputStream(f.getAbsolutePath());
>       WordExtractor extractor = new WordExtractor(fin);
>       String content = extractor.getText();
>       fin.close(); // release the input file once the text is extracted
>       if (debug) System.out.println(content);
>       String tempFileName = f.getAbsolutePath() + ".txt";
>       BufferedWriter bw = new BufferedWriter(new FileWriter(tempFileName, false));
>       bw.write(content);
>       bw.close();
>       File df = new File(tempFileName);
>       doc = HTMLDocument.Document(df);
>       df.delete();
>    }
>    else if (ftype.equals("binary"))
>    {
>       return null;
>    }
>    else
>    {
>       if (debug) System.out.println("Unknown file type not ascii or pdf.");
>       doc = HTMLDocument.Document(f);
>    }
> }
> catch (java.lang.InterruptedException ie)
> {
>    throw ie;
> }
> catch (java.io.IOException ioe)
> {
>    throw ioe;
> }
>
> Thanks in advance
