From: "jim shirreffs"
To: java-user@lucene.apache.org
Subject: Indexing MSword Documents
Date: Fri, 08 Jun 2007 12:23:27 -0500

Hi,

I am trying to index MS Word documents. I've got things working, but I do
not think I am doing things properly.

To index Word docs I use an extractor to extract the text. Then I write the
text to a .txt file and index that using an HTMLDocument object. It seems to
me that since I already have the text, I should be able to just do a

    Doc.add("content", the_text_from_the_word_doc, ???, ???);

But looking at Document.java, it seems the field "content" requires a
Reader, so I write a temporary file to satisfy that requirement.

What I would like to have is an MSWORDDocument class that would take the
extracted text as an argument to the constructor and create a Document
object that I could get.

If anyone has any ideas, please let me know. Here is my code segment.
Notice the msword hack:

    /*
     * make a document
     */
    try {
        if (ftype.startsWith("text")) {
            doc = HTMLDocument.Document(f);
        } else if (ftype.equals("application/pdf")) {
            doc = LucenePDFDocument.getDocument(f);
        } else if (ftype.equals("application/msword")) {
            // Extract the plain text from the Word file.
            FileInputStream fin = new FileInputStream(f.getAbsolutePath());
            WordExtractor extractor = new WordExtractor(fin);
            String content = extractor.getText();
            fin.close();
            if (debug) System.out.println(content);

            // The hack: write the text to a temporary .txt file so that
            // HTMLDocument can read it, then delete the file again.
            String tempFileName = f.getAbsolutePath() + ".txt";
            BufferedWriter bw = new BufferedWriter(new FileWriter(tempFileName, false));
            bw.write(content);
            bw.close();
            File df = new File(tempFileName);
            doc = HTMLDocument.Document(df);
            df.delete();
        } else if (ftype.equals("binary")) {
            return null;
        } else {
            if (debug) System.out.println("Unknown file type not ascii or pdf.");
            doc = HTMLDocument.Document(f);
        }
    } catch (java.lang.InterruptedException ie) {
        throw ie;
    } catch (java.io.IOException ioe) {
        throw ioe;
    }

Thanks in advance
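
A note on the temp-file step: if the only obstacle is that something wants a Reader, the extracted String can in principle be wrapped in a java.io.StringReader without touching the disk. The sketch below uses only the JDK. The class name InMemoryReaderSketch and the drain() helper are made up for illustration; drain() merely stands in for whatever code consumes the Reader. Whether and how a Lucene field accepts a Reader (or a plain String) depends on the Lucene version in use, so that part is an assumption to check against the Field Javadoc.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class InMemoryReaderSketch {

    /** Wraps already-extracted text in a Reader; no temporary file is involved. */
    public static Reader asReader(String extractedText) {
        return new StringReader(extractedText);
    }

    /**
     * Hypothetical stand-in for any consumer that insists on a Reader.
     * It reads everything and returns it, to show the text arrives intact.
     */
    public static String drain(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        reader.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // In the real code this String would come from extractor.getText().
        String content = "text extracted from a Word document";
        Reader reader = asReader(content);
        // With Lucene on the classpath, this Reader (or the String itself)
        // would be handed to a field of the Document instead of drain();
        // the exact constructor is version-dependent (assumption).
        System.out.println(drain(reader));
    }
}
```

An MSWORDDocument-style helper could then take the extracted text in its constructor and build the Document from it directly, rather than going through HTMLDocument and a .txt file.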