lucene-java-user mailing list archives

From "jim shirreffs" <j...@verizon.net>
Subject Re: Indexing MSword Documents
Date Fri, 08 Jun 2007 18:35:42 GMT
I looked at Nutch's code, but it is too complicated for me to follow.

I do not understand the guts of Lucene and how analyzers, parsers, readers,
etc. all fit together. I suppose I will be forced to learn it all someday, but
at the moment I am adhering to KISS: Keep It Simple, Stupid.

Thanks for taking the time to reply.


jim s

----- Original Message ----- 
From: "Mathieu Lecarme" <mathieu@garambrogne.net>
To: <java-user@lucene.apache.org>
Sent: Friday, June 08, 2007 12:48 PM
Subject: Re: Indexing MSword Documents


Why not use Document directly?
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/document/Document.html

HTMLDocument handles HTML-specific details such as encoding, headers,
and the like.

Nutch uses dedicated Word tools
(http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/msword/package-summary.html),
but, IMHO, that's not the most difficult part.
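
The original question below assumes that a field needing a Reader forces
a temporary file. A minimal sketch, using only the JDK, of handing text
that is already in memory to a Reader-based consumer: the class name
ExtractedTextReader and the method readAll are hypothetical stand-ins,
not part of Lucene or of any extractor library. Lucene releases of this
period also appear to offer Field constructors that take a String value
directly, so check the Field javadoc for your version before relying on
either route.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical helper for illustration; not part of Lucene or POI.
class ExtractedTextReader {

    // Wrap text already held in memory as a Reader. No file is created.
    static Reader asReader(String extractedText) {
        return new StringReader(extractedText);
    }

    // Stand-in for any consumer that insists on a Reader (for example an
    // indexing field). It drains the reader back into a String and closes it.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        reader.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // In the real program this would be the text returned by the extractor.
        String content = "text extracted from a Word document";
        Reader reader = asReader(content);
        System.out.println(readAll(reader));
    }
}
```

With this approach the write-then-delete step for the .txt file is no
longer needed, since the Reader is backed by the String itself.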

M.

On 8 June 07 at 19:23, jim shirreffs wrote:

> Hi,
> I am trying to index MS Word documents. I've got things working, but I do
> not think I am doing things properly.
>
> To index MS Word docs I use an extractor to extract the text. Then I write
> the text to a .txt file and index that using an HTMLDocument object.
> It seems to me that since I have the text I should be able to just do a
>
>        Doc.add("content", the_text_from_the_word_doc, ???, ???);
>
> But looking at Document.java it seems the field "content" requires a
> reader. So I write a temporary file to satisfy that requirement.
>
> What I would like to have is an MSWORDDocument class that would take the
> extracted text as an argument to the constructor and create a Document
> object that I could get.
>
> If anyone has any ideas, please let me know.
>
> Here is my code segment. Notice the msword hack,
>
>
> /*
> * make a document
> */
>
> try
> {
>   if (ftype.startsWith("text"))
>   {
>      doc = HTMLDocument.Document(f);
>   }
>   else if (ftype.equals("application/pdf"))
>   {
>      doc = LucenePDFDocument.getDocument(f);
>   }
>   else if (ftype.equals("application/msword"))
>   {
>      FileInputStream fin = new FileInputStream(f.getAbsolutePath());
>      WordExtractor extractor = new WordExtractor(fin);
>      String content = extractor.getText();
>      fin.close(); // text is extracted; release the file handle
>      if(debug) System.out.println(content);
>      String tempFileName=f.getAbsolutePath() + ".txt";
>      BufferedWriter bw = new BufferedWriter(new FileWriter(tempFileName, false));
>      bw.write(content); // content is already a String; no cast needed
>      bw.close();
>      File df = new File(tempFileName);
>      doc = HTMLDocument.Document(df);
>      df.delete();
>   }
>   else if (ftype.equals("binary"))
>   {
>      return null;
>   }
>   else
>   {
>      if(debug) System.out.println("Unknown file type not ascii or pdf.");
>      doc = HTMLDocument.Document(f);
>   }
> }
> catch(java.lang.InterruptedException ie)
> {
>   throw ie;
> }
> catch(java.io.IOException ioe)
> {
>   throw ioe;
> }
>
>
>
>
>
> Thanks in advance
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

