lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "jim shirreffs" <j...@verizon.net>
Subject Re: Indexing MSword Documents
Date Sat, 09 Jun 2007 14:55:55 GMT
thanks the apprach you and Donna Gresh suggested worked out fine. I now have 
a much better understanding of the Document class.


here is the create Document code in case another newie is interested. as 
more mine types are added I will expand the in if

thanks again

jim s


public class KcmiDocument
{
 String objectId;
 String title;
 String mtype;

 private boolean debug=true;

 public KcmiDocument(String oId, String t, String f)
 {
  objectId=oId;
  title = t;
  mtype=f;
  if(debug) System.out.println("In KcmiDocument with " + oId + ":" + t + ":" 
+ f);
 }

 /*
  * Makes a document for a file.
  * The document has four custom fields:
  * path, modified date, title and objectId
  */
 public Document document(File f) throws java.io.FileNotFoundException,
  java.io.IOException, java.lang.InterruptedException
 {
  if(debug) System.out.println("In KcmiDocument.document with " + 
f.getAbsolutePath());

  Document doc = null;

  /*
   * make a document
   */
  try
  {
   if (mtype.startsWith("text"))
   {
    doc = HTMLDocument.Document(f);
   }
   else if (mtype.equals("application/pdf"))
   {
    doc = LucenePDFDocument.getDocument(f);
   }
   else if (mtype.equals("application/msword"))
   {
    FileInputStream fin = new FileInputStream(f.getAbsolutePath());
    WordExtractor extractor = new WordExtractor(fin);
    String content = extractor.getText();

    if(debug) System.out.println(content);

    doc = new Document();
    doc.add(new Field("contents", content, Field.Store.NO, 
Field.Index.TOKENIZED));
   }
   else if (mtype.equals("???")) //binary
   {
    throw new java.io.IOException(f.getName() +
     " mime type can not be determined unable to extract doument content");
   }
   else
   {
    if(debug) System.out.println("Unknown using trying htmldocument.");
    doc = HTMLDocument.Document(f);
   }
  }
  catch(java.lang.InterruptedException ie)
  {
   throw ie;
  }
  catch(java.io.IOException ioe)
  {
   throw ioe;
  }

  /*
   * Add the path of the file as a field named "path".  Use a field that is
   * indexed (i.e. searchable), but don't tokenize the field into words.
   */
  doc.add(new Field("path", f.getPath(), Field.Store.YES, 
Field.Index.UN_TOKENIZED));

  /*
   * Add ObjectId
   */
  doc.add(new Field("objectId",
   objectId,
   Field.Store.YES,
   Field.Index.UN_TOKENIZED));

  /*
   * Add title
   */
  doc.add(new Field("kcmititle",
   title,
   Field.Store.YES,
   Field.Index.UN_TOKENIZED));

  /*
   * return the document
   */
  return doc;
 }
}
----- Original Message ----- 
From: "Wayne Graham" <wsgrah@wm.edu>
To: <java-user@lucene.apache.org>
Sent: Friday, June 08, 2007 2:10 PM
Subject: Re: Indexing MSword Documents


> Jim,
>
> There are a few things you can do to make extracting text easier on
> yourself. There are several libraries that can assist you, POI and
> TextMining.org both have excellent text extractors for Word.
>
> As Mathieu suggests, you need to take a look at Document. Essentially,
> you do everything you're doing and when it gets time to insert your
> content, you'll do something along the lines of (this is using
> TextMining.org's extractor):
>
> String content = new WordExtractor().extractText(new 
> FileInputStream(file));
>
> doc.add(new Field("content", content, Field.Store.NO,
> Field.Index.TOKENIZED));
>
> You should be able to edit the example code you're working with fairly
> easily with the above.
>
> HTH,
> Wayne
>
> jim shirreffs wrote:
>> I looked at nutches code but it is too complicated for me to follow.
>>
>> I do not understand the guts of Lucene and how analyzers, parsers,
>> readers, etc all fit together. I suppose I will be forced to learn it
>> all someday but at the moment I am adhering to KISS, Keep It Simple 
>> Stupid.
>>
>> thanks for taking the time to reply
>>
>>
>> jim s
>>
>> ----- Original Message ----- From: "Mathieu Lecarme"
>> <mathieu@garambrogne.net>
>> To: <java-user@lucene.apache.org>
>> Sent: Friday, June 08, 2007 12:48 PM
>> Subject: Re: Indexing MSword Documents
>>
>>
>> Why don't use Document?
>> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/
>> org/apache/lucene/document/Document.html
>>
>> HTMLDocument manage HTML stuff like encoding, header, and other
>> specificity.
>>
>> Nutch use specific word tools (http://lucene.apache.org/nutch/apidocs/
>> org/apache/nutch/parse/msword/package-summary.html), but, IMHO, it's
>> not the more difficult part.
>>
>> M.
>>
>> Le 8 juin 07 à 19:23, jim shirreffs a écrit :
>>
>>> Hi,
>>> I am trying to index msword documents. I've got things working but  I
>>> do not think I am doing things properly.
>>>
>>> To index msword docs I use an extractor to extract the text. Then I
>>> write the text to a .txt file and index that using an HTMLDocument
>>> object. Seems to me that since I have the text I should be able to
>>> just do a
>>>
>>>        Doc.add("content", the_text_from_the_word_doc, ???, ???);
>>>
>>> But looking at Document.java it seems the field "content" requires  a
>>> reader. So I write a temporary file to satified that requirement.
>>>
>>> What I would like to have is an MSWORDDocument class that would  take
>>> the extracted text as a argument to the constructor and create  a
>>> Ducument object that I could get.
>>>
>>> If any one has any idea, please let me know.
>>>
>>> Here is my code segment. Notice the msword hack,
>>>
>>>
>>> /*
>>> * make a document
>>> */
>>>
>>> try
>>> {
>>>   if (ftype.startsWith("text"))
>>>   {
>>>      doc = HTMLDocument.Document(f);
>>>   }
>>>   else if (ftype.equals("application/pdf"))
>>>   {
>>>      doc = LucenePDFDocument.getDocument(f);
>>>   }
>>>   else if (ftype.equals("application/msword"))
>>>   {
>>>      FileInputStream fin = new FileInputStream(f.getAbsolutePath());
>>>      WordExtractor extractor = new WordExtractor(fin);
>>>      String content = extractor.getText();
>>>      if(debug) System.out.println(content);
>>>      String tempFileName=f.getAbsolutePath() + ".txt";
>>>      BufferedWriter bw = new BufferedWriter(new FileWriter
>>> (tempFileName, false));
>>>      bw.write((String) content.toString());
>>>      bw.close();
>>>      File df = new File(tempFileName);
>>>      doc = HTMLDocument.Document(df);
>>>      df.delete();
>>>   }
>>>   else if (ftype.equals("binary"))
>>>   {
>>>      return null;
>>>   }
>>>   else
>>>   {
>>>      if(debug) System.out.println("Unknown file type not ascii or
>>> pdf.");
>>>      doc = HTMLDocument.Document(f);
>>>   }
>>> }
>>> catch(java.lang.InterruptedException ie)
>>> {
>>>   throw ie;
>>> }
>>> catch(java.io.IOException ioe)
>>> {
>>>   throw ioe;
>>> }
>>>
>>>
>>>
>>>
>>>
>>> Thanks in advance
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> -- 
> /**
> * Wayne Graham
> * Earl Gregg Swem Library
> * PO Box 8794
> * Williamsburg, VA 23188
> * 757.221.3112
> * http://swem.wm.edu/blogs/waynegraham/
> */
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message