lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "LuceneFAQ" by HossMan
Date Sat, 03 Jan 2009 05:32:01 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by HossMan:
http://wiki.apache.org/lucene-java/LuceneFAQ

The comment on the change is:
be more explicit in question for people who skim instead of search

------------------------------------------------------------------------------
  See article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsing, indexing, and searching XML with Digester and Lucene].
  
  
- ==== How can I index file formats like OpenDocument, MS-Word, MS-Excel, etc? ====
+ ==== How can I index file formats like OpenDocument (aka OpenOffice.org), Microsoft Word, Excel, PowerPoint, Visio, etc? ====
  
  Have a look at [http://lucene.apache.org/tika/ Tika, the content analysis toolkit].
  
- Some background information: Many modern office file formats (.odt, .sxw, .sxc, etc) are
ZIP archives that contain XML files. You can uncompress the file using Java's ZIP support,
then parse e.g. meta.xml to get the title and e.g. content.xml to get the document's content.
You can then add these to the Lucene index, typically using one Lucene field per property.
+ Alternately: Many modern office file formats (.odt, .sxw, .sxc, etc) are ZIP archives that
contain XML files. You can uncompress the file using Java's ZIP support, then parse e.g. meta.xml
to get the title and e.g. content.xml to get the document's content. You can then add these
to the Lucene index, typically using one Lucene field per property.
  
  You can also use the LIUS framework for indexing !OpenOffice.org documents (http://www.bibl.ulaval.ca/lius/).
LIUS allows metadata and fulltext indexing, using XPath.
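
The ZIP-and-XML approach described in the changed paragraph above could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the wiki page: the class name `OdfFieldExtractor`, its methods, and the field names `title` and `content` are hypothetical, and the namespace URIs are the ones used by OpenDocument files (older .sxw/.sxc files may need different ones). To stay free of external dependencies it stops at a map of field name to text instead of calling Lucene.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: pulls the title and body text out of an OpenDocument archive.
public class OdfFieldExtractor {
    // Assumed namespaces: Dublin Core (dc:title) and OpenDocument text (text:p).
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Returns field name -> text; each entry is meant to become one Lucene field. */
    public static Map<String, String> extractFields(InputStream odfStream) {
        Map<String, String> fields = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(odfStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("meta.xml")) {
                    fields.put("title", joinText(parse(readEntry(zip)), DC_NS, "title"));
                } else if (entry.getName().equals("content.xml")) {
                    fields.put("content", joinText(parse(readEntry(zip)), TEXT_NS, "p"));
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not read the archive", e);
        }
        return fields;
    }

    // Copy the current entry into memory first, because some XML parsers
    // close the stream they are handed, which would end the ZIP iteration.
    private static byte[] readEntry(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    private static Document parse(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        return factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));
    }

    // Joins the text of every matching element, one per line.
    private static String joinText(Document doc, String ns, String localName) {
        NodeList nodes = doc.getElementsByTagNameNS(ns, localName);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (i > 0) {
                sb.append('\n');
            }
            sb.append(nodes.item(i).getTextContent());
        }
        return sb.toString();
    }

    // Builds a tiny in-memory archive with simplified XML, for demonstration only.
    static byte[] sampleArchive() {
        String meta = "<m xmlns:dc=\"" + DC_NS + "\"><dc:title>Hello</dc:title></m>";
        String content = "<c xmlns:text=\"" + TEXT_NS + "\">"
                + "<text:p>First</text:p><text:p>Second</text:p></c>";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("meta.xml"));
            zip.write(meta.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String> fields = extractFields(new ByteArrayInputStream(sampleArchive()));
        fields.forEach((name, text) -> System.out.println(name + ": " + text));
    }
}
```

Each map entry would then be added to a Lucene document as its own field. The exact field class and constructor depend on the Lucene version in use, so that step is left out here. For real documents, passing a `FileInputStream` for the .odt file in place of the sample archive should work the same way.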
  
