lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "LuceneFAQ" by DanielNaber
Date Wed, 31 Dec 2008 16:39:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by DanielNaber:
http://wiki.apache.org/lucene-java/LuceneFAQ

The comment on the change is:
link tika

------------------------------------------------------------------------------
  See article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsing, indexing, and searching XML with Digester and Lucene].
  
  
- ==== How can I index OpenOffice.org files? ====
+ ==== How can I index file formats like OpenDocument, MS-Word, MS-Excel, etc? ====
  
+ Have a look at [http://lucene.apache.org/tika/ Tika, the content analysis toolkit].
- These files (.sxw, .sxc, etc) are ZIP archives that contain XML files. Uncompress
- the file using Java's ZIP support, then parse meta.xml to get title etc.
- and content.xml to get the document's content. Add these to the Lucene index,
- typically using one Lucene field per property.
  
+ Some background information: Many modern office file formats (.odt, .sxw, .sxc, etc) are ZIP archives that contain XML files. You can uncompress the file using Java's ZIP support, then parse e.g. meta.xml to get the title and e.g. content.xml to get the document's content. You can then add these to the Lucene index, typically using one Lucene field per property.
- Note that this applies to !OpenOffice.org 1.x, things have changed a bit for !OpenOffice.org
- 2.x, but the basic approach is still the same.
  
- You can also use LIUS framework for indexing OpenOffice documents(http://www.bibl.ulaval.ca/lius/). LIUS allow metadata and fulltext indexing, using XPath.
+ You can also use LIUS framework for indexing !OpenOffice.org documents (http://www.bibl.ulaval.ca/lius/). LIUS allows metadata and fulltext indexing, using XPath.
  
+ For MS-Word, MS-Excel, MS-Visio, and MS-Powerpoint you might also want to take a look at [http://poi.apache.org Apache POI].
- 
- ==== How can I index MS-Word documents? ====
- 
- In order to index Word documents you need to first parse them to extract text that you want to index from them.  Here are some Word parsers that can help you with that:
- 
- [http://poi.apache.org/hwpf/ Apache POI] has an early development level Microsoft Word parser for versions of Word from Office 97, 2000, and XP.
- 
- ==== How can I index MS-Excel documents? ====
- 
- In order to index Excel documents you need to first parse them to extract text that you want to index from them.  Here are some Excel parsers that can help you with that:
- 
- [http://poi.apache.org/hssf/ Apache POI] has an excellent Microsoft Excel parser for versions of Excel from Office 97, 2000, and XP.  You can also modify Excel files with this tool.
- 
- ==== How can I index MS-Powerpoint documents? ====
- 
- In order to index Powerpoint documents you need to first parse them to extract text that you want to index from them.  You can use the [http://poi.apache.org/hslf/  Apache POI], as it contains a parser for Powerpoint documents.
- 
- ==== How can I index MS-Visio documents? ====
- 
- In order to index Visio documents you need to first parse them to extract text that you want to index from them.  You can use the [http://poi.apache.org/hdgf/ Apache POI], as it contains a parser for Visio documents.
- 
  
  ==== How can I index Email (from MS-Exchange or another IMAP server) ? ====
  

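For readers who want to see the ZIP-plus-XML approach from the background paragraph in the diff above, here is a minimal sketch in Java. It is an illustration under assumptions and not code from the FAQ: the class name `OdfTextExtractor`, its helper methods, and the field names "title" and "content" are hypothetical. It also assumes that the title lives in an element whose local name is `title` and that paragraphs and headings use the local names `p` and `h`. The sketch uses only the JDK (java.util.zip and the built-in DOM parser) and stops short of the Lucene step.

```java
import java.io.File;
import java.io.InputStream;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical helper: pulls a title and the body text out of a ZIP-based office file. */
final class OdfTextExtractor {

    /** Parses one XML entry of the archive; returns null if the entry is absent. */
    static Document parseEntry(ZipFile zip, String entryName) throws Exception {
        ZipEntry entry = zip.getEntry(entryName);
        if (entry == null) {
            return null;
        }
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Older files may declare a DTD that is not available locally;
        // resolve every external entity to an empty document instead of fetching it.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        try (InputStream in = zip.getInputStream(entry)) {
            return builder.parse(in);
        }
    }

    /** Appends all text below node, adding a newline after each paragraph or heading. */
    static void collectText(Node node, StringBuilder out) {
        if (node instanceof Text) {
            out.append(node.getNodeValue());
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
        // Assumption: paragraphs and headings have the local names "p" and "h".
        String localName = node.getLocalName();
        if ("p".equals(localName) || "h".equals(localName)) {
            out.append('\n');
        }
    }

    /** Returns one map entry per extracted property (keys "title" and "content"). */
    static Map<String, String> extractFields(File officeFile) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(officeFile)) {
            Document meta = parseEntry(zip, "meta.xml");
            if (meta != null) {
                // "*" matches any namespace, so this does not depend on a specific URI.
                NodeList titles = meta.getElementsByTagNameNS("*", "title");
                if (titles.getLength() > 0) {
                    fields.put("title", titles.item(0).getTextContent());
                }
            }
            Document content = parseEntry(zip, "content.xml");
            if (content != null) {
                StringBuilder text = new StringBuilder();
                collectText(content, text);
                fields.put("content", text.toString());
            }
        }
        return fields;
    }
}
```

Each entry of the returned map could then become one Lucene field, as the FAQ text suggests. For anything beyond simple documents, the Tika and POI libraries linked in the diff are likely the more robust choice.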