lucene-dev mailing list archives

From lucene-...@jakarta.apache.org
Subject [Jakarta Lucene Wiki] Updated: LuceneFAQ
Date Thu, 30 Dec 2004 21:19:03 GMT
   Date: 2004-12-30T13:19:03
   Editor: DanielNaber
   Wiki: Jakarta Lucene Wiki
   Page: LuceneFAQ
   URL: http://wiki.apache.org/jakarta-lucene/LuceneFAQ

   no comment

Change Log:

------------------------------------------------------------------------------
@@ -445,6 +445,17 @@
 See article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsing, indexing, and
searching XML with Digester and Lucene].
 
 
+==== How can I index OpenOffice.org files? ====
+
+These files (.sxw, .sxc, etc.) are ZIP archives that contain XML files. Uncompress
+the file using Java's ZIP support, then parse meta.xml to get the title and other
+metadata, and content.xml to get the document's content. Add these to the Lucene index,
+typically using one Lucene field per property.
+
+Note that this applies to OpenOffice.org 1.x. Things might change a bit for OpenOffice.org
+2.x, but the basic approach will still be the same.
+
+
 ==== How can I index MS-Word documents? ====
 
 In order to index Word documents you need to first parse them to extract text that you want
to index from them.  Here are some Word parsers that can help you with that:
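
For the OpenOffice.org entry added in this change, the unzip-and-parse steps might look roughly like the sketch below. It is only an illustration under some assumptions: the class and method names (OpenOfficeTextExtractor, extractText, extractFields) are made up for this example, it uses only the JDK's ZIP and SAX support, and it collects all character data rather than picking out individual properties such as the title. The returned map is keyed by entry name; each value would then be added to a Lucene Document as its own field, which is left out here so the code has no Lucene dependency.

```java
import java.io.InputStream;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Lucene or OpenOffice.org.
public class OpenOfficeTextExtractor {

    /** Collects all character data from an XML stream as one whitespace-normalized string. */
    static String extractText(InputStream xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Separate the text of adjacent elements (e.g. paragraphs) with a space.
                text.append(' ');
            }

            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                // The XML files may reference a DTD that is not shipped in the archive;
                // substitute an empty one so the parser does not try to load it.
                return new InputSource(new StringReader(""));
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(xml, handler);
        return text.toString().replaceAll("\\s+", " ").trim();
    }

    /** Opens the ZIP archive and extracts the text of meta.xml and content.xml, if present. */
    static Map<String, String> extractFields(String path) throws Exception {
        Map<String, String> fields = new LinkedHashMap<>();
        try (ZipFile zip = new ZipFile(path)) {
            for (String name : new String[] {"meta.xml", "content.xml"}) {
                ZipEntry entry = zip.getEntry(name);
                if (entry == null) {
                    continue; // entry missing, skip it
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    fields.put(name, extractText(in));
                }
            }
        }
        return fields;
    }
}
```

Calling OpenOfficeTextExtractor.extractFields("report.sxw") (the file name is a placeholder) returns the text of both entries. Inserting a space at every element end is a simplification: it keeps paragraphs apart but can also split a word that is partly wrapped in inline markup, so a real indexer would probably handle element names more selectively.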

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

