lucene-java-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Lucene-java Wiki] Update of "LuceneFAQ" by HossMan
Date Sat, 03 Jan 2009 05:45:31 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-java Wiki" for change notification.

The following page has been changed by HossMan:
http://wiki.apache.org/lucene-java/LuceneFAQ

The comment on the change is:
more tika refs, and consolidate RTF

------------------------------------------------------------------------------
  
  ==== How can I index HTML documents? ====
  
- In order to index HTML documents you need to first parse them to extract text that you want
to index from them.  Here are some HTML parsers that can help you with that:
+ In order to index HTML documents you need to first parse them to extract text that you want
to index from them.  Have a look at [http://lucene.apache.org/tika/ Tika, the content analysis
toolkit].
+ 
+ Alternately...
  
  An example that uses JavaCC to parse HTML into Lucene Document  objects is provided in the
[http://lucene.apache.org/java/docs/demo3.html Lucene web application demo] that comes with
the Lucene distribution.
  
@@ -652, +654 @@

  
  ==== How can I index XML documents? ====
  
- In order to index XML documents you need to first parse them to extract text that you want
to index from them.  Here are some XML parsers that can help you with that:
+ In order to index XML documents you need to first parse them to extract text that you want
to index from them.  Have a look at [http://lucene.apache.org/tika/ Tika, the content analysis
toolkit].
  
- See article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsing, indexing,
and searching XML with Digester and Lucene].
+ See also this article [http://www-106.ibm.com/developerworks/library/j-lucene/ Parsing,
indexing, and searching XML with Digester and Lucene].
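
The "parse first, then index" step for XML can be sketched with only the JDK's built-in DOM parser. This is a minimal illustration, not the approach Tika or the Digester article takes: the class name `XmlTextExtractor` and the method `extractText` are hypothetical, and the sketch simply concatenates all text nodes under the root element. The resulting string is what you would then place in a Lucene field.

```java
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper: pulls the plain text out of an XML stream so it can
// be handed to an indexer. Uses only JDK classes.
public class XmlTextExtractor {

    public static String extractText(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(xml);
        // getTextContent() concatenates the text of all descendant nodes,
        // dropping element names and attributes.
        return doc.getDocumentElement().getTextContent();
    }
}
```

In practice you would usually map individual elements to separate fields rather than flattening the whole document; flattening is used here only to keep the sketch short.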
  
  
- ==== How can I index file formats like OpenDocument (aka OpenOffice.org), Microsoft Word,
Excel, PowerPoint, Visio, etc? ====
+ ==== How can I index file formats like OpenDocument (aka OpenOffice.org), RTF, Microsoft
Word, Excel, PowerPoint, Visio, etc? ====
  
  Have a look at [http://lucene.apache.org/tika/ Tika, the content analysis toolkit].
  
@@ -666, +668 @@

  You can also use LIUS framework for indexing !OpenOffice.org documents (http://www.bibl.ulaval.ca/lius/).
LIUS allows metadata and fulltext indexing, using XPath.
  
  For MS-Word, MS-Excel, MS-Visio, and MS-Powerpoint you might also want to take a look at
[http://poi.apache.org Apache POI].
+ 
+ Lucene In Action contains an example of how to extract text from RTF files using the  [http://mail-archives.apache.org/mod_mbox/lucene-java-user/200504.mbox/%3c8c7324c56edcc88bf9c4e58495409b29@ehatchersolutions.com%3e
Swing RTFEditorKit class].
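
As a rough sketch of the RTFEditorKit approach mentioned above (not the code from Lucene In Action), the following uses the Swing classes shipped with the JDK to read an RTF stream into a styled document and return its plain text. The class name `RtfTextExtractor` and the method `extractText` are hypothetical names chosen for this illustration.

```java
import java.io.InputStream;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.rtf.RTFEditorKit;

// Hypothetical helper: extracts plain text from RTF using Swing's RTF reader.
public class RtfTextExtractor {

    public static String extractText(InputStream rtf) throws Exception {
        RTFEditorKit kit = new RTFEditorKit();
        DefaultStyledDocument doc = new DefaultStyledDocument();
        // Parse the RTF markup into the in-memory styled document,
        // inserting at position 0.
        kit.read(rtf, doc, 0);
        // Return the text only; formatting is discarded.
        return doc.getText(0, doc.getLength());
    }
}
```

The returned string can then be added to a Lucene document like any other text. Swing's RTF support is limited, so complex files may lose content; that is an assumption worth checking against your own documents.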
  
  ==== How can I index Email (from MS-Exchange or another IMAP server) ? ====
  
@@ -673, +677 @@

   * http://www.tropo.com/techno/java/lucene/imap.html
   * http://guests.evectors.it/zoe/
  
- ==== How can I index RTF documents? ====
- 
- In order to index RTF documents you need to first parse them to extract text that you want
to index from them.  Lucene In Action contains an example of how to do this using the [http://mail-archives.apache.org/mod_mbox/lucene-java-user/200504.mbox/%3c8c7324c56edcc88bf9c4e58495409b29@ehatchersolutions.com%3e
Swing RTFEditorKit class].
- 
  ==== How can I index PDF documents? ====
  
  In order to index PDF documents you need to first parse them to extract text that you want
to index from them.  Here are some PDF parsers that can help you with that:
@@ -697, +697 @@

  How to parse the output of the JSP depends on the type of content that the JSP generates.
 In most cases the content is going to be in HTML format.
  
  Most importantly, do not try to index JSPs by treating them as normal files in your file
system.  In order to index JSPs properly you need to access them via HTTP, acting like a Web
client.
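
A minimal sketch of "acting like a Web client", assuming the JSP is reachable at a URL you supply and that its output is UTF-8 encoded (both are assumptions, as are the names `JspFetcher` and `fetch`). It requests the rendered page over HTTP and returns the response body, which in most cases is HTML that still needs to be parsed before indexing.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: fetches the rendered output of a JSP over HTTP
// instead of reading the .jsp source file from disk.
public class JspFetcher {

    public static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(pageUrl).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            // Assumption: the page is served as UTF-8.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}
```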
- 
  
  ==== How can I index java source files? ====
  
