nutch-dev mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Nutch Wiki] Update of "TikaPlugin" by JulienNioche
Date Wed, 16 Dec 2009 10:13:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "TikaPlugin" page has been changed by JulienNioche.
http://wiki.apache.org/nutch/TikaPlugin?action=diff&rev1=2&rev2=3

--------------------------------------------------

  = Tika Plugin =
- The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first attempt at
delegating the parsing to Tika instead of having to maintain the parser plugins in Nutch.
This page will list the differences in coverage or functionality between the Tika plugin and
the existing Nutch parsers. Tika also has more formats not covered by Nutch which are not
described here.
+ The Tika plugin in http://issues.apache.org/jira/browse/NUTCH-766 is a first attempt at
delegating the parsing to Tika instead of having to maintain the parser plugins in Nutch.
This page lists the differences in coverage or functionality between the Tika plugin and
the existing Nutch parsers. Tika also supports additional formats not covered by Nutch, which
are not described here, and has a more generic way of representing structured content, which
can be useful for HtmlParseFilters (currently limited to HTML content).
  
  '''html''': ?
  
@@ -9, +9 @@

  
  '''mp3''': ?
  
- '''msexcel''': ?
+ '''msexcel''': comparable (+ Tika can represent content in a structured way as XHTML tables,
which can be useful for HTML parser plugins)
  
- '''mspowerpoint''': ?
+ '''mspowerpoint''': comparable
  
- '''msword''': ?
+ '''msword''': Tika does not support Word 95; other versions are comparable
  
- '''openoffice''': ?
+ '''openoffice''': comparable
  
- '''pdf''': ?
+ '''pdf''': comparable
  
  '''rss''': ?
  
- '''rtf''': ?
+ '''rtf''': comparable
  
  '''swf''' : not yet covered in Tika (see https://issues.apache.org/jira/browse/TIKA-337)
  
  '''text''': ?
  
- '''zip''': ?    not covered in Tika
+ '''zip''': ?
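
The msexcel entry notes that Tika can represent spreadsheet content as XHTML tables. As a rough illustration of why that helps downstream filters, the sketch below walks XHTML with the JDK's own SAX parser and collects table cells row by row. It does not use the Tika or Nutch APIs: the class XhtmlTableCells, the handler CellCollector, and the sample markup are hypothetical and hand-written for this example, not output captured from Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not part of Tika or Nutch.
public class XhtmlTableCells {

    /** Collects the text of every <td>, grouped into one list per <tr>. */
    static class CellCollector extends DefaultHandler {
        final List<List<String>> rows = new ArrayList<>();
        private List<String> currentRow;   // non-null while inside a <tr>
        private StringBuilder cell;        // non-null while inside a <td>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default SAXParserFactory is not namespace-aware, so qName holds the tag name.
            if ("tr".equals(qName)) {
                currentRow = new ArrayList<>();
            } else if ("td".equals(qName)) {
                cell = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (cell != null) {
                cell.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("td".equals(qName) && currentRow != null && cell != null) {
                currentRow.add(cell.toString().trim());
                cell = null;
            } else if ("tr".equals(qName) && currentRow != null) {
                rows.add(currentRow);
                currentRow = null;
            }
        }
    }

    /** Parses well-formed XHTML and returns the table cells, one inner list per row. */
    static List<List<String>> extractRows(String xhtml) throws Exception {
        CellCollector collector = new CellCollector();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xhtml)), collector);
        return collector.rows;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input standing in for a parser's XHTML view of a spreadsheet.
        String xhtml = "<html><body><table>"
            + "<tr><td>Name</td><td>Qty</td></tr>"
            + "<tr><td>Widget</td><td>3</td></tr>"
            + "</table></body></html>";
        System.out.println(extractRows(xhtml)); // prints [[Name, Qty], [Widget, 3]]
    }
}
```

Because the handler only sees generic SAX events, the same code would apply to any source format that is rendered as XHTML tables, which is the benefit the entry hints at for HTML parser plugins.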
  
