lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "TikaEntityProcessor" by NoblePaul
Date Fri, 11 Dec 2009 07:34:52 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "TikaEntityProcessor" page has been changed by NoblePaul.
http://wiki.apache.org/solr/TikaEntityProcessor?action=diff&rev1=1&rev2=2

--------------------------------------------------

  = Configuration =
  Sample configuration
  {{{
- 
- <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" format="text">
+ <entity processor="TikaEntityProcessor" tikaConfig="tikaconfig.xml" url="${some.var.goes.here}" dataSource="bin" format="text">
        <!-- Do appropriate mapping here. meta="true" means it is a metadata field -->
        <field column="Author" meta="true" name="author"/>
        <field column="title" meta="true" name="docTitle"/>
        <!--'text' is an implicit field emitted by TikaEntityProcessor. Map it appropriately-->
        <field column="text"/>
- </entity>
+ </entity>  
-   
  }}}
+ === attributes ===
+  * url : (required) The URL of the source. Its form depends on the !DataSource being used.
+  * tikaConfig : (optional) The Tika config file. If missing, the default config is used. A relative path is resolved against the conf dir.
+  * format : (optional) Output format. Values are text|xml|html|none; the default is 'text'. Irrespective of the format, the body is emitted as a field called 'text'; only the format of its content differs. Use 'none' if the body is not to be parsed, i.e. only metadata is emitted.
+  * parser : (optional) Default is org.apache.tika.parser.!AutoDetectParser. Provide the FQN of a class that implements org.apache.tika.parser.Parser.
  
+ ==== fields ====
+ Each field may have an optional attribute meta="true", which means the field is obtained from the !MetaData of the document. The column value is used as the key into the metadata. Check the list of available keys here: [[http://svn.apache.org/viewvc/lucene/tika/trunk/tika-core/src/main/java/org/apache/tika/metadata/DublinCore.java?revision=801678&view=markup|DublinCore]], [[http://svn.apache.org/viewvc/lucene/tika/trunk/tika-core/src/main/java/org/apache/tika/metadata/MSOffice.java?revision=801678&view=markup|MSOffice]]
+ 
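
As an illustration of the attributes and meta fields described in the diff above, a metadata-only setup could be sketched roughly as follows. This is not part of the wiki change. The surrounding dataConfig/document wrapper, the entity name "tika", and the BinFileDataSource type registered under the name "bin" are assumptions made for the example; the url placeholder is kept exactly as it appears in the sample.

```xml
<dataConfig>
  <!-- Assumption: a binary data source registered under the name "bin" -->
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- format="none": the body is not parsed, so only metadata is emitted -->
    <!-- name="tika" is a hypothetical entity name chosen for this sketch -->
    <entity name="tika" processor="TikaEntityProcessor"
            url="${some.var.goes.here}" dataSource="bin" format="none">
      <!-- column is the metadata key; name is the target Solr field -->
      <field column="Author" meta="true" name="author"/>
      <field column="title" meta="true" name="docTitle"/>
    </entity>
  </document>
</dataConfig>
```

With format set to text, xml or html instead, the implicit 'text' column would also be emitted and could be mapped with an additional field element, as in the sample configuration.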
