lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Trivial Update of "ExtractingRequestHandler" by GrantIngersoll
Date Fri, 14 Nov 2008 18:47:04 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by GrantIngersoll:
http://wiki.apache.org/solr/ExtractingRequestHandler

------------------------------------------------------------------------------
  
  = Getting Started =
  
- * Check out Solr trunk
+  * Check out Solr trunk
- * Apply the patch: <!> TODO: PATCH NAME HERE <!>
+  * Apply the patch: <!> TODO: PATCH NAME HERE <!>
- * Add http://people.apache.org/~gsingers/tika-0.2-SNAPSHOT-standalone.jar to your solr-trunk/lib (the lib directory)
+  * Add http://people.apache.org/~gsingers/tika-0.2-SNAPSHOT-standalone.jar to your solr-trunk/lib (the lib directory)
- * ant clean example  // build the example
+  * ant clean example  // build the example
- * cd example
+  * cd example
- * java -jar start.jar
+  * java -jar start.jar
  
  In a separate window, post a file:
  
- *  curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text -F "myfile=@tutorial.html"
+  *  curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text -F "myfile=@tutorial.html"
  
  or
  
- * curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text --data-binary @tutorial.html -H 'Content-type:text/html'
+  * curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text --data-binary @tutorial.html -H 'Content-type:text/html'
         <!> NOTE: this streams the raw file contents, so Solr receives no information about the name of the file. The !ExtractingRequestHandler will therefore auto-generate an ID for the file unless you specify one by adding a literal value (see below).
  
  or post the file by any other method you prefer.
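
  The NOTE above says that a streamed file gets an auto-generated ID unless a literal value is supplied. The following is a minimal shell sketch of that case, not part of the wiki page. It assumes the schema's unique key field is named id (check your own schema), and the helper names extract_url and post_stream and the SOLR_URL variable are hypothetical.

```shell
# Sketch only. SOLR_URL, extract_url, post_stream and the "id" field name
# are assumptions made for illustration, not taken from the wiki page.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Print the extract URL with an explicit literal id.
# The id is inserted as-is, so it must not contain characters that
# would need URL-encoding.
extract_url() {
  printf '%s/update/extract?ext.idx.attr=true&ext.def.fl=text&ext.literal.id=%s\n' "$SOLR_URL" "$1"
}

# Stream a file as the request body.
# $1 = file, $2 = id, $3 = content type
post_stream() {
  curl "$(extract_url "$2")" --data-binary "@$1" -H "Content-type:$3"
}
```

  With Solr running as described above, this could be called as: post_stream tutorial.html doc1 text/html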
  
  = Input Parameters =
  
- * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute to a Solr field name.  If no mapping is specified, the metadata attribute will be used as the field name.  If the field name doesn't exist, it can be ignored by setting the "ignore undeclared fields" (ext.ignore.und.fl) attribute described below.
+  * ext.map.<Tika Metadata Attribute> = Solr Field Name - Map a Tika metadata attribute to a Solr field name.  If no mapping is specified, the metadata attribute will be used as the field name.  If the field name doesn't exist, it can be ignored by setting the "ignore undeclared fields" (ext.ignore.und.fl) attribute described below.
- * ext.boost.<NAME> = Float -  Boost the field with the specified name.  The NAME value is the name of the Solr field (not the Tika metadata name).
+  * ext.boost.<NAME> = Float -  Boost the field with the specified name.  The NAME value is the name of the Solr field (not the Tika metadata name).
- * ext.literal.<NAME> = <VALUE> - Create a field on the document with field name NAME and literal value VALUE, e.g. ext.literal.foo=bar.
+  * ext.literal.<NAME> = <VALUE> - Create a field on the document with field name NAME and literal value VALUE, e.g. ext.literal.foo=bar.
- * ext.ignore.und.fl = true|false - Default is false.  If true, ignore fields that are extracted but are not in the Solr Schema.  Otherwise, an exception will be thrown for fields that are not mapped.
+  * ext.ignore.und.fl = true|false - Default is false.  If true, ignore fields that are extracted but are not in the Solr Schema.  Otherwise, an exception will be thrown for fields that are not mapped.
- * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content that satisfies the XPath expression.  See http://incubator.apache.org/tika/documentation.html for details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
+  * ext.xpath = <XPath expression> - When extracting, only return Tika XHTML content that satisfies the XPath expression.  See http://incubator.apache.org/tika/documentation.html for details on the format of Tika XHTML.  See also TikaExtractOnlyExampleOutput.
- * ext.extract.only = true|false - Default is false.  If true, return the extracted content from Tika without indexing the document.  This literally includes the extracted XHTML as a <str> in the response.  See TikaExtractOnlyExampleOutput.
+  * ext.extract.only = true|false - Default is false.  If true, return the extracted content from Tika without indexing the document.  This literally includes the extracted XHTML as a <str> in the response.  See TikaExtractOnlyExampleOutput.
- * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named after the attribute.  For example, when extracting from HTML, Tika can return the href values of <a> tags as attributes of a tag name.  See the examples below.
+  * ext.idx.attr = true|false - Index the Tika XHTML attributes into separate fields, named after the attribute.  For example, when extracting from HTML, Tika can return the href values of <a> tags as attributes of a tag name.  See the examples below.
- * ext.def.fl = <NAME> - The name of the field to add the default content to.  See also ext.capture below.  This NAME is not mapped, but it can be boosted.
+  * ext.def.fl = <NAME> - The name of the field to add the default content to.  See also ext.capture below.  This NAME is not mapped, but it can be boosted.
- * ext.capture = <Tika XHTML NAME> - Capture fields with this name separately for adding to the Solr document.
+  * ext.capture = <Tika XHTML NAME> - Capture fields with this name separately for adding to the Solr document.
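
  The parameters listed above are ordinary request parameters, so several can be combined in one query string. The following is a rough shell sketch, not part of the wiki page. The helper name build_query is hypothetical, and so are the metadata attribute "title" and the Solr field "doc_title" in the usage line.

```shell
# Sketch only: build_query is a hypothetical helper that joins
# name=value pairs with '&'. It does no URL-encoding, so names and
# values must already be safe to place in a URL.
build_query() {
  query=""
  for pair in "$@"; do
    if [ -z "$query" ]; then
      query="$pair"
    else
      query="$query&$pair"
    fi
  done
  printf '%s\n' "$query"
}
```

  For instance, mapping a metadata attribute, boosting the mapped field, adding a literal and ignoring undeclared fields might look like: curl "http://localhost:8983/solr/update/extract?$(build_query ext.map.title=doc_title ext.boost.doc_title=2.0 ext.literal.foo=bar ext.ignore.und.fl=true)" -F "myfile=@tutorial.html"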
  
  
  = Examples =
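
  As a hedged illustration of ext.extract.only combined with ext.xpath, here is a small shell sketch. It is not from the wiki page. An XPath expression contains characters such as / and : that need URL-encoding in a query string, so the sketch includes a hypothetical urlencode helper (ASCII input only). The XPath expression in the usage line is only a placeholder. Consult the Tika documentation for what Tika XHTML actually supports.

```shell
# Sketch only: urlencode and extract_only_query are hypothetical helpers.
# Percent-encode every character except unreserved ones (ASCII input only).
urlencode() {
  s="$1"
  out=""
  while [ -n "$s" ]; do
    c="${s%"${s#?}"}"   # first character of s
    s="${s#?}"          # remainder of s
    case "$c" in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Print the query string for an extract-only request restricted by an XPath.
extract_only_query() {
  printf 'ext.extract.only=true&ext.xpath=%s\n' "$(urlencode "$1")"
}
```

  A request might then look like: curl "http://localhost:8983/solr/update/extract?$(extract_only_query '/xhtml:html/xhtml:body//node()')" -F "myfile=@tutorial.html"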
