lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "ExtractingRequestHandler" by GrantIngersoll
Date Sat, 15 Nov 2008 16:17:29 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The following page has been changed by GrantIngersoll:
http://wiki.apache.org/solr/ExtractingRequestHandler

------------------------------------------------------------------------------
   * When specifying parameters, keep in mind what input a given operation works on.
For instance, captured fields are specified to the !SolrContentHandler, which captures
content from the Tika XHTML.  Thus, the names of the captured fields are those of the XHTML,
not the mapped names.
   * A default field name is required for indexing, but not for extraction only.
   * The default field name and any literal values are not mapped, but they can be boosted.
See the examples.
+ 
+ == Identifiers ==
+ 
+ If you do not pass in a value for a unique ID field, and your schema requires one, the !SolrContentHandler
will attempt to generate an ID for you.  The code for this looks like:
+ {{{
+   protected String generateId(SchemaField uniqueField) {
+     //we don't have a unique field specified, so let's add one
+     String uniqId = null;
+     FieldType type = uniqueField.getType();
+     if (type instanceof StrField || type instanceof TextField) {
+       uniqId = metadata.get(ExtractingMetadataConstants.STREAM_NAME);
+       if (uniqId == null) {
+         uniqId = metadata.get(ExtractingMetadataConstants.STREAM_SOURCE_INFO);
+       }
+       if (uniqId == null) {
+         uniqId = metadata.get(Metadata.IDENTIFIER);
+       }
+       if (uniqId == null) {
+         //last chance, just create one
+         uniqId = UUID.randomUUID().toString();
+       }
+     } else if (type instanceof UUIDField) {
+       uniqId = UUID.randomUUID().toString();
+     } else {
+       uniqId = String.valueOf(getNextId());
+     }
+     return uniqId;
+   }
+ }}}
+ 
+ NOTE: You can override this behavior by implementing your own !SolrContentHandler, as described below.
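
The fallback order used for string-typed ID fields can be tried out in isolation. The following is a minimal, self-contained sketch, not the handler's actual code: a plain `Map` stands in for the Tika metadata object, and the class name, method name, and key strings are hypothetical placeholders for the real constants.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

public class IdFallbackSketch {
    // Hypothetical key names; the real handler uses constants from
    // ExtractingMetadataConstants and Tika's Metadata class.
    public static final String STREAM_NAME = "stream_name";
    public static final String STREAM_SOURCE_INFO = "stream_source_info";
    public static final String IDENTIFIER = "identifier";

    /**
     * Returns the first non-null value among stream name, stream source info,
     * and identifier (in that order), or a random UUID if none is present.
     */
    public static String pickStringId(Map<String, String> metadata) {
        String[] candidates = {STREAM_NAME, STREAM_SOURCE_INFO, IDENTIFIER};
        for (String key : candidates) {
            String value = metadata.get(key);
            if (value != null) {
                return value;
            }
        }
        // Last chance: nothing usable in the metadata, so create an ID.
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put(IDENTIFIER, "doc-42");
        System.out.println(pickStringId(metadata)); // doc-42

        // A stream name takes precedence over the identifier.
        metadata.put(STREAM_NAME, "tutorial.pdf");
        System.out.println(pickStringId(metadata)); // tutorial.pdf

        // With no metadata at all, a random UUID is generated.
        metadata.clear();
        System.out.println(pickStringId(metadata));
    }
}
```

One consequence of this order is that two uploads with the same stream name would receive the same generated ID, so passing an explicit ID is the safer choice when stream names may repeat.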
  
  = Getting Started =
  
@@ -117, +149 @@

   curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.def.fl=text\&ext.map.div=foo_t\&ext.capture=div\&ext.boost.foo_t=3\&ext.literal.blah_i=1 -F "tutorial=@tutorial.pdf"
  }}}
  
- == Extract Only: ==
+ == Extract Only ==
  {{{
  curl http://localhost:8983/solr/update/extract?ext.idx.attr=true\&ext.extract.only=true --data-binary @tutorial.html -H 'Content-type:text/html'
  }}}
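
The same extract-only request can be assembled from Java. The sketch below assumes Java 11 or later (for `java.net.http`) and only builds the request without sending it. The class and method names are hypothetical, and the body is passed as a string rather than read from `tutorial.html` as the curl command does.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ExtractOnlyRequestSketch {
    /**
     * Builds a POST to the extract endpoint with the same two parameters
     * as the curl example (ext.idx.attr and ext.extract.only).
     */
    public static HttpRequest build(String solrBaseUrl, String html) {
        URI uri = URI.create(
            solrBaseUrl + "/update/extract?ext.idx.attr=true&ext.extract.only=true");
        return HttpRequest.newBuilder(uri)
            .header("Content-Type", "text/html")
            .POST(HttpRequest.BodyPublishers.ofString(html))
            .build();
    }

    public static void main(String[] args) {
        // The base URL matches the curl example; adjust it for your installation.
        HttpRequest request =
            build("http://localhost:8983/solr", "<html><body>Hello</body></html>");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send the request, pass it to `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` while a Solr instance is running at that address.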
@@ -125, +157 @@

  See TikaExtractOnlyExampleOutput.
  
  
- == Customizing ==
+ = Customizing =
  
  While the current !ExtractingRequestHandler only allows for the use of the !SolrContentHandler
in creating new documents, it is relatively easy to implement your own extension that processes
the Tika extracted content differently and produces a different !SolrInputDocument.
  
