lucene-commits mailing list archives

From "Cassandra Targett (Confluence)" <conflue...@apache.org>
Subject [CONF] Apache Solr Reference Guide > Uploading Data with Index Handlers
Date Wed, 18 Sep 2013 15:32:01 GMT
Space: Apache Solr Reference Guide (https://cwiki.apache.org/confluence/display/solr)
Page: Uploading Data with Index Handlers (https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers)


Edited by Cassandra Targett:
---------------------------------------------------------------------
{section}
{column:width=75%}
Index Handlers are Update Handlers designed to add, delete, and update documents in the index.
Solr includes several of these to allow indexing documents in XML, CSV and JSON.

The example URLs given here reflect the handler configuration in the supplied {{solrconfig.xml}}.
If the name associated with the handler is changed then the URLs will need to be modified.
It is possible to access the same handler using more than one name, which can be useful if
you wish to specify different sets of default options.

Three new {{UpdateProcessors}}, {{TimestampUpdateProcessorFactory}}, {{UUIDUpdateProcessorFactory}},
and {{DefaultValueUpdateProcessorFactory}}, automatically add fields with new UUIDs and timestamps
to {{SolrInputDocuments}}. These work similarly to the <field default="..."/> option
in {{schema.xml}}, but are applied in the {{UpdateProcessorChain}}. They may therefore be used prior
to other {{UpdateProcessors}}, or to generate a {{uniqueKey}} field value when using the {{DistributedUpdateProcessor}}
(i.e., SolrCloud). These processors now default to the {{uniqueKey}} field if it is of the
appropriate type.
{column}

{column:width=25%}
{panel}
Index Handlers covered in this section:
{toc:minLevel=3|maxLevel=3}
{panel}
{column}
{section}

h2. Combined UpdateRequestHandlers

Prior to Solr 4, uploading content with an update request handler required defining the format
of the content in the request. Now, the separate XML, CSV, JSON, and javabin update request
handlers choose the appropriate {{ContentStreamLoader}} based on the {{Content-Type}} header.
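
As a rough illustration of the client side of this, the sketch below picks a {{Content-Type}} value from a file extension. The helper name {{content_type_for}} and the extension table are assumptions made for this example and are not part of Solr; check the MIME types against the loaders your Solr version registers.

```python
import os

# Hypothetical mapping from file extension to the Content-Type header a
# client could send; not taken from Solr itself.
CONTENT_TYPES = {
    ".xml": "application/xml",
    ".json": "application/json",
    ".csv": "text/csv",
    ".javabin": "application/javabin",
}

def content_type_for(path):
    """Return the Content-Type to send for the given file, or raise ValueError."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return CONTENT_TYPES[ext]
    except KeyError:
        raise ValueError("no known Content-Type for %r" % path)

print(content_type_for("books.json"))
```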


The request handler can also be entered with the {{qt}} (query type) parameter matching the
name of registered handlers. The "standard" request handler is the default and will be used
if {{qt}} is not specified in the request.

{topofpage}
h3. XMLUpdateRequestHandler for XML-formatted Data

h4. Configuration

The default configuration file includes the update request handler.

{code:language=html/xml|borderStyle=solid|borderColor=#666666}
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
{code}

h4. Adding Documents

Documents are added to the index by sending an XML message to the update handler.

The XML schema recognized by the update handler is very straightforward:

* The {{<add>}} element introduces one or more documents to be added.
* The {{<doc>}} element introduces the fields making up a document.
* The {{<field>}} element presents the content for a specific field.

For example:

{code:language=html/xml|borderStyle=solid|borderColor=#666666}
<add>
  <doc>
   <field name="authors">Patrick Eagar</field>
   <field name="subject">Sports</field>
   <field name="dd">796.35</field>
   <field name="numpages">128</field>
   <field name="desc"></field>
   <field name="price">12.40</field>
   <field name="title" boost="2.0">Summer of the all-rounder: Test and championship cricket in England 1982</field>
   <field name="isbn">0002166313</field>
   <field name="yearpub">1982</field>
   <field name="publisher">Collins</field>
  </doc>
  <doc boost="2.5">
  ...
  </doc>
</add>
{code}
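
A message of this shape can also be generated rather than written by hand. The following is a minimal sketch using Python's standard {{xml.etree.ElementTree}} module; the function name {{build_add}} is made up for this example. Building the message through an XML library means characters such as {{&}} and {{<}} in field values are escaped automatically.

```python
import xml.etree.ElementTree as ET

def build_add(docs):
    """Build an <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(add, encoding="unicode")

message = build_add([{"isbn": "0002166313", "subject": "Sports", "numpages": 128}])
print(message)
```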

If the document schema defines a unique key, then an {{/update}} operation silently replaces
a document in the index with the same unique key, unless the {{<add>}} element sets
the {{allowDups}} attribute to {{true}}. If no unique key has been defined, indexing performance
is somewhat faster, as no search has to be made for an existing document.

Each element has certain optional attributes which may be specified.

|| Command || Command Description || Optional Parameter || Parameter Description ||
| <add> | Introduces one or more documents to be added to the index. | commitWithin=_number_ | Add the document within the specified number of milliseconds |
| <doc> | Introduces the definition of a specific document. | boost=_float_ | Default is 1.0. Sets a boost value for the document. To learn more about boosting, see [Searching]. |
| <field> | Defines a field within a document. | boost=_float_ | Default is 1.0. Sets a boost value for the field. |

{note}
Other optional parameters for {{<add>}}, including {{allowDups}}, {{overwritePending}},
and {{overwriteCommitted}}, are now deprecated. However, you can specify {{overwrite=false}}
for XML updates to avoid overwriting.
{note}

h4. Commit and Optimize Operations

The {{<commit>}} operation writes all documents loaded since the last commit to one
or more segment files on the disk. Before a commit has been issued, newly indexed content
is not visible to searches. The commit operation opens a new searcher, and triggers any event
listeners that have been configured.

Commits may be issued explicitly with a {{<commit/>}} message, and can also be triggered
from {{<autoCommit>}} parameters in {{solrconfig.xml}}.

The {{<optimize>}} operation requests Solr to merge internal data structures in order
to improve search performance. For a large index, optimization will take some time to complete,
but by merging many small segment files into a larger one, search performance will improve.
If you are using Solr's replication mechanism to distribute searches across many systems,
be aware that after an optimize, a complete index will need to be transferred. In contrast,
post-commit transfers are usually much smaller.

The {{<commit>}} and {{<optimize>}} elements accept these optional attributes:

|| Optional Attribute || Description ||
| maxSegments | Default is 1. Optimizes the index to include no more than this number of segments. |
| waitFlush | Default is true. Blocks until index changes are flushed to disk. |
| waitSearcher | Default is true. Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible. |
| expungeDeletes | Default is false. Merges segments and removes deleted documents. |

Here are examples of <commit> and <optimize> using optional attributes:

{code:language=html/xml|borderStyle=solid|borderColor=#666666}
<commit waitFlush="false" waitSearcher="false"/>
<commit waitFlush="false" waitSearcher="false" expungeDeletes="true"/>
<optimize waitFlush="false" waitSearcher="false"/>
{code}
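
For clients that assemble these messages in code, one possible approach is sketched below. The function name {{build_command}} is hypothetical; the attribute names are the ones listed in the table above.

```python
import xml.etree.ElementTree as ET

def build_command(name, **attrs):
    """Build a <commit/> or <optimize/> element with optional attributes."""
    if name not in ("commit", "optimize"):
        raise ValueError("expected 'commit' or 'optimize'")
    element = ET.Element(name)
    for key, value in attrs.items():
        # Write booleans as lowercase true/false, the form used in the examples.
        if isinstance(value, bool):
            value = "true" if value else "false"
        element.set(key, str(value))
    return ET.tostring(element, encoding="unicode")

print(build_command("commit", waitFlush=False, waitSearcher=False))
```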

h4. Delete Operations

Documents can be deleted from the index in two ways. "Delete by ID" deletes the document with
the specified ID, and can be used only if a {{uniqueKey}} field has been defined in the schema.
"Delete by Query" deletes all documents matching a specified query, although {{commitWithin}}
is ignored for a Delete by Query. A single delete message can contain multiple delete operations.

{code:language=html/xml|borderStyle=solid|borderColor=#666666}
<delete>
  <id>0002166313</id>
  <id>0031745983</id>
  <query>subject:sport</query>
  <query>publisher:penguin</query>
</delete>
{code}
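
A delete message combining several IDs and queries could be generated along these lines. This is only a sketch; {{build_delete}} is a name invented for the example.

```python
import xml.etree.ElementTree as ET

def build_delete(ids=(), queries=()):
    """Build a <delete> message holding any number of <id> and <query> children."""
    delete = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(delete, "id").text = str(doc_id)
    for query in queries:
        ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")

print(build_delete(ids=["0002166313"], queries=["subject:sport"]))
```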

h4. Rollback Operations

The rollback command rolls back all adds and deletes made to the index since the last commit.
It neither calls any event listeners nor creates a new searcher. Its syntax is simple: {{<rollback/>}}.

h4. Using {{curl}} to Perform Updates with the Update Request Handler

You can use the {{curl}} utility to perform any of the above commands, using its {{\--data-binary}}
option to append the XML message to the {{curl}} command and generate an HTTP POST request.
For example:

{code:language=html/xml|borderStyle=solid|borderColor=#666666}
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary '
<add>
 <doc>
  <field name="authors">Patrick Eagar</field>
  <field name="subject">Sports</field>
  <field name="dd">796.35</field>
  <field name="isbn">0002166313</field>
  <field name="yearpub">1982</field>
  <field name="publisher">Collins</field>
 </doc>
</add>'
{code}

For posting XML messages contained in a file, you can use the alternative form:

{code:language=none|borderStyle=solid|borderColor=#666666}
curl http://localhost:8983/solr/update -H "Content-Type: text/xml" --data-binary @myfile.xml
{code}

Short requests can also be sent using an HTTP GET command, URL-encoding the request, as in
the following. Note the escaping of "<" and ">":

{code:language=none|borderStyle=solid|borderColor=#666666}
curl "http://localhost:8983/solr/update?stream.body=%3Ccommit/%3E"
{code}
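
The escaping can be done programmatically. A minimal sketch with Python's standard {{urllib.parse.quote}} follows; the function name {{get_update_url}} and the base URL are assumptions for this example.

```python
from urllib.parse import quote

def get_update_url(base_url, xml_message):
    """Return a GET URL carrying the XML message in the stream.body parameter."""
    # quote() percent-encodes "<" and ">" and, by default, leaves "/" as-is.
    return base_url + "?stream.body=" + quote(xml_message)

print(get_update_url("http://localhost:8983/solr/update", "<commit/>"))
```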

Responses from Solr take the form shown here:

{code:language=html/xml|borderStyle=solid|borderColor=#666666}
<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">127</int>
 </lst>
</response>
{code}

The status field will be non-zero in case of failure. The servlet container will generate
an appropriate HTML-formatted message in the case of an error at the HTTP layer.
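
A client might check the status along the following lines. This is a sketch using the standard {{xml.etree.ElementTree}} module; {{response_status}} is a hypothetical helper, and the sample bytes simply repeat the response shown above.

```python
import xml.etree.ElementTree as ET

def response_status(xml_bytes):
    """Return the integer status from a Solr XML response header."""
    root = ET.fromstring(xml_bytes)
    status = root.find("./lst[@name='responseHeader']/int[@name='status']")
    if status is None:
        raise ValueError("no status found in response")
    return int(status.text)

sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<response>
 <lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">127</int>
 </lst>
</response>"""

if response_status(sample) != 0:
    raise RuntimeError("update failed")
```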

h3. XSLTRequestHandler to Transform XML Content

h4. Configuration

The default configuration file includes this request handler, although
the "lazy load" flag is set.

The XSLTRequestHandler allows you to index any XML data with the [XML {{<tr>}} command|http://xmlstar.sourceforge.net/doc/UG/ch04s02.html].
You must have an XSLT stylesheet in the solr/conf/xslt directory that can transform the incoming
data to the expected {{<add><doc/></add>}} format.

{code:language=html/xml|borderStyle=solid|borderColor=#666666}
<requestHandler name="/update/xslt" startup="lazy" class="solr.XsltUpdateRequestHandler"/>
{code}

Here is an example XSLT stylesheet:

{code:language=html/xml|borderStyle=solid|borderColor=#666666}
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <add>
      <xsl:apply-templates select="/random/document"/>
    </add>
  </xsl:template>

  <xsl:template match="document">

    <doc boost="5.5">
      <xsl:apply-templates select="*"/>
    </doc>
  </xsl:template>

  <xsl:template match="node">
    <field name="{@name}">
      <xsl:if test="@enhance!=''">
        <xsl:attribute name="boost"><xsl:value-of select="@enhance"/></xsl:attribute>
      </xsl:if>
      <xsl:value-of select="@value"/>
    </field>
  </xsl:template>

</xsl:stylesheet>
{code}

Attaching the stylesheet "updateXml.xsl" transforms a search result to Solr's {{UpdateXml}}
syntax. One example is to copy a Solr 1.3 index (which does not have a CSV response writer) into
a format which can be indexed into another Solr index (provided that all fields are stored):

{code:language=none|borderStyle=solid|borderColor=#666666}
http://localhost:8983/solr/select?q=*:*&wt=xslt&tr=updateXml.xsl&rows=1000
{code}

You can also use the stylesheet in {{XsltUpdateRequestHandler}} to transform an index when
updating:

{code:language=none|borderStyle=solid|borderColor=#666666}
curl "http://localhost:8983/solr/update/xslt?commit=true&tr=updateX
{code}

For more information about the XML Update Request Handler, see [https://wiki.apache.org/solr/UpdateXmlMessages].

{topofpage}

h3. Using the JSONRequestHandler for JSON Content

JSON formatted update requests may be sent to Solr using the {{/solr/update/json}} URL. All
of the normal methods for uploading content are supported.

h4. Configuration

The default configuration file includes this request handler, although
the "lazy load" flag is set.

{code:language=html/xml|borderStyle=solid|borderColor=#666666}
<requestHandler name="/update/json" class="solr.JsonUpdateRequestHandler" startup="lazy" />
{code}

h4. Examples

There is a sample JSON file at {{example/exampledocs/books.json}} that you can use to add
documents to the Solr example server.

{code:language=none|borderStyle=solid|borderColor=#666666}
cd example/exampledocs
curl 'http://localhost:8983/solr/update/json?commit=true' \
 --data-binary @books.json -H 'Content-type:application/json'
{code}

Adding {{commit=true}} to the URL makes the documents immediately searchable.

You should now be able to query for the newly added documents:

{{[http://localhost:8983/solr/select?q=title:monsters&wt=json&indent=true]}} returns:

{code:language=none|borderStyle=solid|borderColor=#666666}
{
  "responseHeader":{
    "status":0,
    "QTime":2,
    "params":{
      "indent":"true",
      "wt":"json",
      "q":"title:monsters"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"978-1423103349",
        "author":"Rick Riordan",
        "series_t":"Percy Jackson and the Olympians",
        "sequence_i":2,
        "genre_s":"fantasy",
        "inStock":true,
        "price":6.49,
        "pages_i":304,
        "title":[
          "The Sea of Monsters"],
        "cat":["book","paperback"]}]
  }
}
{code}

h4. Update Commands

The JSON update handler accepts all of the update commands that the XML update handler supports,
through a straightforward mapping. Multiple commands may be contained in one message:

{code:language=none|borderStyle=solid|borderColor=#666666}
{
"add": {
  "doc": {
    "id": "DOC1",
    "my_boosted_field": {        /* use a map with boost/value for a boosted field */
      "boost": 2.3,
      "value": "test"
    },
    "my_multivalued_field": [ "aaa", "bbb" ]   /* use an array for a multi-valued field */
  }
},
"add": {
  "commitWithin": 5000,          /* commit this document within 5 seconds */
  "overwrite": false,            /* don't check for existing documents with the same uniqueKey
*/
  "boost": 3.45,                 /* a document boost */
  "doc": {
    "f1": "v1",
    "f1": "v2"
  }
},

"commit": {},
"optimize": { "waitFlush":false, "waitSearcher":false },

"delete": { "id":"ID" },         /* delete by ID */
"delete": { "query":"QUERY" }    /* delete by query */
}
{code}

{note}
Comments are not allowed in JSON, but duplicate names are. The comments in the example above are for explanation only.
{note}
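
Because a message may repeat names such as {{add}} and {{delete}}, it cannot be produced directly from a map type that holds one value per key. The sketch below shows one way around this in Python, serializing an ordered list of (command, body) pairs by hand; {{build_update}} is a name invented for this example, and the document IDs are placeholders.

```python
import json

def build_update(commands):
    """Serialize (command, body) pairs into one JSON object, keeping duplicate names."""
    members = ("%s:%s" % (json.dumps(name), json.dumps(body)) for name, body in commands)
    return "{" + ",".join(members) + "}"

message = build_update([
    ("add", {"doc": {"id": "DOC1"}}),
    ("add", {"commitWithin": 5000, "doc": {"id": "DOC2"}}),
    ("commit", {}),
    ("delete", {"id": "ID"}),
])
print(message)
```

Each body is still serialized with {{json.dumps}}, so only the top-level object is assembled manually.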

As with other update handlers, parameters such as {{commit}}, {{commitWithin}}, {{optimize}},
and {{overwrite}} may be specified in the URL instead of in the body of the message.

The JSON update format allows for a simple delete-by-id. The value of a {{delete}} can be
a single document ID, or an array which contains a list of zero or more specific document IDs
(not a range) to be deleted. For example:

{code:language=none|borderStyle=solid|borderColor=#666666}
"delete":"myid"
{code}

{code:language=none|borderStyle=solid|borderColor=#666666}
"delete":["id1","id2"]
{code}

You can also specify {{\_version\_}} with each "delete":
{code:language=none|borderStyle=solid|borderColor=#666666}
"delete": { "id":50, "_version_":12345 }
{code}
You can specify the version of deletes in the body of the update request as well.

For more information about the JSON Update Request Handler, see [https://wiki.apache.org/solr/UpdateJSON].

{topofpage}

h3. CSVRequestHandler for CSV Content

h4. Configuration

The default configuration file includes this request handler, although
the "lazy load" flag is set.

{code:language=html/xml|borderStyle=solid|borderColor=#666666}
<requestHandler name="/update/csv" class="solr.CSVRequestHandler" startup="lazy" />
{code}

h4. Parameters

The CSV handler allows the specification of many parameters in the URL in the form: {{f.}}{{{}{_}parameter{_}{}}}{{.}}{{{}{_}optional_fieldname{_}{}}}{{=}}{{{}{_}value{_}}}.

The table below describes the parameters for the update handler.

|| Parameter || Usage || Global (g) or Per Field (f) || Example ||
| separator | Character used as field separator; default is "," | g,(f: see split) | separator=% |
| trim | If true, remove leading and trailing whitespace from values. Default=false. | g,f | f.isbn.trim=true \\ trim=false |
| header | Set to true if first line of input contains field names. These will be used if the *fieldnames* parameter is absent. | g | |
| fieldnames | Comma separated list of field names to use when adding documents. | g | fieldnames=isbn,price,title |
| literal.<field_name> | Comma separated list of field names to use when processing literal values. | g | literal.color=red,blue,black |
| skip | Comma separated list of field names to skip. | g | skip=uninteresting,shoesize |
| skipLines | Number of lines to discard in the input stream before the CSV data starts, including the header, if present. Default=0. | g | skipLines=5 |
| encapsulator | The character optionally used to surround values to preserve characters such as the CSV separator or whitespace. This standard CSV format handles the encapsulator itself appearing in an encapsulated value by doubling the encapsulator. | g,(f: see split) | encapsulator=" |
| escape | The character used for escaping CSV separators or other reserved characters. If an escape is specified, the encapsulator is not used unless also explicitly specified, since most formats use either encapsulation or escaping, not both. | g | escape=\ \\ |
| keepEmpty | Keep and index zero length (empty) fields. Default=false. | g,f | f.price.keepEmpty=true |
| map | Map one value to another. Format is value:replacement (which can be empty). | g,f | map=left:right \\ f.subject.map=history:bunk |
| split | If true, split a field into multiple values by a separate parser. | f | |
| overwrite | If true (the default), check for and overwrite duplicate documents, based on the uniqueKey field declared in the Solr schema. If you know the documents you are indexing do not contain any duplicates, you may see a considerable speed-up by setting this to false. | g | |
| commit | Issues a commit after the data has been ingested. | g | |
| commitWithin | Add the document within the specified number of milliseconds. | g | commitWithin=10000 |
| rowid | Map the rowid (line number) to a field specified by the value of the parameter, for instance if your CSV doesn't have a unique key and you want to use the row id as such. | g | rowid=id |
| rowidOffset | Add the given offset (as an int) to the rowid before adding it to the document. Default is 0. | g | rowidOffset=10 |
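
Several of these parameters contain characters that must be URL-encoded, such as a tab separator or the commas in a list value. The sketch below assembles a request URL with Python's standard {{urllib.parse.urlencode}}; the helper name {{csv_update_url}}, the base URL, and the particular parameter values are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

def csv_update_url(base_url, params):
    """Append URL-encoded CSV handler parameters to the handler URL."""
    return base_url + "?" + urlencode(params)

url = csv_update_url(
    "http://localhost:8983/solr/update/csv",
    [
        ("commit", "true"),
        ("separator", "\t"),              # a tab is sent as %09
        ("f.isbn.trim", "true"),          # per-field form of trim
        ("skip", "uninteresting,shoesize"),
    ],
)
print(url)
```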

For more information on the CSV Update Request Handler, see [https://wiki.apache.org/solr/UpdateCSV].

{topofpage}

{scrollbar}

