lucene-commits mailing list archives

From cpoersc...@apache.org
Subject [05/50] [abbrv] lucene-solr:jira/solr-8668: squash merge jira/solr-10290 into master
Date Fri, 12 May 2017 13:42:58 GMT
http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/95968c69/solr/solr-ref-guide/src/upgrading-solr.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/upgrading-solr.adoc b/solr/solr-ref-guide/src/upgrading-solr.adoc
new file mode 100644
index 0000000..6ec78f1
--- /dev/null
+++ b/solr/solr-ref-guide/src/upgrading-solr.adoc
@@ -0,0 +1,62 @@
+= Upgrading Solr
+:page-shortname: upgrading-solr
+:page-permalink: upgrading-solr.html
+
+If you are already using Solr 6.5, Solr 6.6 should not present any major problems. However, you should review the {solr-javadocs}/changes/Changes.html[`CHANGES.txt`] file found in your Solr package for changes and updates that may affect your existing implementation. Detailed steps for upgrading a Solr cluster can be found in the appendix: <<upgrading-a-solr-cluster.adoc#upgrading-a-solr-cluster,Upgrading a Solr Cluster>>.
+
+[[UpgradingSolr-Upgradingfrom6.5.x]]
+== Upgrading from 6.5.x
+
+* <TBD>
+
+[[UpgradingSolr-Upgradingfromearlier6.xversions]]
+== Upgrading from earlier 6.x versions
+
+* If you use historical dates, specifically on or before the year 1582, you should re-index after upgrading to this version.
+* If you use the JSON Facet API (json.facet) with `method=stream`, you must now set `sort='index asc'` to get the streaming behavior; otherwise it won't stream. Reminder: "method" is a hint that doesn't change defaults of other parameters.
+* If you use the JSON Facet API (json.facet) to facet on a numeric field with `mincount=0` or with a prefix set, you will now get an error, as these options are incompatible with numeric faceting.
+* Solr's logging verbosity at the INFO level has been greatly reduced, and you may need to update the log configs to use the DEBUG level to see all the logging messages you used to see at INFO level before.
+* We are no longer backing up `solr.log` and `solr_gc.log` files in date-stamped copies forever. If you relied on the `solr_log_<date>` or `solr_gc_log_<date>` files being in the logs folder, that will no longer be the case. See the section <<configuring-logging.adoc#configuring-logging,Configuring Logging>> for details on how log rotation works as of Solr 6.3.
+* The create/deleteCollection methods on MiniSolrCloudCluster have been deprecated. Clients should instead use the CollectionAdminRequest API. In addition, `MiniSolrCloudCluster#uploadConfigDir(File, String)` has been deprecated in favour of `#uploadConfigSet(Path, String)`.
+* The `bin/solr.in.sh` script (`bin/solr.in.cmd` on Windows) is now completely commented out by default. Previously, the settings it shipped with were active, which had the effect of masking pre-existing environment variables.
+* The `\_version_` field is no longer indexed and is now defined with `indexed=false` by default, because the field has DocValues enabled.
+* The `/export` handler has been changed so it no longer returns zero (0) for numeric fields that are not in the original document. As a consequence, some tuples will not have values for fields that were absent from the original document.
+* Metrics-related classes in `org.apache.solr.util.stats` have been removed in favor of the http://metrics.dropwizard.io/3.1.0/[Dropwizard metrics library]. Any custom plugins using these classes should be changed to use the equivalent classes from the metrics library. As part of this, the following changes were made to the output of Overseer Status API:
+** The "totalTime" metric has been removed because it is no longer supported.
+** The metrics "75thPctlRequestTime", "95thPctlRequestTime", "99thPctlRequestTime" and "999thPctlRequestTime" in Overseer Status API have been renamed to "75thPcRequestTime", "95thPcRequestTime" and so on for consistency with stats output in other parts of Solr.
+** The metrics "avgRequestsPerMinute", "5minRateRequestsPerMinute" and "15minRateRequestsPerMinute" have been replaced by corresponding per-second rates viz. "avgRequestsPerSecond", "5minRateRequestsPerSecond" and "15minRateRequestsPerSecond" for consistency with stats output in other parts of Solr.
+* A new highlighter named UnifiedHighlighter has been added. You are encouraged to try it out by setting `hl.method=unified` and report feedback. It might become the default in 7.0. It's more efficient and faster than the other highlighters, especially compared to the original Highlighter. That said, some options aren't supported yet. It will get more features in time, especially with your input. See HighlightParams.java for a listing of highlight parameters annotated with which highlighters use them. `hl.useFastVectorHighlighter` is now deprecated in favor of `hl.method=fastVector`.
+* The <<query-settings-in-solrconfig.adoc#query-settings-in-solrconfig,`maxWarmingSearchers` parameter>> now defaults to 1, and more importantly commits will now block if this limit is exceeded instead of throwing an exception (a good thing). Consequently, there is no longer a risk of overlapping commits. Nonetheless, users should continue to avoid excessive committing, and are advised to remove any pre-existing `maxWarmingSearchers` entries from their `solrconfig.xml` files.
+* The <<other-parsers.adoc#OtherParsers-ComplexPhraseQueryParser,Complex Phrase query parser>> now supports leading wildcards. Beware that such queries can be expensive; users are encouraged to use ReversedWildcardFilter in index-time analysis.
+* The JMX metric "avgTimePerRequest" (and the corresponding metric in the metrics API for each handler) used to be a simple non-decaying average based on total cumulative time and the number of requests. The new Codahale Metrics implementation applies exponential decay to this value, which heavily biases the average towards the last 5 minutes.
+* Index-time boosts are now deprecated. As a replacement, index-time scoring factors should be indexed in a separate field and combined with the query score using a function query. These boosts will be removed in Solr 7.0.
+* Parallel SQL now uses Apache Calcite as its SQL framework. As part of this change the default aggregation mode has been changed to facet rather than map_reduce. There have also been changes to the SQL aggregate response and some SQL syntax changes. Consult the <<parallel-sql-interface.adoc#parallel-sql-interface,Parallel SQL Interface>> documentation for full details.
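+
+For example, the UnifiedHighlighter mentioned above can be tried with a request such as the following (the collection and field names here are only illustrative):
+
+[source,plain]
+----
+http://localhost:8983/solr/techproducts/select?q=ipod&hl=true&hl.method=unified&hl.fl=name
+----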
+
+[[UpgradingSolr-Upgradingfrom5.5.x]]
+== Upgrading from 5.5.x
+
+* The deprecated `SolrServer` and subclasses have been removed, use <<using-solrj.adoc#using-solrj,`SolrClient`>> instead.
+* The deprecated `<nrtMode>` configuration in <<configuring-solrconfig-xml.adoc#configuring-solrconfig-xml,`solrconfig.xml`>> has been removed. Please remove it from `solrconfig.xml`.
+* `SolrClient.shutdown()` has been removed, use {solr-javadocs}/solr-solrj/org/apache/solr/client/solrj/SolrClient.html[`SolrClient.close()`] instead.
+* The deprecated `zkCredientialsProvider` element in `solrcloud` section of `solr.xml` is now removed. Use the correct spelling (<<zookeeper-access-control.adoc#zookeeper-access-control,`zkCredentialsProvider`>>) instead.
+* Internal/expert - `ResultContext` was significantly changed and expanded to allow for multiple full query results (`DocLists`) per Solr request. `TransformContext` was rendered redundant and was removed. See https://issues.apache.org/jira/browse/SOLR-7957[SOLR-7957] for details.
+* Several changes have been made regarding the "<<other-schema-elements.adoc#OtherSchemaElements-Similarity,`Similarity`>>" used in Solr, in order to provide better default behavior for new users. There are 3 key impacts of these changes on existing users who upgrade:
+** `DefaultSimilarityFactory` has been removed. If you currently have `DefaultSimilarityFactory` explicitly referenced in your `schema.xml`, edit your config to use the functionally identical `ClassicSimilarityFactory`. See https://issues.apache.org/jira/browse/SOLR-8239[SOLR-8239] for more details.
+** The implicit default Similarity used when no `<similarity/>` is configured in `schema.xml` has been changed to `SchemaSimilarityFactory`. Users who wish to preserve back-compatible behavior should either explicitly configure `ClassicSimilarityFactory`, or ensure that the `luceneMatchVersion` for the collection is less than 6.0. See https://issues.apache.org/jira/browse/SOLR-8270[SOLR-8270] + https://issues.apache.org/jira/browse/SOLR-8271[SOLR-8271] for details.
+** `SchemaSimilarityFactory` has been modified to use `BM25Similarity` as the default for `fieldTypes` that do not explicitly declare a Similarity. The legacy behavior of using `ClassicSimilarity` as the default will occur if the `luceneMatchVersion` for the collection is less than 6.0, or the `defaultSimFromFieldType` configuration option may be used to specify any default of your choosing. See https://issues.apache.org/jira/browse/SOLR-8261[SOLR-8261] + https://issues.apache.org/jira/browse/SOLR-8329[SOLR-8329] for more details.
+* If your `solrconfig.xml` file doesn't explicitly mention the `schemaFactory` to use, then Solr will choose the `ManagedIndexSchemaFactory` by default. Previously it would have chosen `ClassicIndexSchemaFactory`. This means that the Schema APIs (`/<collection>/schema`) are enabled and the schema is mutable. When Solr starts, your `schema.xml` file will be renamed to `managed-schema`. If you want to retain the old behaviour, please ensure that the `solrconfig.xml` explicitly uses the `ClassicIndexSchemaFactory` or that your `luceneMatchVersion` in the `solrconfig.xml` is less than 6.0. See the <<schema-factory-definition-in-solrconfig.adoc#schema-factory-definition-in-solrconfig,Schema Factory Definition in SolrConfig>> section for more details.
+* `SolrIndexSearcher.QueryCommand` and `QueryResult` were moved to their own classes. If you reference them in your code, you should import them under o.a.s.search (or use your IDE's "Organize Imports").
+* The '<<request-parameters-api.adoc#request-parameters-api,`useParams`>>' attribute specified in request handler cannot be overridden from request params. See https://issues.apache.org/jira/browse/SOLR-8698[SOLR-8698] for more details.
+* When requesting stats in date fields, "sum" is now returned as a double value instead of a date. See https://issues.apache.org/jira/browse/SOLR-8671[SOLR-8671] for more details.
+* The deprecated GET methods for schema are now accessible through the <<schema-api.adoc#schema-api,bulk API>>. These methods now accept fewer request parameters, and output less information. See https://issues.apache.org/jira/browse/SOLR-8736[SOLR-8736] for more details. Some of the removed functionality will likely be restored in a future version of Solr - see https://issues.apache.org/jira/browse/SOLR-8992[SOLR-8992].
+* In the past, Solr guaranteed that retrieval of multi-valued fields would preserve the order of values. Because values may now be retrieved from column-stored fields (`docValues="true"`), and <<docvalues.adoc#docvalues,DocValues>> do not currently preserve order, users should set <<defining-fields.adoc#defining-fields,`useDocValuesAsStored="false"`>> to prevent future optimizations from using the column-stored values over the row-stored values when fields have both `stored="true"` and `docValues="true"`.
+* <<working-with-dates.adoc#working-with-dates,Formatted date-times from Solr>> have some differences. If the year is more than 4 digits, there is a leading '+'. When there is a non-zero number of milliseconds, it is padded with zeros to 3 digits. Negative year (BC) dates are now possible. Parsing: it is now an error to supply a portion of the date out of its range, such as 67 seconds.
+* <<using-solrj.adoc#using-solrj,SolrJ>> no longer includes `DateUtil`. If for some reason you need to format or parse dates, simply use `Instant.format()` and `Instant.parse()`.
+* If you are using spatial4j, please upgrade to 0.6 and <<spatial-search.adoc#spatial-search,edit your `spatialContextFactory`>> to replace `com.spatial4j.core` with `org.locationtech.spatial4j` .
+
+[[UpgradingSolr-UpgradingfromOlderVersionsofSolr]]
+== Upgrading from Older Versions of Solr
+
+Users upgrading from older versions are strongly encouraged to consult {solr-javadocs}/changes/Changes.html[`CHANGES.txt`] for the details of _all_ changes since the version they are upgrading from.
+
+A summary of the significant changes between Solr 5.x and Solr 6.0 can be found in the <<major-changes-from-solr-5-to-solr-6.adoc#major-changes-from-solr-5-to-solr-6,Major Changes from Solr 5 to Solr 6>> section.

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/95968c69/solr/solr-ref-guide/src/uploading-data-with-index-handlers.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/uploading-data-with-index-handlers.adoc b/solr/solr-ref-guide/src/uploading-data-with-index-handlers.adoc
new file mode 100644
index 0000000..79ab2d1
--- /dev/null
+++ b/solr/solr-ref-guide/src/uploading-data-with-index-handlers.adoc
@@ -0,0 +1,546 @@
+= Uploading Data with Index Handlers
+:page-shortname: uploading-data-with-index-handlers
+:page-permalink: uploading-data-with-index-handlers.html
+:page-children: transforming-and-indexing-custom-json
+
+Index Handlers are Request Handlers designed to add, delete and update documents in the index. In addition to having plugins for importing rich documents <<uploading-data-with-solr-cell-using-apache-tika.adoc#uploading-data-with-solr-cell-using-apache-tika,using Tika>> or from structured data sources using the <<uploading-structured-data-store-data-with-the-data-import-handler.adoc#uploading-structured-data-store-data-with-the-data-import-handler,Data Import Handler>>, Solr natively supports indexing structured documents in XML, CSV and JSON.
+
+The recommended way to configure and use request handlers is with path-based names that map to paths in the request URL. However, request handlers can also be specified with the `qt` (query type) parameter if the <<requestdispatcher-in-solrconfig.adoc#requestdispatcher-in-solrconfig,`requestDispatcher`>> is appropriately configured. It is possible to access the same handler using more than one name, which can be useful if you wish to specify different sets of default options.
+
+A single unified update request handler supports XML, CSV, JSON, and javabin update requests, delegating to the appropriate `ContentStreamLoader` based on the `Content-Type` of the <<content-streams.adoc#content-streams,ContentStream>>.
+
+[[UploadingDatawithIndexHandlers-UpdateRequestHandlerConfiguration]]
+== UpdateRequestHandler Configuration
+
+The update request handler is already configured in the default configuration file:
+
+[source,xml]
+----
+<requestHandler name="/update" class="solr.UpdateRequestHandler" />
+----
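+
+As noted above, the same handler class can also be registered under an additional name with its own set of default options. A hypothetical second registration might look like this (the path and default value shown are illustrative only):
+
+[source,xml]
+----
+<!-- Illustrative second registration of the same handler with different defaults -->
+<requestHandler name="/update/quick" class="solr.UpdateRequestHandler">
+  <lst name="defaults">
+    <str name="commitWithin">10000</str>
+  </lst>
+</requestHandler>
+----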
+
+[[UploadingDatawithIndexHandlers-XMLFormattedIndexUpdates]]
+== XML Formatted Index Updates
+
+Index update commands can be sent as XML messages to the update handler using `Content-type: application/xml` or `Content-type: text/xml`.
+
+[[UploadingDatawithIndexHandlers-AddingDocuments]]
+=== Adding Documents
+
+The XML schema recognized by the update handler for adding documents is very straightforward:
+
+* The `<add>` element introduces one or more documents to be added.
+* The `<doc>` element introduces the fields making up a document.
+* The `<field>` element presents the content for a specific field.
+
+For example:
+
+[source,xml]
+----
+<add>
+  <doc>
+    <field name="authors">Patrick Eagar</field>
+    <field name="subject">Sports</field>
+    <field name="dd">796.35</field>
+    <field name="numpages">128</field>
+    <field name="desc"></field>
+    <field name="price">12.40</field>
+    <field name="title">Summer of the all-rounder: Test and championship cricket in England 1982</field>
+    <field name="isbn">0002166313</field>
+    <field name="yearpub">1982</field>
+    <field name="publisher">Collins</field>
+  </doc>
+  <doc>
+  ...
+  </doc>
+</add>
+----
+
+The add command supports these optional attributes:
+
+// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
+
+[cols="30,70",options="header"]
+|===
+|Optional Parameter |Parameter Description
+|commitWithin=_number_ |Add the document within the specified number of milliseconds
+|overwrite=_boolean_ |Default is true. Indicates if the unique key constraints should be checked to overwrite previous versions of the same document (see below)
+|===
+
+If the document schema defines a unique key, then by default an `/update` operation to add a document will overwrite (i.e., replace) any document in the index with the same unique key. If no unique key has been defined, indexing performance is somewhat faster, as no check needs to be made for existing documents to replace.
+
+If you have a unique key field, but you feel confident that you can safely bypass the uniqueness check (e.g., you build your indexes in batch, and your indexing code guarantees it never adds the same document more than once), you can specify the `overwrite="false"` option when adding your documents.
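+
+Combining the two optional attributes, a batch add that bypasses the uniqueness check and asks for a commit within ten seconds might look like this (a sketch; the values are illustrative):
+
+[source,xml]
+----
+<add commitWithin="10000" overwrite="false">
+  <doc>
+    <field name="isbn">0002166313</field>
+    <field name="title">Summer of the all-rounder</field>
+  </doc>
+</add>
+----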
+
+[[UploadingDatawithIndexHandlers-XMLUpdateCommands]]
+=== XML Update Commands
+
+[[UploadingDatawithIndexHandlers-CommitandOptimizeOperations]]
+==== Commit and Optimize Operations
+
+The `<commit>` operation writes all documents loaded since the last commit to one or more segment files on the disk. Before a commit has been issued, newly indexed content is not visible to searches. The commit operation opens a new searcher, and triggers any event listeners that have been configured.
+
+Commits may be issued explicitly with a `<commit/>` message, and can also be triggered from `<autocommit>` parameters in `solrconfig.xml`.
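+
+A sketch of such an `<autoCommit>` configuration in `solrconfig.xml` (the values shown are illustrative, not defaults):
+
+[source,xml]
+----
+<autoCommit>
+  <!-- commit automatically at most 15 seconds after the first uncommitted update -->
+  <maxTime>15000</maxTime>
+  <!-- make a durable commit without opening a new searcher -->
+  <openSearcher>false</openSearcher>
+</autoCommit>
+----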
+
+The `<optimize>` operation requests Solr to merge internal data structures in order to improve search performance. For a large index, optimization will take some time to complete, but by merging many small segment files into a larger one, search performance will improve. If you are using Solr's replication mechanism to distribute searches across many systems, be aware that after an optimize, a complete index will need to be transferred. In contrast, post-commit transfers are usually much smaller.
+
+The `<commit>` and `<optimize>` elements accept these optional attributes:
+
+// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
+
+[cols="30,70",options="header"]
+|===
+|Optional Attribute |Description
+|waitSearcher |Default is true. Blocks until a new searcher is opened and registered as the main query searcher, making the changes visible.
+|expungeDeletes |(commit only) Default is false. Merges segments that have more than 10% deleted docs, expunging them in the process.
+|maxSegments |(optimize only) Default is 1. Merges the segments down to no more than this number of segments.
+|===
+
+Here are examples of `<commit>` and `<optimize>` using optional attributes:
+
+[source,xml]
+----
+<commit waitSearcher="false"/>
+<commit waitSearcher="false" expungeDeletes="true"/>
+<optimize waitSearcher="false"/>
+----
+
+[[UploadingDatawithIndexHandlers-DeleteOperations]]
+==== Delete Operations
+
+Documents can be deleted from the index in two ways. "Delete by ID" deletes the document with the specified ID, and can be used only if a UniqueID field has been defined in the schema. "Delete by Query" deletes all documents matching a specified query, although `commitWithin` is ignored for a Delete by Query. A single delete message can contain multiple delete operations.
+
+[source,xml]
+----
+<delete>
+  <id>0002166313</id>
+  <id>0031745983</id>
+  <query>subject:sport</query>
+  <query>publisher:penguin</query>
+</delete>
+----
+
+[IMPORTANT]
+====
+
+When using the Join query parser in a Delete By Query, you should use the `score` parameter with a value of `none` to avoid a `ClassCastException`. See the section on the <<other-parsers.adoc#other-parsers,Join Query Parser>> for more details on the `score` parameter.
+
+====
+
+[[UploadingDatawithIndexHandlers-RollbackOperations]]
+==== Rollback Operations
+
+The rollback command rolls back all adds and deletes made to the index since the last commit. It neither calls any event listeners nor creates a new searcher. Its syntax is simple: `<rollback/>`.
+
+[[UploadingDatawithIndexHandlers-UsingcurltoPerformUpdates]]
+=== Using `curl` to Perform Updates
+
+You can use the `curl` utility to perform any of the above commands, using its `--data-binary` option to append the XML message to the `curl` command, generating an HTTP POST request. For example:
+
+[source,bash]
+----
+curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary '
+<add>
+  <doc>
+    <field name="authors">Patrick Eagar</field>
+    <field name="subject">Sports</field>
+    <field name="dd">796.35</field>
+    <field name="isbn">0002166313</field>
+    <field name="yearpub">1982</field>
+    <field name="publisher">Collins</field>
+  </doc>
+</add>'
+----
+
+For posting XML messages contained in a file, you can use the alternative form:
+
+[source,bash]
+----
+curl http://localhost:8983/solr/my_collection/update -H "Content-Type: text/xml" --data-binary @myfile.xml
+----
+
+Short requests can also be sent using an HTTP GET command, URL-encoding the request, as in the following. Note the escaping of "<" and ">":
+
+[source,bash]
+----
+curl http://localhost:8983/solr/my_collection/update?stream.body=%3Ccommit/%3E
+----
+
+Responses from Solr take the form shown here:
+
+[source,xml]
+----
+<response>
+  <lst name="responseHeader">
+    <int name="status">0</int>
+    <int name="QTime">127</int>
+  </lst>
+</response>
+----
+
+The status field will be non-zero in case of failure.
+
+[[UploadingDatawithIndexHandlers-UsingXSLTtoTransformXMLIndexUpdates]]
+=== Using XSLT to Transform XML Index Updates
+
+The UpdateRequestHandler allows you to index any arbitrary XML by using the `tr` parameter to apply an https://en.wikipedia.org/wiki/XSLT[XSL transformation]. You must have an XSLT stylesheet in the `conf/xslt` directory of your <<config-sets.adoc#config-sets,config set>> that can transform the incoming data to the expected `<add><doc/></add>` format, and use the `tr` parameter to specify the name of that stylesheet.
+
+Here is an example XSLT stylesheet:
+
+[source,xml]
+----
+<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
+  <xsl:output media-type="text/xml" method="xml" indent="yes"/>
+  <xsl:template match='/'>
+    <add>
+      <xsl:apply-templates select="response/result/doc"/>
+    </add>
+  </xsl:template>
+  <!-- Ignore score (makes no sense to index) -->
+  <xsl:template match="doc/*[@name='score']" priority="100"></xsl:template>
+  <xsl:template match="doc">
+    <xsl:variable name="pos" select="position()"/>
+    <doc>
+      <xsl:apply-templates>
+        <xsl:with-param name="pos"><xsl:value-of select="$pos"/></xsl:with-param>
+      </xsl:apply-templates>
+    </doc>
+  </xsl:template>
+  <!-- Flatten arrays to duplicate field lines -->
+  <xsl:template match="doc/arr" priority="100">
+    <xsl:variable name="fn" select="@name"/>
+    <xsl:for-each select="*">
+      <xsl:element name="field">
+        <xsl:attribute name="name"><xsl:value-of select="$fn"/></xsl:attribute>
+        <xsl:value-of select="."/>
+      </xsl:element>
+    </xsl:for-each>
+  </xsl:template>
+  <xsl:template match="doc/*">
+    <xsl:variable name="fn" select="@name"/>
+      <xsl:element name="field">
+        <xsl:attribute name="name"><xsl:value-of select="$fn"/></xsl:attribute>
+      <xsl:value-of select="."/>
+    </xsl:element>
+  </xsl:template>
+  <xsl:template match="*"/>
+</xsl:stylesheet>
+----
+
+This stylesheet transforms Solr's XML search result format into Solr's Update XML syntax. One example usage would be to copy a Solr 1.3 index (which does not have the CSV response writer) into a format which can be indexed into another Solr server (provided that all fields are stored):
+
+[source,plain]
+----
+http://localhost:8983/solr/my_collection/select?q=*:*&wt=xslt&tr=updateXml.xsl&rows=1000
+----
+
+You can also use the stylesheet in `XsltUpdateRequestHandler` to transform an index when updating:
+
+[source,bash]
+----
+curl "http://localhost:8983/solr/my_collection/update?commit=true&tr=updateXml.xsl" -H "Content-Type: text/xml" --data-binary @myexporteddata.xml
+----
+
+[[UploadingDatawithIndexHandlers-JSONFormattedIndexUpdates]]
+== JSON Formatted Index Updates
+
+Solr can accept JSON that conforms to a defined structure, or can accept arbitrary JSON-formatted documents. If sending arbitrarily formatted JSON, there are some additional parameters that need to be sent with the update request, described below in the section <<transforming-and-indexing-custom-json.adoc#transforming-and-indexing-custom-json,Transforming and Indexing Custom JSON>>.
+
+[[UploadingDatawithIndexHandlers-Solr-StyleJSON]]
+=== Solr-Style JSON
+
+JSON formatted update requests may be sent to Solr's `/update` handler using `Content-Type: application/json` or `Content-Type: text/json`.
+
+JSON formatted updates can take three basic forms, described in depth below:
+
+* <<UploadingDatawithIndexHandlers-AddingaSingleJSONDocument,A single document to add>>, expressed as a top level JSON Object. To differentiate this from a set of commands, the `json.command=false` request parameter is required.
+* <<UploadingDatawithIndexHandlers-AddingMultipleJSONDocuments,A list of documents to add>>, expressed as a top level JSON Array containing a JSON Object per document.
+* <<UploadingDatawithIndexHandlers-SendingJSONUpdateCommands,A sequence of update commands>>, expressed as a top level JSON Object (aka: Map).
+
+[[UploadingDatawithIndexHandlers-AddingaSingleJSONDocument]]
+==== Adding a Single JSON Document
+
+The simplest way to add Documents via JSON is to send each document individually as a JSON Object, using the `/update/json/docs` path:
+
+[source,bash]
+----
+curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update/json/docs' --data-binary '
+{
+  "id": "1",
+  "title": "Doc 1"
+}'
+----
+
+[[UploadingDatawithIndexHandlers-AddingMultipleJSONDocuments]]
+==== Adding Multiple JSON Documents
+
+Adding multiple documents at one time via JSON can be done via a JSON Array of JSON Objects, where each object represents a document:
+
+[source,bash]
+----
+curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
+[
+  {
+    "id": "1",
+    "title": "Doc 1"
+  },
+  {
+    "id": "2",
+    "title": "Doc 2"
+  }
+]'
+----
+
+A sample JSON file is provided at `example/exampledocs/books.json` and contains an array of objects that you can add to the Solr `techproducts` example:
+
+[source,bash]
+----
+curl 'http://localhost:8983/solr/techproducts/update?commit=true' --data-binary @example/exampledocs/books.json -H 'Content-type:application/json'
+----
+
+[[UploadingDatawithIndexHandlers-SendingJSONUpdateCommands]]
+==== Sending JSON Update Commands
+
+In general, the JSON update syntax supports all of the update commands that the XML update handler supports, through a straightforward mapping. Multiple commands, adding and deleting documents, may be contained in one message:
+
+[source,bash,subs="verbatim,callouts"]
+----
+curl -X POST -H 'Content-Type: application/json' 'http://localhost:8983/solr/my_collection/update' --data-binary '
+{
+  "add": {
+    "doc": {
+      "id": "DOC1",
+      "my_field": 2.3,
+      "my_multivalued_field": [ "aaa", "bbb" ]   --<1>
+    }
+  },
+  "add": {
+    "commitWithin": 5000, --<2>
+    "overwrite": false,  --<3>
+    "doc": {
+      "f1": "v1", --<4>
+      "f1": "v2"
+    }
+  },
+
+  "commit": {},
+  "optimize": { "waitSearcher":false },
+
+  "delete": { "id":"ID" },  --<5>
+  "delete": { "query":"QUERY" } --<6>
+}'
+----
+
+<1> Can use an array for a multi-valued field
+<2> Commit this document within 5 seconds
+<3> Don't check for existing documents with the same uniqueKey
+<4> Can use repeated keys for a multi-valued field
+<5> Delete by ID (uniqueKey field)
+<6> Delete by Query
+
+As with other update handlers, parameters such as `commit`, `commitWithin`, `optimize`, and `overwrite` may be specified in the URL instead of in the body of the message.
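+
+For example, the `commitWithin` shown in the message body above could instead be supplied on the URL (illustrative):
+
+[source,plain]
+----
+http://localhost:8983/solr/my_collection/update?commitWithin=5000
+----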
+
+The JSON update format allows for a simple delete-by-id. The value of a `delete` can be an array which contains a list of zero or more specific document IDs (not a range) to be deleted. For example, a single document:
+
+[source,json]
+----
+{ "delete":"myid" }
+----
+
+Or a list of document IDs:
+
+[source,json]
+----
+{ "delete":["id1","id2"] }
+----
+
+
+You can also specify `\_version_` with each "delete":
+
+[source,json]
+----
+{
+  "delete":{"id":50, "_version_":12345}
+}
+----
+
+You can specify the version of deletes in the body of the update request as well.
+
+[[UploadingDatawithIndexHandlers-JSONUpdateConveniencePaths]]
+=== JSON Update Convenience Paths
+
+In addition to the `/update` handler, there are a few additional JSON-specific request handler paths available by default in Solr that implicitly override the behavior of some request parameters:
+
+[width="100%",options="header",]
+|===
+|Path |Default Parameters
+|`/update/json` |`stream.contentType=application/json`
+|`/update/json/docs` a|
+`stream.contentType=application/json`
+
+`json.command=false`
+
+|===
+
+The `/update/json` path may be useful for clients sending in JSON formatted update commands from applications where setting the Content-Type proves difficult, while the `/update/json/docs` path can be particularly convenient for clients that always want to send in documents – either individually or as a list – without needing to worry about the full JSON command syntax.
+
+[[UploadingDatawithIndexHandlers-CustomJSONDocuments]]
+=== Custom JSON Documents
+
+Solr can support custom JSON. This is covered in the section <<transforming-and-indexing-custom-json.adoc#transforming-and-indexing-custom-json,Transforming and Indexing Custom JSON>>.
+
+
+[[UploadingDatawithIndexHandlers-CSVFormattedIndexUpdates]]
+== CSV Formatted Index Updates
+
+CSV formatted update requests may be sent to Solr's `/update` handler using `Content-Type: application/csv` or `Content-Type: text/csv`.
+
+A sample CSV file is provided at `example/exampledocs/books.csv` that you can use to add some documents to the Solr `techproducts` example:
+
+[source,bash]
+----
+curl 'http://localhost:8983/solr/my_collection/update?commit=true' --data-binary @example/exampledocs/books.csv -H 'Content-type:application/csv'
+----
+
+[[UploadingDatawithIndexHandlers-CSVUpdateParameters]]
+=== CSV Update Parameters
+
+The CSV handler allows many parameters to be specified in the URL, either globally (`_parameter_=_value_`) or per field (`f._fieldname_._parameter_=_value_`).
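+
+For example, the global and per-field forms might be combined like this (the field name is illustrative; `%09` is an encoded tab and `%2C` an encoded comma):
+
+[source,plain]
+----
+http://localhost:8983/solr/my_collection/update?commit=true&separator=%09&f.tags.split=true&f.tags.separator=%2C
+----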
+
+The table below describes the parameters for the update handler.
+
+// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
+
+[cols="20,40,20,20",options="header"]
+|===
+|Parameter |Usage |Global (g) or Per Field (f) |Example
+|separator |Character used as field separator; default is "," |g,(f: see split) |separator=%09
+|trim |If true, remove leading and trailing whitespace from values. Default=false. |g,f |f.isbn.trim=true trim=false
+|header |Set to true if first line of input contains field names. These will be used if the `fieldnames` parameter is absent. |g |
+|fieldnames |Comma separated list of field names to use when adding documents. |g |fieldnames=isbn,price,title
+|literal.<field_name> |A literal value for a specified field name. |g |literal.color=red
+|skip |Comma separated list of field names to skip. |g |skip=uninteresting,shoesize
+|skipLines |Number of lines to discard in the input stream before the CSV data starts, including the header, if present. Default=0. |g |skipLines=5
+|encapsulator |The character optionally used to surround values to preserve characters such as the CSV separator or whitespace. This standard CSV format handles the encapsulator itself appearing in an encapsulated value by doubling the encapsulator. |g,(f: see split) |encapsulator="
+|escape |The character used for escaping CSV separators or other reserved characters. If an escape is specified, the encapsulator is not used unless also explicitly specified, since most formats use either encapsulation or escaping, not both. |g |escape=\
+|keepEmpty |Keep and index zero length (empty) fields. Default=false. |g,f |f.price.keepEmpty=true
+|map |Map one value to another. Format is value:replacement (which can be empty.) |g,f |map=left:right f.subject.map=history:bunk
+|split |If true, split a field into multiple values by a separate parser. |f |
+|overwrite |If true (the default), check for and overwrite duplicate documents, based on the uniqueKey field declared in the Solr schema. If you know the documents you are indexing do not contain any duplicates then you may see a considerable speed up setting this to false. |g |
+|commit |Issues a commit after the data has been ingested. |g |
+|commitWithin |Add the document within the specified number of milliseconds. |g |commitWithin=10000
+|rowid |Map the rowid (line number) to a field specified by the value of the parameter, for instance if your CSV doesn't have a unique key and you want to use the row id as such. |g |rowid=id
+|rowidOffset |Add the given offset (as an int) to the rowid before adding it to the document. Default is 0. |g |rowidOffset=10
+|===
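
To make the per-field options concrete, here is a small client-side simulation (a hypothetical helper, not Solr code) of how the `trim`, `split`, `map`, and `keepEmpty` options transform a single raw CSV value:

```python
def apply_csv_field_options(value, trim=False, split=None, value_map=None,
                            keep_empty=False):
    """Simulate the per-field CSV options described above (illustrative only).
    split: separator character used to break one value into many values.
    value_map: dict of value -> replacement, like map=left:right."""
    values = value.split(split) if split else [value]
    if trim:
        # trim=true: remove leading and trailing whitespace
        values = [v.strip() for v in values]
    if value_map:
        # map: substitute matching values with their replacements
        values = [value_map.get(v, v) for v in values]
    if not keep_empty:
        # keepEmpty=false (the default): drop zero-length values
        values = [v for v in values if v != ""]
    return values

print(apply_csv_field_options(" history; bunk ;", trim=True, split=";",
                              value_map={"history": "the past"}))
# A value like " history; bunk ;" becomes ['the past', 'bunk']
```

This mirrors a request such as `f.subject.split=true&f.subject.separator=%3B&f.subject.trim=true&f.subject.map=history:the+past`, where each option applies only to the `subject` field.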
+
+[[UploadingDatawithIndexHandlers-IndexingTab-Delimitedfiles]]
+=== Indexing Tab-Delimited files
+
+The same feature used to index CSV documents can also be easily used to index tab-delimited files (TSV files) and even handle backslash escaping rather than CSV encapsulation.
+
+For example, one can dump a MySQL table to a tab delimited file with:
+
+[source,sql]
+----
+SELECT * INTO OUTFILE '/tmp/result.txt' FROM mytable;
+----
+
+This file could then be imported into Solr by setting the `separator` to tab (%09) and the `escape` to backslash (%5c).
+
+[source,bash]
+----
+curl 'http://localhost:8983/solr/my_collection/update/csv?commit=true&separator=%09&escape=%5c' --data-binary @/tmp/result.txt
+----
+
+[[UploadingDatawithIndexHandlers-CSVUpdateConveniencePaths]]
+=== CSV Update Convenience Paths
+
+In addition to the `/update` handler, there is an additional CSV-specific request handler path available by default in Solr that implicitly overrides the behavior of some request parameters:
+
+[cols=",",options="header",]
+|===
+|Path |Default Parameters
+|`/update/csv` |`stream.contentType=application/csv`
+|===
+
+The `/update/csv` path may be useful for clients sending in CSV formatted update commands from applications where setting the Content-Type proves difficult.
+
+[[UploadingDatawithIndexHandlers-NestedChildDocuments]]
+== Nested Child Documents
+
+Solr indexes nested documents in blocks as a way to model documents containing other documents, such as a blog post parent document and comments as child documents -- or products as parent documents and sizes, colors, or other variations as child documents. At query time, the <<other-parsers.adoc#OtherParsers-BlockJoinQueryParsers,Block Join Query Parsers>> can search these relationships. In terms of performance, indexing the relationships between documents may be more efficient than attempting to do joins only at query time, since the relationships are already stored in the index and do not need to be computed.
+
+Nested documents may be indexed via either the XML or JSON data syntax (or using <<using-solrj.adoc#using-solrj,SolrJ>>) - but regardless of syntax, you must include a field that identifies the parent document as a parent; it can be any field that suits this purpose, and it will be used as input for the <<other-parsers.adoc#OtherParsers-BlockJoinQueryParsers,block join query parsers>>.
+
+To support nested documents, the schema must include an indexed/non-stored field `\_root_`. The value of that field is populated automatically and is the same for all documents in the block, regardless of the inheritance depth.
+
+[[UploadingDatawithIndexHandlers-XMLExamples]]
+=== XML Examples
+
+For example, here are two documents and their child documents:
+
+[source,xml]
+----
+<add>
+  <doc>
+  <field name="id">1</field>
+  <field name="title">Solr adds block join support</field>
+  <field name="content_type">parentDocument</field>
+    <doc>
+      <field name="id">2</field>
+      <field name="comments">SolrCloud supports it too!</field>
+    </doc>
+  </doc>
+  <doc>
+    <field name="id">3</field>
+    <field name="title">New Lucene and Solr release is out</field>
+    <field name="content_type">parentDocument</field>
+    <doc>
+      <field name="id">4</field>
+      <field name="comments">Lots of new features</field>
+    </doc>
+  </doc>
+</add>
+----
+
+In this example, we have indexed the parent documents with the field `content_type`, which has the value "parentDocument". We could have also used a boolean field, such as `isParent`, with a value of "true", or any other similar approach.
+
+[[UploadingDatawithIndexHandlers-JSONExamples]]
+=== JSON Examples
+
+This example is equivalent to the XML example above; note the special `\_childDocuments_` key needed to indicate the nested documents in JSON.
+
+[source,json]
+----
+[
+  {
+    "id": "1",
+    "title": "Solr adds block join support",
+    "content_type": "parentDocument",
+    "_childDocuments_": [
+      {
+        "id": "2",
+        "comments": "SolrCloud supports it too!"
+      }
+    ]
+  },
+  {
+    "id": "3",
+    "title": "New Lucene and Solr release is out",
+    "content_type": "parentDocument",
+    "_childDocuments_": [
+      {
+        "id": "4",
+        "comments": "Lots of new features"
+      }
+    ]
+  }
+]
+----
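
A payload like this can also be assembled programmatically. The sketch below (plain Python with a hypothetical helper; any HTTP client could then POST the result to the update handler) builds the same two blocks and serializes them:

```python
import json

def parent_doc(doc_id, title, children):
    """Build one block-join parent with its children under _childDocuments_."""
    return {
        "id": doc_id,
        "title": title,
        "content_type": "parentDocument",  # the field marking this doc as a parent
        "_childDocuments_": children,
    }

payload = json.dumps([
    parent_doc("1", "Solr adds block join support",
               [{"id": "2", "comments": "SolrCloud supports it too!"}]),
    parent_doc("3", "New Lucene and Solr release is out",
               [{"id": "4", "comments": "Lots of new features"}]),
], indent=2)
print(payload)
```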
+
+[NOTE]
+====
+One limitation of indexing nested documents is that the whole block of parent-children documents must be updated together whenever any changes are required. In other words, even if a single child document or the parent document is changed, the whole block of parent-child documents must be indexed together.
+====

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/95968c69/solr/solr-ref-guide/src/uploading-data-with-solr-cell-using-apache-tika.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/uploading-data-with-solr-cell-using-apache-tika.adoc b/solr/solr-ref-guide/src/uploading-data-with-solr-cell-using-apache-tika.adoc
new file mode 100644
index 0000000..0247e98
--- /dev/null
+++ b/solr/solr-ref-guide/src/uploading-data-with-solr-cell-using-apache-tika.adoc
@@ -0,0 +1,345 @@
+= Uploading Data with Solr Cell using Apache Tika
+:page-shortname: uploading-data-with-solr-cell-using-apache-tika
+:page-permalink: uploading-data-with-solr-cell-using-apache-tika.html
+
+Solr uses code from the http://tika.apache.org/[Apache Tika] project to provide a framework for incorporating many different file-format parsers such as https://pdfbox.apache.org/[Apache PDFBox] and http://poi.apache.org/index.html[Apache POI] into Solr itself. Working with this framework, Solr's `ExtractingRequestHandler` can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing.
+
+When this framework was under development, it was called the Solr Content Extraction Library or CEL; from that abbreviation came this framework's name: Solr Cell.
+
+If you want to supply your own `ContentHandler` for Solr to use, you can extend the `ExtractingRequestHandler` and override the `createFactory()` method. This factory is responsible for constructing the `SolrContentHandler` that interacts with Tika, and allows literals to override Tika-parsed values. Set the parameter `literalsOverride`, which normally defaults to `true`, to `false` to append Tika-parsed values to literal values.
+
+For more information on Solr's Extracting Request Handler, see https://wiki.apache.org/solr/ExtractingRequestHandler.
+
+[[UploadingDatawithSolrCellusingApacheTika-KeyConcepts]]
+== Key Concepts
+
+When using the Solr Cell framework, it is helpful to keep the following in mind:
+
+* Tika will automatically attempt to determine the input document type (Word, PDF, HTML) and extract the content appropriately. If you like, you can explicitly specify a MIME type for Tika with the `stream.type` parameter.
+* Tika works by producing an XHTML stream that it feeds to a SAX ContentHandler. SAX is a common interface implemented for many different XML parsers. For more information, see http://www.saxproject.org/quickstart.html.
+* Solr then responds to Tika's SAX events and creates the fields to index.
+* Tika produces metadata such as Title, Subject, and Author according to specifications such as Dublin Core. See http://tika.apache.org/1.7/formats.html for the file types supported.
+* Tika adds all the extracted text to the `content` field.
+* You can map Tika's metadata fields to Solr fields.
+* You can pass in literals for field values. Literals will override Tika-parsed values, including fields in the Tika metadata object, the Tika content field, and any "captured content" fields.
+* You can apply an XPath expression to the Tika XHTML to restrict the content that is produced.
+
+[TIP]
+====
+
+While Apache Tika is quite powerful, it is not perfect and fails on some files. PDF files are particularly problematic, mostly due to the PDF format itself. In case of a failure processing any file, the `ExtractingRequestHandler` does not have a secondary mechanism to try to extract some text from the file; it will throw an exception and fail.
+
+====
+
+[[UploadingDatawithSolrCellusingApacheTika-TryingoutTikawiththeSolrtechproductsExample]]
+== Trying out Tika with the Solr `techproducts` Example
+
+You can try out the Tika framework using the `techproducts` example included in Solr.
+
+Start the example:
+
+[source,bash]
+----
+bin/solr -e techproducts
+----
+
+You can now use curl to send a sample PDF file via HTTP POST:
+
+[source,bash]
+----
+curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' -F "myfile=@example/exampledocs/solr-word.pdf"
+----
+
+The URL above calls the Extracting Request Handler, uploads the file `solr-word.pdf` and assigns it the unique ID `doc1`. Here's a closer look at the components of this command:
+
+* The `literal.id=doc1` parameter provides the necessary unique ID for the document being indexed.
+
+* The `commit=true` parameter causes Solr to perform a commit after indexing the document, making it immediately searchable. For optimum performance when loading many documents, don't call the commit command until you are done.
+
+* The `-F` flag instructs curl to POST data using the Content-Type `multipart/form-data` and supports the uploading of binary files. The @ symbol instructs curl to upload the attached file.
+
+* The argument `myfile=@example/exampledocs/solr-word.pdf` uploads the sample file. This needs a valid path, which can be absolute or relative.
+
+You can also use `bin/post` to send a PDF file into Solr (without the params, the `literal.id` parameter would be set to the absolute path to the file):
+
+[source,bash]
+----
+bin/post -c techproducts example/exampledocs/solr-word.pdf -params "literal.id=a"
+----
+
+Now you should be able to execute a query and find that document. You can make a request like `\http://localhost:8983/solr/techproducts/select?q=pdf`.
+
+You may notice that although the content of the sample document has been indexed and stored, there are not a lot of metadata fields associated with this document. This is because unknown fields are ignored according to the default parameters configured for the `/update/extract` handler in `solrconfig.xml`, and this behavior can be easily changed or overridden. For example, to store and see all metadata and content, execute the following:
+
+[source,bash]
+----
+bin/post -c techproducts example/exampledocs/solr-word.pdf -params "literal.id=doc1&uprefix=attr_"
+----
+
+In this command, the `uprefix=attr_` parameter causes all generated fields that aren't defined in the schema to be prefixed with `attr_`, which is a dynamic field that is stored and indexed.
+
+This command allows you to query the document using an attribute, as in: `\http://localhost:8983/solr/techproducts/select?q=attr_meta:microsoft`.
+
+[[UploadingDatawithSolrCellusingApacheTika-InputParameters]]
+== Input Parameters
+
+The table below describes the parameters accepted by the Extracting Request Handler.
+
+// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
+
+[cols="30,70",options="header"]
+|===
+|Parameter |Description
+|capture |Captures XHTML elements with the specified name for a supplementary addition to the Solr document. This parameter can be useful for copying chunks of the XHTML into a separate field. For instance, it could be used to grab paragraphs (`<p>`) and index them into a separate field. Note that content is still also captured into the overall "content" field.
+|captureAttr |Indexes attributes of the Tika XHTML elements into separate fields, named after the element. If set to true, for example, when extracting from HTML, Tika can return the `href` attributes in `<a>` tags as fields named "a". See the examples below.
+|commitWithin |Add the document within the specified number of milliseconds.
+|date.formats |Defines the date format patterns to identify in the documents.
+|defaultField |If the uprefix parameter (see below) is not specified and a field cannot be determined, the default field will be used.
+|extractOnly |Default is false. If true, returns the extracted content from Tika without indexing the document. This literally includes the extracted XHTML as a string in the response. When viewing manually, it may be useful to use a response format other than XML to aid in viewing the embedded XHTML tags. For an example, see http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.
+|extractFormat |Default is "xml", but the other option is "text". Controls the serialization format of the extract content. The xml format is actually XHTML, the same format that results from passing the `-x` command to the Tika command line application, while the text format is like that produced by Tika's `-t` command. This parameter is valid only if `extractOnly` is set to true.
+|fmap.<__source_field__> |Maps (moves) one field name to another. The `source_field` must be a field in incoming documents, and the value is the Solr field to map to. Example: `fmap.content=text` causes the data in the `content` field generated by Tika to be moved to Solr's `text` field.
+|ignoreTikaException |If true, exceptions found during processing will be skipped. Any metadata available, however, will be indexed.
+|literal.<__fieldname__> |Populates a field with the name supplied with the specified value for each document. The data can be multivalued if the field is multivalued.
+|literalsOverride |If true (the default), literal field values will override other values with the same field name. If false, literal values defined with `literal.<__fieldname__>` will be appended to data already in the fields extracted from Tika. If setting `literalsOverride` to "false", the field must be multivalued.
+|lowernames |Values are "true" or "false". If true, all field names will be mapped to lowercase with underscores, if needed. For example, "Content-Type" would be mapped to "content_type."
+|multipartUploadLimitInKB |Useful if uploading very large documents, this defines the KB size of documents to allow.
+|passwordsFile |Defines a file path and name for a file of filename-to-password mappings.
+|resource.name |Specifies the optional name of the file. Tika can use it as a hint for detecting a file's MIME type.
+|resource.password |Defines a password to use for a password-protected PDF or OOXML file.
+|tika.config |Defines a file path and name to a customized Tika configuration file. This is only required if you have customized your Tika implementation.
+|uprefix |Prefixes all fields that are not defined in the schema with the given prefix. This is very useful when combined with dynamic field definitions. Example: `uprefix=ignored_` would effectively ignore all unknown fields generated by Tika given the example schema contains `<dynamicField name="ignored_*" type="ignored"/>`
+|xpath |When extracting, only return Tika XHTML content that satisfies the given XPath expression. See http://tika.apache.org/1.7/index.html for details on the format of Tika XHTML. See also http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput.
+|===
+
+[[UploadingDatawithSolrCellusingApacheTika-OrderofOperations]]
+== Order of Operations
+
+Here is the order in which the Solr Cell framework, using the Extracting Request Handler and Tika, processes its input.
+
+1.  Tika generates fields or passes them in as literals specified by `literal.<fieldname>=<value>`. If `literalsOverride=false`, literals will be appended as multi-value to the Tika-generated field.
+2.  If `lowernames=true`, Tika maps fields to lowercase.
+3.  Tika applies the mapping rules specified by `fmap.__source__=__target__` parameters.
+4.  If `uprefix` is specified, any unknown field names are prefixed with that value; otherwise, if `defaultField` is specified, any unknown fields are copied to the default field.
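
As a rough illustration (a simplified Python sketch, not Solr's actual implementation; the helper name and behavior details are assumptions), the four steps above might be expressed as:

```python
def solr_cell_field_mapping(tika_fields, literals, schema_fields,
                            lowernames=False, fmap=None, uprefix=None,
                            default_field=None, literals_override=True):
    """Illustrative sketch of the field-mapping order described above."""
    fmap = fmap or {}
    fields = {}
    # 1. Start from Tika-generated fields, then apply literal.<fieldname> values.
    for name, value in tika_fields.items():
        fields[name] = list(value) if isinstance(value, list) else [value]
    for name, value in literals.items():
        if literals_override or name not in fields:
            fields[name] = [value]
        else:
            fields[name].append(value)  # literalsOverride=false: append as multi-value
    # 2. lowernames=true: lowercase names, mapping non-alphanumerics to underscores.
    if lowernames:
        fields = {"".join(c if c.isalnum() else "_" for c in n.lower()): v
                  for n, v in fields.items()}
    # 3. Apply fmap.source=target renames.
    for src, dst in fmap.items():
        if src in fields:
            fields[dst] = fields.pop(src)
    # 4. Unknown fields: prefix with uprefix, else copy to defaultField.
    for name in list(fields):
        if name in schema_fields:
            continue
        if uprefix:
            fields[uprefix + name] = fields.pop(name)
        elif default_field:
            fields.setdefault(default_field, []).extend(fields.pop(name))
    return fields

print(solr_cell_field_mapping(
    {"Content-Type": "application/pdf", "content": "Hello"},
    literals={"id": "doc1"},
    schema_fields={"id", "content_type", "text"},
    lowernames=True, fmap={"content": "text"}, uprefix="attr_"))
```

For example, with `lowernames=true` the Tika field `Content-Type` becomes `content_type`; with `fmap.content=text` the extracted body moves to `text`; and any remaining unknown fields would receive the `attr_` prefix.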
+
+[[UploadingDatawithSolrCellusingApacheTika-ConfiguringtheSolrExtractingRequestHandler]]
+== Configuring the Solr `ExtractingRequestHandler`
+
+If you are not working with the supplied `sample_techproducts_configs` or `data_driven_schema_configs` <<config-sets.adoc#config-sets,config set>>, you must configure your own `solrconfig.xml` to know about the jars containing the `ExtractingRequestHandler` and its dependencies:
+
+[source,xml]
+----
+  <lib dir="${solr.install.dir:../../..}/contrib/extraction/lib" regex=".*\.jar" />
+  <lib dir="${solr.install.dir:../../..}/dist/" regex="solr-cell-\d.*\.jar" />
+----
+
+You can then configure the `ExtractingRequestHandler` in `solrconfig.xml`.
+
+[source,xml]
+----
+<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
+  <lst name="defaults">
+    <str name="fmap.Last-Modified">last_modified</str>
+    <str name="uprefix">ignored_</str>
+  </lst>
+  <!--Optional.  Specify a path to a tika configuration file. See the Tika docs for details.-->
+  <str name="tika.config">/my/path/to/tika.config</str>
+  <!-- Optional. Specify one or more date formats to parse. See DateUtil.DEFAULT_DATE_FORMATS
+       for default date formats -->
+  <lst name="date.formats">
+    <str>yyyy-MM-dd</str>
+  </lst>
+  <!-- Optional. Specify an external file containing parser-specific properties.
+       This file is located in the same directory as solrconfig.xml by default.-->
+  <str name="parseContext.config">parseContext.xml</str>
+</requestHandler>
+----
+
+In the defaults section, we are mapping Tika's Last-Modified metadata attribute to a field named `last_modified`. We are also telling it to ignore undeclared fields. These default parameters can be overridden at request time.
+
+The `tika.config` entry points to a file containing a Tika configuration. The `date.formats` entry allows you to specify various `java.text.SimpleDateFormat` date formats for use in transforming extracted input to a Date. Solr comes configured with the following date formats (see `DateUtil` in Solr):
+
+* `yyyy-MM-dd'T'HH:mm:ss'Z'`
+* `yyyy-MM-dd'T'HH:mm:ss`
+* `yyyy-MM-dd`
+* `yyyy-MM-dd hh:mm:ss`
+* `yyyy-MM-dd HH:mm:ss`
+* `EEE MMM d hh:mm:ss z yyyy`
+* `EEE, dd MMM yyyy HH:mm:ss zzz`
+* `EEEE, dd-MMM-yy HH:mm:ss zzz`
+* `EEE MMM d HH:mm:ss yyyy`
+
+You may also need to adjust the `multipartUploadLimitInKB` attribute as follows if you are submitting very large documents.
+
+[source,xml]
+----
+<requestDispatcher handleSelect="true" >
+  <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="20480" />
+  ...
+----
+
+[[UploadingDatawithSolrCellusingApacheTika-Parserspecificproperties]]
+=== Parser-Specific Properties
+
+Parsers used by Tika may have specific properties to govern how data is extracted. For instance, when using the Tika library from a Java program, the `PDFParserConfig` class has a method `setSortByPosition(boolean)` that can extract vertically oriented text. To access that method via configuration with the `ExtractingRequestHandler`, one can add the `parseContext.config` property to the `solrconfig.xml` file (see above) and then set properties in Tika's `PDFParserConfig` as below. Consult the Tika Java API documentation for configuration parameters that can be set for any particular parsers that require this level of control.
+
+[source,xml]
+----
+<entries>
+  <entry class="org.apache.tika.parser.pdf.PDFParserConfig" impl="org.apache.tika.parser.pdf.PDFParserConfig">
+    <property name="extractInlineImages" value="true"/>
+    <property name="sortByPosition" value="true"/>
+  </entry>
+  <entry>...</entry>
+</entries>
+----
+
+[[UploadingDatawithSolrCellusingApacheTika-Multi-CoreConfiguration]]
+=== Multi-Core Configuration
+
+For a multi-core configuration, you can specify `sharedLib='lib'` in the `<solr/>` section of `solr.xml` and place the necessary jar files there.
+
+For more information about Solr cores, see <<the-well-configured-solr-instance.adoc#the-well-configured-solr-instance,The Well-Configured Solr Instance>>.
+
+[[UploadingDatawithSolrCellusingApacheTika-IndexingEncryptedDocumentswiththeExtractingUpdateRequestHandler]]
+== Indexing Encrypted Documents with the ExtractingUpdateRequestHandler
+
+The ExtractingRequestHandler will decrypt encrypted files and index their content if you supply a password in either `resource.password` on the request, or in a `passwordsFile` file.
+
+In the case of `passwordsFile`, the file supplied must be formatted so there is one line per rule. Each rule contains a file name regular expression, followed by "=", then the password in clear-text. Because the passwords are in clear-text, the file should have strict access restrictions.
+
+[source,plain]
+----
+# This is a comment
+myFileName = myPassword
+.*\.docx$ = myWordPassword
+.*\.pdf$ = myPdfPassword
+----
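
To illustrate the rule format, here is a client-side sketch of the matching logic (hypothetical helper names, not Solr's parser; whether Solr anchors the regex or stops at the first match is an assumption here):

```python
import re

def load_password_rules(text):
    """Parse passwordsFile-style rules: one 'regex = password' per line;
    '#' comments and blank lines are ignored (illustrative only)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, _, password = line.partition("=")
        rules.append((re.compile(pattern.strip()), password.strip()))
    return rules

def password_for(rules, filename):
    # First matching rule wins (an assumption for this sketch).
    for pattern, password in rules:
        if pattern.match(filename):
            return password
    return None

rules = load_password_rules(r"""
# This is a comment
myFileName = myPassword
.*\.docx$ = myWordPassword
.*\.pdf$ = myPdfPassword
""")
print(password_for(rules, "report.pdf"))  # -> myPdfPassword
```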
+
+[[UploadingDatawithSolrCellusingApacheTika-Examples]]
+== Examples
+
+[[UploadingDatawithSolrCellusingApacheTika-Metadata]]
+=== Metadata
+
+As mentioned before, Tika produces metadata about the document. Metadata describes different aspects of a document, such as the author's name, the number of pages, the file size, and so on. The metadata produced depends on the type of document submitted. For instance, PDFs have different metadata than Word documents do.
+
+In addition to Tika's metadata, Solr adds the following metadata (defined in `ExtractingMetadataConstants`):
+
+// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed
+
+[cols="30,70",options="header"]
+|===
+|Solr Metadata |Description
+|stream_name |The name of the Content Stream as uploaded to Solr. Depending on how the file is uploaded, this may or may not be set.
+|stream_source_info |Any source info about the stream. (See the section on Content Streams later in this section.)
+|stream_size |The size of the stream in bytes.
+|stream_content_type |The content type of the stream, if available.
+|===
+
+[IMPORTANT]
+====
+
+We recommend that you try using the `extractOnly` option to discover which values Solr is setting for these metadata elements.
+
+====
+
+[[UploadingDatawithSolrCellusingApacheTika-ExamplesofUploadsUsingtheExtractingRequestHandler]]
+=== Examples of Uploads Using the Extracting Request Handler
+
+[[UploadingDatawithSolrCellusingApacheTika-CaptureandMapping]]
+==== Capture and Mapping
+
+The command below captures `<div>` tags separately, and then maps all the instances of that field to a dynamic field named `foo_t`.
+
+[source,bash]
+----
+bin/post -c techproducts example/exampledocs/sample.html -params "literal.id=doc2&captureAttr=true&defaultField=_text_&fmap.div=foo_t&capture=div"
+----
+
+[[UploadingDatawithSolrCellusingApacheTika-UsingLiteralstoDefineYourOwnMetadata]]
+==== Using Literals to Define Your Own Metadata
+
+To add in your own metadata, pass in the literal parameter along with the file:
+
+[source,bash]
+----
+bin/post -c techproducts -params "literal.id=doc4&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&literal.blah_s=Bah" example/exampledocs/sample.html
+----
+
+[[UploadingDatawithSolrCellusingApacheTika-XPath]]
+==== XPath
+
+The example below passes in an XPath expression to restrict the XHTML returned by Tika:
+
+[source,bash]
+----
+bin/post -c techproducts -params "literal.id=doc5&captureAttr=true&defaultField=text&capture=div&fmap.div=foo_t&xpath=/xhtml:html/xhtml:body/xhtml:div//node()" example/exampledocs/sample.html
+----
+
+[[UploadingDatawithSolrCellusingApacheTika-ExtractingDatawithoutIndexingIt]]
+=== Extracting Data without Indexing It
+
+Solr allows you to extract data without indexing. You might want to do this if you're using Solr solely as an extraction server or if you're interested in testing Solr extraction.
+
+The example below sets the `extractOnly=true` parameter to extract data without indexing it.
+
+[source,bash]
+----
+curl "http://localhost:8983/solr/techproducts/update/extract?&extractOnly=true" --data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'
+----
+
+The output includes the XML generated by Tika (further escaped by Solr's XML format). To view it more easily, use a different response format; the command below requests Ruby-style output, and `-out yes` instructs the tool to echo Solr's response to the console:
+
+[source,bash]
+----
+bin/post -c techproducts -params "extractOnly=true&wt=ruby&indent=true" -out yes example/exampledocs/sample.html
+----
+
+[[UploadingDatawithSolrCellusingApacheTika-SendingDocumentstoSolrwithaPOST]]
+== Sending Documents to Solr with a POST
+
+The example below streams the file as the body of the POST, which means Solr receives no information about the name of the file.
+
+[source,bash]
+----
+curl "http://localhost:8983/solr/techproducts/update/extract?literal.id=doc6&defaultField=text&commit=true" --data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'
+----
+
+[[UploadingDatawithSolrCellusingApacheTika-SendingDocumentstoSolrwithSolrCellandSolrJ]]
+== Sending Documents to Solr with Solr Cell and SolrJ
+
+SolrJ is a Java client that you can use to add documents to the index, update the index, or query the index. You'll find more information on SolrJ in <<client-apis.adoc#client-apis,Client APIs>>.
+
+Here's an example of using Solr Cell and SolrJ to add documents to a Solr index.
+
+First, let's use SolrJ to create a new SolrClient, then we'll construct a request containing a ContentStream (essentially a wrapper around a file) and send it to Solr:
+
+[source,java]
+----
+public class SolrCellRequestDemo {
+  public static void main (String[] args) throws IOException, SolrServerException {
+    SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/my_collection").build();
+    ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
+    req.addFile(new File("my-file.pdf"), "application/pdf");
+    req.setParam(ExtractingParams.EXTRACT_ONLY, "true");
+    NamedList<Object> result = client.request(req);
+    System.out.println("Result: " + result);
+    client.close();
+  }
+}
+----
+
+This operation streams the file `my-file.pdf` to Solr. Because the example sets `EXTRACT_ONLY` to `true`, the extracted content is returned in the response rather than indexed; remove that parameter to index the document in `my_collection`.
+
+The sample code above calls the extract command, but you can easily substitute other commands that are supported by Solr Cell. The key class to use is the `ContentStreamUpdateRequest`, which makes sure the ContentStreams are set properly. SolrJ takes care of the rest.
+
+Note that the `ContentStreamUpdateRequest` is not just specific to Solr Cell. You can send CSV to the CSV Update handler and to any other Request Handler that works with Content Streams for updates.

