From: ctargett@apache.org
To: commits@lucene.apache.org
Date: Fri, 12 May 2017 14:05:15 -0000
Subject: [07/37] lucene-solr:branch_6x: squash merge jira/solr-10290 into master

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/ccbc93b8/solr/solr-ref-guide/src/suggester.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/suggester.adoc b/solr/solr-ref-guide/src/suggester.adoc
new file mode 100644
index 0000000..8971860
--- /dev/null
+++ b/solr/solr-ref-guide/src/suggester.adoc
@@ -0,0 +1,460 @@
= Suggester
:page-shortname: suggester
:page-permalink: suggester.html

The SuggestComponent in Solr provides users with automatic suggestions for query terms.

You can use this to implement a powerful auto-suggest feature in your search application.

Although it is possible to use the <> functionality to power autosuggest behavior, Solr has a dedicated http://lucene.apache.org/solr/api/solr-core/org/apache/solr/handler/component/SuggestComponent.html[SuggestComponent] designed for this functionality.

This approach utilizes Lucene's Suggester implementation and supports all of the lookup implementations available in Lucene.

The main features of this Suggester are:

* Lookup implementation pluggability
* Term dictionary pluggability, giving you the flexibility to choose the dictionary implementation
* Distributed support

The `solrconfig.xml` found in Solr's "```techproducts```" example has the new Suggester implementation configured already. For more on search components, see the section <>.

[[Suggester-ConfiguringSuggesterinsolrconfig.xml]]
== Configuring Suggester in solrconfig.xml

The "```techproducts```" example `solrconfig.xml` has a `suggest` search component and a `/suggest` request handler already configured. You can use that as the basis for your configuration, or create it from scratch, as detailed below.
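If you want to see the Suggester in action before writing any configuration, a quick sanity check is to start the "```techproducts```" example and exercise its preconfigured suggester (a sketch, assuming a default local install; the handler and dictionary names are the ones shipped with the example and used in the <<Suggester-ExampleUsages,Example Usages>> section below):

[source,bash]
----
# Start Solr with the techproducts example, which ships with a
# preconfigured "suggest" search component and /suggest handler.
bin/solr -e techproducts

# Build the suggestion dictionary, then ask for suggestions for "elec".
curl "http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&suggest.q=elec"
----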

[[Suggester-AddingtheSuggestSearchComponent]]
=== Adding the Suggest Search Component

The first step is to add a search component to `solrconfig.xml` and tell it to use the SuggestComponent. Here is some sample code that could be used.

[source,xml]
----
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">cat</str>
    <str name="weightField">price</str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
----

[[Suggester-SuggesterSearchComponentParameters]]
==== Suggester Search Component Parameters

The Suggester search component takes several configuration parameters. The choice of the lookup implementation (`lookupImpl`, how terms are found in the suggestion dictionary) and the dictionary implementation (`dictionaryImpl`, how terms are stored in the suggestion dictionary) will dictate some of the parameters required. Below are the main parameters that can be used no matter what lookup or dictionary implementation is used. In the following sections additional parameters are provided for each implementation.

// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed

[cols="30,70",options="header"]
|===
|Parameter |Description
|searchComponent name |Arbitrary name for the search component.
|name |A symbolic name for this suggester. You can refer to this name in the URL parameters and in the SearchHandler configuration. It is possible to have multiple of these in a single configuration.
|lookupImpl |Lookup implementation. There are several possible implementations, described below in the section <<Suggester-LookupImplementations,Lookup Implementations>>. If not set, the default lookup is JaspellLookupFactory.
|dictionaryImpl |The dictionary implementation to use. There are several possible implementations, described below in the section <<Suggester-DictionaryImplementations,Dictionary Implementations>>. If not set, the default dictionary implementation is HighFrequencyDictionaryFactory, unless a `sourceLocation` is used, in which case the dictionary implementation will be FileDictionaryFactory.
|field a|
A field from the index to use as the basis of suggestion terms. If `sourceLocation` is empty (meaning any dictionary implementation other than FileDictionaryFactory), then terms from this field in the index will be used.

To be used as the basis for a suggestion, the field must be stored. You may want to <> to create a special 'suggest' field comprised of terms from other fields in documents. In any event, you likely want a minimal amount of analysis on the field, so an additional option is to create a field type in your schema that only uses basic tokenizers or filters. One option for such a field type is shown here:

[source,xml]
----
<fieldType class="solr.TextField" name="textSuggest" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
----

However, this minimal analysis is not required if you want more analysis to occur on terms. If using the AnalyzingLookupFactory as your lookupImpl, you have the option of defining the field type rules to use for index and query time analysis.

|sourceLocation |The path to the dictionary file if using the FileDictionaryFactory. If this value is empty then the main index will be used as a source of terms and weights.
|storeDir |The location to store the dictionary file.
|buildOnCommit or buildOnOptimize |If true, the lookup data structure will be rebuilt after a soft-commit. If false, the default, the lookup data will be built only when requested by the URL parameter `suggest.build=true`. Use `buildOnCommit` to rebuild the dictionary with every soft-commit, or `buildOnOptimize` to build the dictionary only when the index is optimized.
Some lookup implementations may take a long time to build, especially with large indexes. In such cases, using `buildOnCommit` or `buildOnOptimize`, particularly with a high frequency of soft-commits, is not recommended; instead, build the suggester at a lower frequency by manually issuing requests with `suggest.build=true`.
|buildOnStartup |If true, the lookup data structure will be built when Solr starts or when the core is reloaded. If this parameter is not specified, the suggester will check if the lookup data structure is present on disk and build it if not found. Setting this to true could lead to the core taking longer to load (or reload), as the suggester data structure needs to be built, which can sometimes take a long time. It's usually preferable to set this to 'false' and build suggesters manually by issuing requests with `suggest.build=true`.
|===

[[Suggester-LookupImplementations]]
==== Lookup Implementations

The `lookupImpl` parameter defines the algorithms used to look up terms in the suggest index. There are several possible implementations to choose from, and some require additional parameters to be configured.

[[Suggester-AnalyzingLookupFactory]]
===== AnalyzingLookupFactory

A lookup that first analyzes the incoming text and adds the analyzed form to a weighted FST, and then does the same thing at lookup time.

This implementation uses the following additional properties:

* suggestAnalyzerFieldType: The field type to use for the query-time and build-time term suggestion analysis.
* exactMatchFirst: If true, the default, exact suggestions are returned first, even if they are prefixes of other strings in the FST with larger weights.
* preserveSep: If true, the default, then a separator between tokens is preserved. This means that suggestions are sensitive to tokenization (e.g., baseball is different from base ball).
* preservePositionIncrements: If true, the suggester will preserve position increments. This means that gaps left by token filters (for example, when StopFilter matches a stopword) will be respected when building the suggester. The default is false.

[[Suggester-FuzzyLookupFactory]]
===== FuzzyLookupFactory

This is a suggester that extends the AnalyzingSuggester but is fuzzy in nature: similarity is measured by the Levenshtein algorithm.

This implementation uses the following additional properties:

* exactMatchFirst: If true, the default, exact suggestions are returned first, even if they are prefixes of other strings in the FST with larger weights.
* preserveSep: If true, the default, then a separator between tokens is preserved. This means that suggestions are sensitive to tokenization (e.g., baseball is different from base ball).
* maxSurfaceFormsPerAnalyzedForm: Maximum number of surface forms to keep for a single analyzed form. When there are too many surface forms, we discard the lowest-weighted ones.
* maxGraphExpansions: When building the FST ("index-time"), we add each path through the tokenstream graph as an individual entry. This places an upper bound on how many expansions will be added for a single suggestion. The default is -1, which means there is no limit.
* preservePositionIncrements: If true, the suggester will preserve position increments. This means that gaps left by token filters (for example, when StopFilter matches a stopword) will be respected when building the suggester. The default is false.
* maxEdits: The maximum number of string edits allowed. The system's hard limit is 2. The default is 1.
* transpositions: If true, the default, transpositions are treated as a primitive edit operation.
* nonFuzzyPrefix: The length of the common non-fuzzy prefix that must match a suggestion. The default is 1.
* minFuzzyLength: The minimum query length before any string edits will be allowed. The default is 3.
* unicodeAware: If true, the maxEdits, minFuzzyLength, transpositions and nonFuzzyPrefix parameters will be measured in Unicode code points (actual letters) instead of bytes. The default is false.

[[Suggester-AnalyzingInfixLookupFactory]]
===== AnalyzingInfixLookupFactory

Analyzes the input text and then suggests matches based on prefix matches to any tokens in the indexed text. This uses a Lucene index for its dictionary.

This implementation uses the following additional properties:

* indexPath: When using AnalyzingInfixSuggester you can provide your own path where the index will get built. The default is analyzingInfixSuggesterIndexDir, created in your collection's data directory.
* minPrefixChars: Minimum number of leading characters before PrefixQuery is used (the default is 4). Prefixes shorter than this are indexed as character ngrams (increasing index size but making lookups faster).
* allTermsRequired: Boolean option for multiple terms. The default is true, meaning all terms are required.
* highlight: Highlight suggest terms. The default is true.

This implementation supports <<Suggester-ContextFiltering,Context Filtering>>.

[[Suggester-BlendedInfixLookupFactory]]
===== BlendedInfixLookupFactory

An extension of the AnalyzingInfixSuggester which provides additional functionality to weight prefix matches across the matched documents. You can tell it to score higher if a hit is closer to the start of the suggestion, or vice versa.

This implementation uses the following additional properties:

* blenderType: Used to calculate the weight coefficient from the position of the first matching word. Can be one of:
** position_linear: weightFieldValue*(1 - 0.10*position): matches at the start will be given a higher score (the default).
** position_reciprocal: weightFieldValue/(1+position): matches at the end will be given a higher score.
*** exponent: an optional configuration variable for the position_reciprocal blenderType, used to control how fast the score will increase or decrease. The default is 2.0.
* numFactor: The factor by which to multiply the number of searched elements from which results will be pruned. The default is 10.
* indexPath: When using BlendedInfixSuggester you can provide your own path where the index will get built. The default directory name is blendedInfixSuggesterIndexDir, created in your collection's data directory.
* minPrefixChars: Minimum number of leading characters before PrefixQuery is used (the default is 4). Prefixes shorter than this are indexed as character ngrams (increasing index size but making lookups faster).

This implementation supports <<Suggester-ContextFiltering,Context Filtering>>.

[[Suggester-FreeTextLookupFactory]]
===== FreeTextLookupFactory

This suggester looks at the last tokens, plus the prefix of whatever final token the user is typing (if present), to predict the most likely next token. The number of previous tokens that need to be considered can also be specified. This suggester would only be used as a fallback, when the primary suggester fails to find any suggestions.

This implementation uses the following additional properties:

* suggestFreeTextAnalyzerFieldType: The analyzer used at "query-time" and "build-time" to analyze suggestions. This property is required.
* ngrams: The maximum number of tokens from which the dictionary will be built. The default value is 2. Increasing this means more than the previous 2 tokens will be taken into consideration when making suggestions.

[[Suggester-FSTLookupFactory]]
===== FSTLookupFactory

An automaton-based lookup. This implementation is slower to build, but provides the lowest memory cost. We recommend using this implementation unless you need more sophisticated matching results, in which case you should use the Jaspell implementation.

This implementation uses the following additional properties:

* exactMatchFirst: If true, the default, exact suggestions are returned first, even if they are prefixes of other strings in the FST with larger weights.
* weightBuckets: The number of separate buckets for weights which the suggester will use while building its dictionary.

[[Suggester-TSTLookupFactory]]
===== TSTLookupFactory

A simple, compact, ternary trie-based lookup.

[[Suggester-WFSTLookupFactory]]
===== WFSTLookupFactory

A weighted automaton representation which is an alternative to FSTLookup for more fine-grained ranking. WFSTLookup does not use buckets, but instead a shortest path algorithm. Note that it expects weights to be whole numbers. If a weight is missing, it's assumed to be 1.0. Weights affect the sorting of matching suggestions when `spellcheck.onlyMorePopular=true` is selected: weights are treated as "popularity" scores, with higher weights preferred over suggestions with lower weights.

[[Suggester-JaspellLookupFactory]]
===== JaspellLookupFactory

A more complex lookup based on a ternary trie from the http://jaspell.sourceforge.net/[JaSpell] project. Use this implementation if you need more sophisticated matching results.

[[Suggester-DictionaryImplementations]]
==== Dictionary Implementations

The dictionary implementations define how terms are stored. There are several options, and multiple dictionaries can be used in a single request if necessary.

[[Suggester-DocumentDictionaryFactory]]
===== DocumentDictionaryFactory

A dictionary with terms, weights, and an optional payload taken from the index.

This dictionary implementation takes the following parameters in addition to parameters described for the Suggester generally and for the lookup implementation:

* weightField: A field that is stored, or a numeric DocValues field. This parameter is optional.
* payloadField: The payloadField should be a field that is stored. This parameter is optional.
* contextField: Field to be used for context filtering. Note that only some lookup implementations support filtering.

[[Suggester-DocumentExpressionDictionaryFactory]]
===== DocumentExpressionDictionaryFactory

This dictionary implementation is the same as the DocumentDictionaryFactory, but it allows users to specify an arbitrary expression in the 'weightExpression' tag.

This dictionary implementation takes the following parameters in addition to parameters described for the Suggester generally and for the lookup implementation:

* payloadField: The payloadField should be a field that is stored. This parameter is optional.
* weightExpression: An arbitrary expression used for scoring the suggestions. The fields used must be numeric fields. This parameter is required.
* contextField: Field to be used for context filtering. Note that only some lookup implementations support filtering.

[[Suggester-HighFrequencyDictionaryFactory]]
===== HighFrequencyDictionaryFactory

This dictionary implementation allows adding a threshold to prune out less frequent terms, in cases where very common terms may overwhelm other terms.

This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally and for the lookup implementation:

* threshold: A value between zero and one representing the minimum fraction of the total documents in which a term should appear in order to be added to the lookup dictionary.

[[Suggester-FileDictionaryFactory]]
===== FileDictionaryFactory

This dictionary implementation allows using an external file that contains suggest entries. Weights and payloads can also be used.

If using a dictionary file, it should be a plain text file in UTF-8 encoding. You can use both single terms and phrases in the dictionary file. If adding weights or payloads, those should be separated from terms using the delimiter defined with the `fieldDelimiter` property (the default is '\t', the tab representation). If using payloads, the first line in the file *must* specify a payload.

This dictionary implementation takes one parameter in addition to parameters described for the Suggester generally and for the lookup implementation:

fieldDelimiter:: Specifies the delimiter used to separate entries, weights, and payloads. The default is tab ('\t').

.Example File
[source,text]
----
acquire
accidentally    2.0
accommodate     3.0
----

[[Suggester-MultipleDictionaries]]
==== Multiple Dictionaries

It is possible to include multiple `dictionaryImpl` definitions in a single SuggestComponent definition.

To do this, simply define separate suggesters, as in this example:

[source,xml]
----
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">cat</str>
    <str name="weightField">price</str>
    <str name="suggestAnalyzerFieldType">string</str>
  </lst>
  <lst name="suggester">
    <str name="name">altSuggester</str>
    <str name="dictionaryImpl">DocumentExpressionDictionaryFactory</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="field">product_name</str>
    <str name="weightExpression">((price * 2) + ln(popularity))</str>
    <str name="sortField">weight</str>
    <str name="sortField">price</str>
    <str name="storeDir">suggest_fuzzy_doc_expr_dict</str>
    <str name="suggestAnalyzerFieldType">text_en</str>
  </lst>
</searchComponent>
----

When using these Suggesters in a query, you would define multiple 'suggest.dictionary' parameters in the request, referring to the names given for each Suggester in the search component definition. The response will include the terms in sections for each Suggester. See the <<Suggester-MultipleDictionaries.1,Multiple Dictionaries>> section below for an example request and response.

[[Suggester-AddingtheSuggestRequestHandler]]
=== Adding the Suggest Request Handler

After adding the search component, a request handler must be added to `solrconfig.xml`. This request handler works like any other <>, and allows you to configure default parameters for serving suggestion requests. The request handler definition must incorporate the "suggest" search component defined previously.

[source,xml]
----
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
----

[[Suggester-SuggestRequestHandlerParameters]]
==== Suggest Request Handler Parameters

The following parameters allow you to set defaults for the Suggest request handler:

// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed

[cols="30,70",options="header"]
|===
|Parameter |Description
|suggest=true |This parameter should always be true, because we always want to run the Suggester for queries submitted to this handler.
|suggest.dictionary |The name of the dictionary component configured in the search component. This is a mandatory parameter. It can be set in the request handler, or sent as a parameter at query time.
|suggest.q |The query to use for suggestion lookups.
|suggest.count |Specifies the number of suggestions for Solr to return.
|suggest.cfq |A Context Filter Query used to filter suggestions based on the context field, if supported by the suggester.
|suggest.build |If true, it will build the suggester index. This is likely useful only for initial requests; you would probably not want to build the dictionary on every request, particularly in a production system. If you would like to keep your dictionary up to date, you should use the `buildOnCommit` or `buildOnOptimize` parameter for the search component.
|suggest.reload |If true, it will reload the suggester index.
|suggest.buildAll |If true, it will build all suggester indexes.
|suggest.reloadAll |If true, it will reload all suggester indexes.
|===

These properties can also be overridden at query time, or not set in the request handler at all and always sent at query time.

.Context Filtering
[IMPORTANT]
====
Context filtering (`suggest.cfq`) is currently only supported by AnalyzingInfixLookupFactory and BlendedInfixLookupFactory, and only when backed by a Document*Dictionary. All other implementations will return unfiltered matches as if filtering was not requested.
====

[[Suggester-ExampleUsages]]
== Example Usages

[[Suggester-GetSuggestionswithWeights]]
=== Get Suggestions with Weights

This is the basic suggestion using a single dictionary and a single Solr core.

Example query:

[source,text]
----
http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true&suggest.dictionary=mySuggester&wt=json&suggest.q=elec
----

In this example, we've simply requested the string 'elec' with the suggest.q parameter and requested that the suggestion dictionary be built with suggest.build (note, however, that you would likely not want to build the index on every query - instead you should use buildOnCommit or buildOnOptimize if you have regularly changing documents).

Example response:

[source,json]
----
{
  "responseHeader": {
    "status": 0,
    "QTime": 35
  },
  "command": "build",
  "suggest": {
    "mySuggester": {
      "elec": {
        "numFound": 3,
        "suggestions": [
          {
            "term": "electronics and computer1",
            "weight": 2199,
            "payload": ""
          },
          {
            "term": "electronics",
            "weight": 649,
            "payload": ""
          },
          {
            "term": "electronics and stuff2",
            "weight": 279,
            "payload": ""
          }
        ]
      }
    }
  }
}
----

[[Suggester-MultipleDictionaries.1]]
=== Multiple Dictionaries

If you have defined multiple dictionaries, you can use them in queries.

Example query:

[source,text]
----
http://localhost:8983/solr/techproducts/suggest?suggest=true& \
  suggest.dictionary=mySuggester&suggest.dictionary=altSuggester&wt=json&suggest.q=elec
----

In this example we have sent the string 'elec' as the suggest.q parameter and named two suggest.dictionary definitions to be used.

Example response:

[source,json]
----
{
  "responseHeader": {
    "status": 0,
    "QTime": 3
  },
  "suggest": {
    "mySuggester": {
      "elec": {
        "numFound": 1,
        "suggestions": [
          {
            "term": "electronics and computer1",
            "weight": 100,
            "payload": ""
          }
        ]
      }
    },
    "altSuggester": {
      "elec": {
        "numFound": 1,
        "suggestions": [
          {
            "term": "electronics and computer1",
            "weight": 10,
            "payload": ""
          }
        ]
      }
    }
  }
}
----

[[Suggester-ContextFiltering]]
=== Context Filtering

Context filtering lets you filter suggestions by a separate context field, such as category, department, or any other token. The AnalyzingInfixLookupFactory and BlendedInfixLookupFactory currently support this feature, when backed by DocumentDictionaryFactory.

Add `contextField` to your suggester configuration. This example will suggest names and allow filtering by category:

.solrconfig.xml
[source,xml]
----
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="weightField">price</str>
    <str name="contextField">cat</str>
    <str name="suggestAnalyzerFieldType">string</str>
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
----

Example context filtering suggest query:

[source,text]
----
http://localhost:8983/solr/techproducts/suggest?suggest=true&suggest.build=true& \
  suggest.dictionary=mySuggester&wt=json&suggest.q=c&suggest.cfq=memory
----

The suggester will only bring back suggestions for products tagged with cat=memory.

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/ccbc93b8/solr/solr-ref-guide/src/taking-solr-to-production.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/taking-solr-to-production.adoc b/solr/solr-ref-guide/src/taking-solr-to-production.adoc
new file mode 100644
index 0000000..833a5a7
--- /dev/null
+++ b/solr/solr-ref-guide/src/taking-solr-to-production.adoc
@@ -0,0 +1,287 @@
= Taking Solr to Production
:page-shortname: taking-solr-to-production
:page-permalink: taking-solr-to-production.html

This section provides guidance on how to set up Solr to run in production on *nix platforms, such as Ubuntu. Specifically, we'll walk through the process of setting up a single Solr instance on a Linux host, and then provide tips on how to support multiple Solr nodes running on the same host.

[[TakingSolrtoProduction-ServiceInstallationScript]]
== Service Installation Script

Solr includes a service installation script (`bin/install_solr_service.sh`) to help you install Solr as a service on Linux. Currently, the script only supports CentOS, Debian, Red Hat, SUSE and Ubuntu Linux distributions. Before running the script, you need to determine a few parameters about your setup. Specifically, you need to decide where to install Solr and which system user should be the owner of the Solr files and process.

[[TakingSolrtoProduction-Planningyourdirectorystructure]]
=== Planning your directory structure

We recommend separating your live Solr files, such as logs and index files, from the files included in the Solr distribution bundle, as that makes it easier to upgrade Solr and is considered a good practice to follow as a system administrator.

[[TakingSolrtoProduction-SolrInstallationDirectory]]
==== Solr Installation Directory

By default, the service installation script will extract the distribution archive into `/opt`. You can change this location using the `-i` option when running the installation script. The script will also create a symbolic link to the versioned directory of Solr.
For instance, if you run the installation script for Solr {solr-docs-version}.0, then the following directory structure will be used:

[source,plain,subs="attributes"]
----
/opt/solr-{solr-docs-version}.0
/opt/solr -> /opt/solr-{solr-docs-version}.0
----

Using a symbolic link insulates any scripts from being dependent on the specific Solr version. If, down the road, you need to upgrade to a later version of Solr, you can just update the symbolic link to point to the upgraded version of Solr. We'll use `/opt/solr` to refer to the Solr installation directory in the remaining sections of this page.

[[TakingSolrtoProduction-SeparateDirectoryforWritableFiles]]
==== Separate Directory for Writable Files

You should also separate writable Solr files into a different directory; by default, the installation script uses `/var/solr`, but you can override this location using the `-d` option. With this approach, the files in `/opt/solr` will remain untouched and all files that change while Solr is running will live under `/var/solr`.

[[TakingSolrtoProduction-CreatetheSolruser]]
=== Create the Solr user

Running Solr as `root` is not recommended for security reasons, and the <> start command will refuse to do so. Consequently, you should determine the username of a system user that will own all of the Solr files and the running Solr process. By default, the installation script will create the *solr* user, but you can override this setting using the `-u` option. If your organization has specific requirements for creating new user accounts, then you should create the user before running the script. The installation script will make the Solr user the owner of the `/opt/solr` and `/var/solr` directories.

You are now ready to run the installation script.

[[TakingSolrtoProduction-RuntheSolrInstallationScript]]
=== Run the Solr Installation Script

To run the script, you'll need to download the latest Solr distribution archive and then do the following:

[source,bash,subs="attributes"]
----
tar xzf solr-{solr-docs-version}.0.tgz solr-{solr-docs-version}.0/bin/install_solr_service.sh --strip-components=2
----

The previous command extracts the `install_solr_service.sh` script from the archive into the current directory. If installing on Red Hat, please make sure *lsof* is installed before running the Solr installation script (`sudo yum install lsof`). The installation script must be run as root:

[source,bash,subs="attributes"]
----
sudo bash ./install_solr_service.sh solr-{solr-docs-version}.0.tgz
----

By default, the script extracts the distribution archive into `/opt`, configures Solr to write files into `/var/solr`, and runs Solr as the `solr` user. Consequently, the following command produces the same result as the previous command:

[source,bash,subs="attributes"]
----
sudo bash ./install_solr_service.sh solr-{solr-docs-version}.0.tgz -i /opt -d /var/solr -u solr -s solr -p 8983
----

You can customize the service name, installation directories, port, and owner using options passed to the installation script. To see available options, simply do:

[source,bash]
----
sudo bash ./install_solr_service.sh -help
----

Once the script completes, Solr will be installed as a service and running in the background on your server (on port 8983). To verify, you can do:

[source,bash]
----
sudo service solr status
----

If you do not want to start the service immediately, pass the `-n` option.
You can then start the service manually later, e.g., after completing the configuration setup.

We'll cover some additional configuration settings you can make to fine-tune your Solr setup in a moment. Before moving on, let's take a closer look at the steps performed by the installation script. This gives you a better overview and will help you understand important details about your Solr installation when reading other pages in this guide; for example, when a page refers to Solr home, you'll know exactly where that is on your system.

[[TakingSolrtoProduction-SolrHomeDirectory]]
==== Solr Home Directory

The Solr home directory (not to be confused with the Solr installation directory) is where Solr manages core directories with index files. By default, the installation script uses `/var/solr/data`. If the `-d` option is used on the install script, then this will change to the `data` subdirectory in the location given to the `-d` option. Take a moment to inspect the contents of the Solr home directory on your system. If you do not <>, the home directory must contain a `solr.xml` file. When Solr starts up, the Solr Control Script passes the location of the home directory using the `-Dsolr.solr.home=...` system property.

[[TakingSolrtoProduction-Environmentoverridesincludefile]]
==== Environment overrides include file

The service installation script creates an environment-specific include file that overrides defaults used by the `bin/solr` script. The main advantage of using an include file is that it provides a single location where all of your environment-specific overrides are defined. Take a moment to inspect the contents of the `/etc/default/solr.in.sh` file, which is the default path set up by the installation script. If you used the `-s` option on the install script to change the name of the service, then the first part of the filename will be different. For a service named `solr-demo`, the file will be named `/etc/default/solr-demo.in.sh`. There are many settings that you can override using this file. However, at a minimum, this script needs to define the `SOLR_PID_DIR` and `SOLR_HOME` variables, such as:

[source,bash]
----
SOLR_PID_DIR=/var/solr
SOLR_HOME=/var/solr/data
----

The `SOLR_PID_DIR` variable sets the directory where the <> will write out a file containing the Solr server's process ID.

[[TakingSolrtoProduction-Logsettings]]
==== Log settings

Solr uses Apache Log4J for logging. The installation script copies `/opt/solr/server/resources/log4j.properties` to `/var/solr/log4j.properties`. Take a moment to verify that the Solr include file is configured to send logs to the correct location by checking the following settings in `/etc/default/solr.in.sh`:

[source,bash]
----
LOG4J_PROPS=/var/solr/log4j.properties
SOLR_LOGS_DIR=/var/solr/logs
----

For more information about Log4J configuration, please see: <>

[[TakingSolrtoProduction-init.dscript]]
==== init.d script

When running a service like Solr on Linux, it's common to set up an init.d script so that system administrators can control Solr using the service tool, such as: `service solr start`. The installation script creates a very basic init.d script to help you get started. Take a moment to inspect the `/etc/init.d/solr` file, which is the default script name set up by the installation script. If you used the `-s` option on the install script to change the name of the service, then the filename will be different.
Notice that the following variables are set up for your environment based on the parameters passed to the installation script:

[source,bash]
----
SOLR_INSTALL_DIR=/opt/solr
SOLR_ENV=/etc/default/solr.in.sh
RUNAS=solr
----

The `SOLR_INSTALL_DIR` and `SOLR_ENV` variables should be self-explanatory. The `RUNAS` variable sets the owner of the Solr process, such as `solr`; if you don't set this value, the script will run Solr as **root**, which is not recommended for production. You can use the `/etc/init.d/solr` script to start Solr by doing the following as root:

[source,bash]
----
service solr start
----

The `/etc/init.d/solr` script also supports the **stop**, **restart**, and *status* commands. Please keep in mind that the init script that ships with Solr is very basic and is intended to show you how to set up Solr as a service. However, it's also common to use more advanced tools like *supervisord* or *upstart* to control Solr as a service on Linux. While showing how to integrate Solr with tools like supervisord is beyond the scope of this guide, the `init.d/solr` script should provide enough guidance to help you get started. Also, the installation script sets the Solr service to start automatically when the host machine initializes.

[[TakingSolrtoProduction-ProgressCheck]]
=== Progress Check

In the next section, we cover some additional environment settings to help you fine-tune your production setup. However, before we move on, let's review what we've achieved thus far. Specifically, you should be able to control Solr using `/etc/init.d/solr`. Please verify the following commands work with your setup:

[source,bash]
----
sudo service solr restart
sudo service solr status
----

The status command should give some basic information about the running Solr node that looks similar to:

[source,text]
----
Solr process PID running on port 8983
{
  "version":"5.0.0 - ubuntu - 2014-12-17 19:36:58",
  "startTime":"2014-12-19T19:25:46.853Z",
  "uptime":"0 days, 0 hours, 0 minutes, 8 seconds",
  "memory":"85.4 MB (%17.4) of 490.7 MB"}
----

If the `status` command is not successful, look for error messages in `/var/solr/logs/solr.log`.

[[TakingSolrtoProduction-Finetuneyourproductionsetup]]
== Fine tune your production setup

[[TakingSolrtoProduction-MemoryandGCSettings]]
=== Memory and GC Settings

By default, the `bin/solr` script sets the maximum Java heap size to 512M (`-Xmx512m`), which is fine for getting started with Solr. For production, you'll want to increase the maximum heap size based on the memory requirements of your search application; values between 10 and 20 gigabytes are not uncommon for production servers. When you need to change the memory settings for your Solr server, use the `SOLR_JAVA_MEM` variable in the include file, such as:

[source,bash]
----
SOLR_JAVA_MEM="-Xms10g -Xmx10g"
----

Also, the <> comes with a set of pre-configured Java Garbage Collection settings that have been shown to work well with Solr for a number of different workloads. However, these settings may not work well for your specific use of Solr. Consequently, you may need to change the GC settings, which should also be done with the `GC_TUNE` variable in the `/etc/default/solr.in.sh` include file. For more information about tuning your memory and garbage collection settings, see: <>.
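As an illustration, an override in the include file might look like the following sketch; the specific flags are examples only, not recommendations, since appropriate values depend on your heap size, JVM version, and workload:

[source,bash]
----
# Illustrative only: replaces the default GC settings with the G1
# collector and a pause-time target. Tune for your own workload.
GC_TUNE="-XX:+UseG1GC \
-XX:+ParallelRefProcEnabled \
-XX:MaxGCPauseMillis=250"
----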

[[TakingSolrtoProduction-Out-of-MemoryShutdownHook]]
==== Out-of-Memory Shutdown Hook

The `bin/solr` script registers the `bin/oom_solr.sh` script to be called by the JVM if an OutOfMemoryError occurs. The `oom_solr.sh` script will issue a `kill -9` to the Solr process that experiences the `OutOfMemoryError`. This behavior is recommended when running in SolrCloud mode so that ZooKeeper is immediately notified that a node has experienced a non-recoverable error. Take a moment to inspect the contents of the `/opt/solr/bin/oom_solr.sh` script so that you are familiar with the actions the script will perform if it is invoked by the JVM.

[[TakingSolrtoProduction-SolrCloud]]
=== SolrCloud

To run Solr in SolrCloud mode, you need to set the `ZK_HOST` variable in the include file to point to your ZooKeeper ensemble. Running the embedded ZooKeeper is not supported in production environments. For instance, if you have a ZooKeeper ensemble hosted on the following three hosts on the default client port 2181 (zk1, zk2, and zk3), then you would set:

[source,bash]
----
ZK_HOST=zk1,zk2,zk3
----

When the `ZK_HOST` variable is set, Solr will launch in "cloud" mode.

[[TakingSolrtoProduction-ZooKeeperchroot]]
==== ZooKeeper chroot

If you're using a ZooKeeper instance that is shared by other systems, it's recommended to isolate the SolrCloud znode tree using ZooKeeper's chroot support. For instance, to ensure all znodes created by SolrCloud are stored under `/solr`, you can put `/solr` on the end of your `ZK_HOST` connection string, such as:

[source,bash]
----
ZK_HOST=zk1,zk2,zk3/solr
----

Before using a chroot for the first time, you need to create the root path (znode) in ZooKeeper by using the <>. We can use the mkroot command for that:

[source,bash]
----
bin/solr zk mkroot /solr -z zk1:2181,zk2:2181,zk3:2181
----

[NOTE]
====
If you also want to bootstrap ZooKeeper with an existing `solr_home`, you can instead use `zkcli.sh` / `zkcli.bat`'s `bootstrap` command, which will also create the chroot path if it does not exist. See <> for more info.
====

[[TakingSolrtoProduction-SolrHostname]]
=== Solr Hostname

Use the `SOLR_HOST` variable in the include file to set the hostname of the Solr server.

[source,bash]
----
SOLR_HOST=solr1.example.com
----

Setting the hostname of the Solr server is recommended, especially when running in SolrCloud mode, as this determines the address of the node when it registers with ZooKeeper.

[[TakingSolrtoProduction-Overridesettingsinsolrconfig.xml]]
=== Override settings in solrconfig.xml

Solr allows configuration properties to be overridden using Java system properties passed at startup using the `-Dproperty=value` syntax. For instance, in `solrconfig.xml`, the default auto soft-commit settings are set to:

[source,xml]
----
<autoSoftCommit>
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
----

In general, whenever you see a property in a Solr configuration file that uses the `${solr.PROPERTY:DEFAULT_VALUE}` syntax, you know it can be overridden using a Java system property. For instance, to set the maxTime for soft-commits to 10 seconds, you can start Solr with `-Dsolr.autoSoftCommit.maxTime=10000`, such as:

[source,bash]
----
bin/solr start -Dsolr.autoSoftCommit.maxTime=10000
----

The `bin/solr` script simply passes options starting with `-D` on to the JVM during startup. For running in production, we recommend setting these properties in the `SOLR_OPTS` variable defined in the include file.
Keeping with our soft-commit example, in `/etc/default/solr.in.sh`, you would do:

[source,bash]
----
SOLR_OPTS="$SOLR_OPTS -Dsolr.autoSoftCommit.maxTime=10000"
----

[[TakingSolrtoProduction-RunningmultipleSolrnodesperhost]]
== Running multiple Solr nodes per host

The `bin/solr` script is capable of running multiple instances on one machine, but for a *typical* installation, this is not a recommended setup. Extra CPU and memory resources are required for each additional instance. A single instance is easily capable of handling multiple indexes.

.When to ignore the recommendation
[NOTE]
====
For every recommendation, there are exceptions. For the recommendation above, that exception is mostly applicable when discussing extreme scalability. The best reason for running multiple Solr nodes on one host is decreasing the need for extremely large heaps.

When the Java heap gets very large, it can result in extremely long garbage collection pauses, even with the GC tuning that the startup script provides by default. The exact point at which the heap is considered "very large" will vary depending on how Solr is used. This means that there is no hard number that can be given as a threshold, but if your heap is reaching the neighborhood of 16 to 32 gigabytes, it might be time to consider splitting nodes. Ideally this would mean more machines, but budget constraints might make that impossible.

There is another issue once the heap reaches 32GB. Below 32GB, Java is able to use compressed pointers, but above that point, larger pointers are required, which uses more memory and slows down the JVM.

Because of the potential garbage collection issues and the particular issues that happen at 32GB, if a single instance would require a 64GB heap, performance is likely to improve greatly if the machine is set up with two nodes that each have a 31GB heap.
====

If your use case requires multiple instances, at a minimum you will need unique Solr home directories for each node you want to run; ideally, each home should be on a different physical disk so that multiple Solr nodes don't have to compete with each other when accessing files on disk. Having different Solr home directories implies that you'll need a different include file for each node. Moreover, if using the `/etc/init.d/solr` script to control Solr as a service, then you'll need a separate script for each node. The easiest approach is to use the service installation script to add multiple services on the same host, such as:

[source,bash,subs="attributes"]
----
sudo bash ./install_solr_service.sh solr-{solr-docs-version}.0.tgz -s solr2 -p 8984
----

The command shown above will add a service named `solr2` running on port 8984, using `/var/solr2` for writable (aka "live") files; the second server will still be owned and run by the `solr` user and will use the Solr distribution files in `/opt`.
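Each additional service also gets its own include file where per-node overrides live; following the naming convention described earlier, the file for the `solr2` service would be `/etc/default/solr2.in.sh`, with settings along these lines (the values shown are illustrative):

[source,bash]
----
# Per-node overrides for the second service, kept separate from
# the first node's include file.
SOLR_PID_DIR=/var/solr2
SOLR_HOME=/var/solr2/data
SOLR_PORT=8984
----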
After installing the solr2 service, verify that it works correctly by doing:

[source,bash]
----
sudo service solr2 restart
sudo service solr2 status
----

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/ccbc93b8/solr/solr-ref-guide/src/the-dismax-query-parser.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/the-dismax-query-parser.adoc b/solr/solr-ref-guide/src/the-dismax-query-parser.adoc
new file mode 100644
index 0000000..585e222
--- /dev/null
+++ b/solr/solr-ref-guide/src/the-dismax-query-parser.adoc
@@ -0,0 +1,215 @@
= The DisMax Query Parser
:page-shortname: the-dismax-query-parser
:page-permalink: the-dismax-query-parser.html

The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field. Additional options enable users to influence the score based on rules specific to each use case (independent of user input).

In general, the DisMax query parser's interface is more like that of Google than the interface of the 'standard' Solr request handler. This similarity makes DisMax the appropriate query parser for many consumer applications. It accepts a simple syntax, and it rarely produces error messages.

The DisMax query parser supports an extremely simplified subset of the Lucene QueryParser syntax. As in Lucene, quotes can be used to group phrases, and +/- can be used to denote mandatory and optional clauses. All other Lucene query parser special characters (except AND and OR) are escaped to simplify the user experience. The DisMax query parser takes responsibility for building a good query from the user's input using Boolean clauses containing DisMax queries across fields and boosts specified by the user. It also lets the Solr administrator provide additional boosting queries, boosting functions, and filtering queries to artificially affect the outcome of all searches. These options can all be specified as default parameters for the handler in the `solrconfig.xml` file or overridden in the Solr query URL.

Interested in the technical concept behind the DisMax name? DisMax stands for Maximum Disjunction. Here's a definition of a Maximum Disjunction or "DisMax" query:

[quote]
____
A query that generates the union of documents produced by its subqueries, and that scores each document with the maximum score for that document as produced by any subquery, plus a tie breaking increment for any additional matching subqueries.
____

Whether or not you remember this explanation, do remember that the DisMax Query Parser was primarily designed to be easy to use and to accept almost any input without returning an error.

[[TheDisMaxQueryParser-DisMaxParameters]]
== DisMax Parameters

In addition to the common request parameters, highlighting parameters, and simple facet parameters, the DisMax query parser supports the parameters described below. Like the standard query parser, the DisMax query parser allows default parameter values to be specified in `solrconfig.xml`, or overridden by query-time values in the request.

// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed

[cols="30,70",options="header"]
|===
|Parameter |Description
|<<TheDisMaxQueryParser-TheqParameter,q>> |Defines the raw input strings for the query.
|<<TheDisMaxQueryParser-Theq.altParameter,q.alt>> |Calls the standard query parser and defines query input strings, when the `q` parameter is not used.
|<<TheDisMaxQueryParser-Theqf_QueryFields_Parameter,qf>> |Query Fields: specifies the fields in the index on which to perform the query. If absent, defaults to `df`.
|<<TheDisMaxQueryParser-Themm_MinimumShouldMatch_Parameter,mm>> |Minimum "Should" Match: specifies a minimum number of clauses that must match in a query. If no 'mm' parameter is specified in the query, or as a default in `solrconfig.xml`, the effective value of the `q.op` parameter (either in the query, as a default in `solrconfig.xml`, or from the `defaultOperator` option in the Schema) is used to influence the behavior. If `q.op` is effectively AND'ed, then mm=100%; if `q.op` is OR'ed, then mm=1. Users who want to force the legacy behavior should set a default value for the 'mm' parameter in their `solrconfig.xml` file. Users should add this as a configured default for their request handlers. This parameter tolerates miscellaneous white space in expressions (e.g., `" 3 < -25% 10 < -3\n", " \n-25%\n ", " \n3\n "`).
|<<TheDisMaxQueryParser-Thepf_PhraseFields_Parameter,pf>> |Phrase Fields: boosts the score of documents in cases where all of the terms in the `q` parameter appear in close proximity.
|<<TheDisMaxQueryParser-Theps_PhraseSlop_Parameter,ps>> |Phrase Slop: specifies the number of positions two terms can be apart in order to match the specified phrase.
|<<TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter,qs>> |Query Phrase Slop: specifies the number of positions two terms can be apart in order to match the specified phrase. Used specifically with the `qf` parameter.
|<<TheDisMaxQueryParser-Thetie_TieBreaker_Parameter,tie>> |Tie Breaker: specifies a float value (which should be something much less than 1) to use as a tiebreaker in DisMax queries. Default: 0.0
|<<TheDisMaxQueryParser-Thebq_BoostQuery_Parameter,bq>> |Boost Query: specifies a factor by which a term or phrase should be "boosted" in importance when considering a match.
|<<TheDisMaxQueryParser-Thebf_BoostFunctions_Parameter,bf>> |Boost Functions: specifies functions to be applied to boosts. (See <> for details about function queries.)
|===

The sections below explain these parameters in detail.

[[TheDisMaxQueryParser-TheqParameter]]
=== The `q` Parameter

The `q` parameter defines the main "query" constituting the essence of the search. The parameter supports raw input strings provided by users with no special escaping. The + and - characters are treated as "mandatory" and "prohibited" modifiers for terms. Text wrapped in balanced quote characters (for example, "San Jose") is treated as a phrase. Any query containing an odd number of quote characters is evaluated as if there were no quote characters at all.

[IMPORTANT]
====
The `q` parameter does not support wildcard characters such as *.
====

[[TheDisMaxQueryParser-Theq.altParameter]]
=== The `q.alt` Parameter

If specified, the `q.alt` parameter defines a query (which by default will be parsed using standard query parsing syntax) to be used when the main `q` parameter is not specified or is blank. The `q.alt` parameter comes in handy when you need something like a query to match all documents (don't forget `&rows=0` for that one!) in order to get collection-wide faceting counts.

[[TheDisMaxQueryParser-Theqf_QueryFields_Parameter]]
=== The `qf` (Query Fields) Parameter

The `qf` parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field's importance in the query. For example, the query below:

`qf="fieldOne^2.3 fieldTwo fieldThree^0.4"`

assigns `fieldOne` a boost of 2.3, leaves `fieldTwo` with the default boost (because no boost factor is specified), and gives `fieldThree` a boost of 0.4.
These boost factors make matches in `fieldOne` much more significant than matches in `fieldTwo`, which in turn are much more significant than matches in `fieldThree`.

[[TheDisMaxQueryParser-Themm_MinimumShouldMatch_Parameter]]
=== The `mm` (Minimum Should Match) Parameter

When processing queries, Lucene/Solr recognizes three types of clauses: mandatory, prohibited, and "optional" (also known as "should" clauses). By default, all words or phrases specified in the `q` parameter are treated as "optional" clauses unless they are preceded by a "+" or a "-". When dealing with these "optional" clauses, the `mm` parameter makes it possible to say that a certain minimum number of those clauses must match. The DisMax query parser offers great flexibility in how the minimum number can be specified.

The table below explains the various ways that `mm` values can be specified.

// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed

[cols="30,10,60",options="header"]
|===
|Syntax |Example |Description
|Positive integer |3 |Defines the minimum number of clauses that must match, regardless of how many clauses there are in total.
|Negative integer |-2 |Sets the minimum number of matching clauses to the total number of optional clauses, minus this value.
|Percentage |75% |Sets the minimum number of matching clauses to this percentage of the total number of optional clauses. The number computed from the percentage is rounded down and used as the minimum.
|Negative percentage |-25% |Indicates that this percent of the total number of optional clauses can be missing. The number computed from the percentage is rounded down, before being subtracted from the total to determine the minimum number.
|An expression beginning with a positive integer followed by a > or < sign and another value |3<90% |Defines a conditional expression indicating that if the number of optional clauses is equal to (or less than) the integer, they are all required, but if it's greater than the integer, the specification applies. In this example: if there are 1 to 3 clauses they are all required, but for 4 or more clauses only 90% are required.
|Multiple conditional expressions involving > or < signs |2<-25% 9<-3 |Defines multiple conditions, each one being valid only for numbers greater than the one before it. In the example at left, if there are 1 or 2 clauses, then both are required. If there are 3-9 clauses, all but 25% are required. If there are more than 9 clauses, all but three are required.
|===

When specifying `mm` values, keep in mind the following:

* When dealing with percentages, negative values can be used to get different behavior in edge cases. 75% and -25% mean the same thing when dealing with 4 clauses, but when dealing with 5 clauses, 75% means 3 are required while -25% means 4 are required.
* If the calculations based on the parameter arguments determine that no optional clauses are needed, the usual rules about Boolean queries still apply at search time. (That is, a Boolean query containing no required clauses must still match at least one optional clause.)
* No matter what number the calculation arrives at, Solr will never use a value greater than the number of optional clauses, or a value less than 1. In other words, no matter how low or how high the calculated result, the minimum number of required matches will never be less than 1 or greater than the number of clauses.
* When searching across multiple fields that are configured with different query analyzers, the number of optional clauses may differ between the fields. In such a case, the value specified by `mm` applies to the maximum number of optional clauses. For example, if a query clause is treated as a stopword for one of the fields, the number of optional clauses for that field will be smaller than for the other fields. A query with such a stopword clause would not return a match in that field if `mm` is set to 100%, because the removed clause does not count as matched.

The default value of `mm` is 100% (meaning that all clauses must match).

[[TheDisMaxQueryParser-Thepf_PhraseFields_Parameter]]
=== The `pf` (Phrase Fields) Parameter

Once the list of matching documents has been identified using the `fq` and `qf` parameters, the `pf` parameter can be used to "boost" the score of documents in cases where all of the terms in the `q` parameter appear in close proximity.

The format is the same as that used by the `qf` parameter: a list of fields and "boosts" to associate with each of them when making phrase queries out of the entire `q` parameter.

[[TheDisMaxQueryParser-Theps_PhraseSlop_Parameter]]
=== The `ps` (Phrase Slop) Parameter

The `ps` parameter specifies the amount of "phrase slop" to apply to queries specified with the `pf` parameter. Phrase slop is the number of positions one token needs to be moved in relation to another token in order to match a phrase specified in a query.

[[TheDisMaxQueryParser-Theqs_QueryPhraseSlop_Parameter]]
=== The `qs` (Query Phrase Slop) Parameter

The `qs` parameter specifies the amount of slop permitted on phrase queries explicitly included in the user's query string with the `qf` parameter. As explained above, slop refers to the number of positions one token needs to be moved in relation to another token in order to match a phrase specified in a query.

[[TheDisMaxQueryParser-Thetie_TieBreaker_Parameter]]
=== The `tie` (Tie Breaker) Parameter

The `tie` parameter specifies a float value (which should be something much less than 1) to use as a tiebreaker in DisMax queries.

When a term from the user's input is tested against multiple fields, more than one field may match. If so, each field will generate a different score based on how common that word is in that field (for each document relative to all other documents). The `tie` parameter lets you control how much the final score of the query will be influenced by the scores of the lower scoring fields compared to the highest scoring field.

A value of "0.0" - the default - makes the query a pure "disjunction max query": that is, only the maximum scoring subquery contributes to the final score. A value of "1.0" makes the query a pure "disjunction sum query", where it doesn't matter what the maximum scoring subquery is, because the final score will be the sum of the subquery scores. Typically a low value, such as 0.1, is useful.

[[TheDisMaxQueryParser-Thebq_BoostQuery_Parameter]]
=== The `bq` (Boost Query) Parameter

The `bq` parameter specifies an additional, optional, query clause that will be added to the user's main query to influence the score. For example, if you wanted to add a relevancy boost for recent documents:

[source,text]
----
q=cheese
bq=date:[NOW/DAY-1YEAR TO NOW/DAY]
----

You can specify multiple `bq` parameters; if you want your query to be parsed as separate clauses with separate boosts, use multiple `bq` parameters, as shown in the sketch below.
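For instance, a request with two independently boosted clauses might look like this (the category clause and the boost values here are illustrative):

[source,text]
----
q=cheese
bq=date:[NOW/DAY-1YEAR TO NOW/DAY]^2.0
bq=cat:electronics^5.0
----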
[[TheDisMaxQueryParser-Thebf_BoostFunctions_Parameter]]
=== The `bf` (Boost Functions) Parameter

The `bf` parameter specifies functions (with optional boosts) that will be used to construct FunctionQueries, which will be added to the user's main query as optional clauses that influence the score. Any function supported natively by Solr can be used, along with a boost value. For example:

[source,text]
----
recip(rord(myfield),1,2,3)^1.5
----

Specifying functions with the `bf` parameter is essentially just shorthand for using the `bq` parameter combined with the `{!func}` parser.

For example, if you want to show the most recent documents first, you could use either of the following:

[source,text]
----
bf=recip(rord(creationDate),1,1000,1000)
  ...or...
bq={!func}recip(rord(creationDate),1,1000,1000)
----

[[TheDisMaxQueryParser-ExamplesofQueriesSubmittedtotheDisMaxQueryParser]]
== Examples of Queries Submitted to the DisMax Query Parser

All of the sample URLs in this section assume you are running Solr's "```techproducts```" example:

[source,bash]
----
bin/solr -e techproducts
----

Normal results for the word "video" using the StandardRequestHandler with the default search field:

`\http://localhost:8983/solr/techproducts/select?q=video&fl=name+score`

The "dismax" handler is configured to search across the text, features, name, sku, id, manu, and cat fields, all with varying boosts designed to ensure that "better" matches appear first; specifically, documents which match on the name and cat fields get higher scores.

`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video`

Note that this instance is also configured with a default field list, which can be overridden in the URL.

`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&fl=*,score`

You can also override which fields are searched on and how much boost each field gets.

`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features\^20.0+text^0.3`

You can boost results that have a field that matches a specific value.

`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&bq=cat:electronics^5.0`

Another instance of the handler is registered under the name "instock" (selected with the `qt` parameter) and has slightly different configuration options, notably a filter for (you guessed it) `inStock:true`.

`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&fl=name,score,inStock`

`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qt=instock&fl=name,score,inStock`

One of the other really cool features in this handler is robust support for specifying the "BooleanQuery.minimumNumberShouldMatch" you want to be used based on how many terms are in your user's query. This allows flexibility for typos and partial matches. For the dismax handler, one- and two-word queries require that all of the optional clauses match, but for three- to five-word queries one missing word is allowed.

`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod`

`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+gibberish`

`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+apple`

Just like the StandardRequestHandler, it supports the `debugQuery` option for viewing the parsed query and the score explanations for each document.
`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=belkin+ipod+gibberish&debugQuery=true`

`\http://localhost:8983/solr/techproducts/select?defType=dismax&q=video+card&debugQuery=true`

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/ccbc93b8/solr/solr-ref-guide/src/the-extended-dismax-query-parser.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/the-extended-dismax-query-parser.adoc b/solr/solr-ref-guide/src/the-extended-dismax-query-parser.adoc
new file mode 100644
index 0000000..3a9481b
--- /dev/null
+++ b/solr/solr-ref-guide/src/the-extended-dismax-query-parser.adoc
@@ -0,0 +1,247 @@

= The Extended DisMax Query Parser
:page-shortname: the-extended-dismax-query-parser
:page-permalink: the-extended-dismax-query-parser.html

The Extended DisMax (eDisMax) query parser is an improved version of the <>.

In addition to supporting all the DisMax query parser parameters, Extended DisMax:

* supports the <>.
* supports Boolean operators such as AND, OR, NOT, -, and +.
* treats "and" and "or" as "AND" and "OR" in Lucene syntax mode.
* respects the 'magic field' names `\_val_` and `\_query_`. These are not real fields in the schema, but if used they trigger special behavior (a function query in the case of `\_val_` or a nested query in the case of `\_query_`). If `\_val_` is used in a term or phrase query, the value is parsed as a function.
* includes improved smart partial escaping in the case of syntax errors; fielded queries, +/-, and phrase queries are still supported in this mode.
* improves proximity boosting by using word shingles; you do not need the query to match all words in the document before proximity boosting is applied.
* includes advanced stopword handling: stopwords are not required in the mandatory part of the query but are still used in the proximity boosting part. If a query consists of all stopwords, such as "to be or not to be", then all words are required.
* includes an improved boost function: in Extended DisMax, the `boost` function is a multiplier rather than an addend, improving your boost results; the additive boost functions of DisMax (`bf` and `bq`) are also supported.
* supports pure negative nested queries: queries such as `+foo (-foo)` will match all documents.
* lets you specify which fields the end user is allowed to query, and to disallow direct fielded searches.

[[TheExtendedDisMaxQueryParser-ExtendedDisMaxParameters]]
== Extended DisMax Parameters

In addition to all the <>, Extended DisMax includes these query parameters:

[[TheExtendedDisMaxQueryParser-ThesowParameter]]
=== The `sow` Parameter

Split on whitespace: if set to `false`, whitespace-separated term sequences will be provided to text analysis in one shot, enabling proper function of analysis filters that operate over term sequences, e.g., multi-word synonyms and shingles. Defaults to `true`: text analysis is invoked separately for each individual whitespace-separated term.

[[TheExtendedDisMaxQueryParser-Themm.autoRelaxParameter]]
=== The `mm.autoRelax` Parameter

If true, the number of clauses required (<>) will automatically be relaxed if a clause is removed (by e.g. the stopwords filter) from some but not all <> fields. Use this parameter as a workaround if queries return zero hits due to uneven stopword removal between the `qf` fields.
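A minimal sketch of this situation (the query, field names, and analyzer behavior are illustrative), assuming stopwords are removed from `title` but not from `body`:

[source,text]
----
q=to be or not to be
defType=edismax
qf=title body
mm=100%
mm.autoRelax=true
----

Without `mm.autoRelax`, the clauses removed from `title` would still count toward the 100% requirement, and documents matching only in `title` could be lost.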
Note that relaxing `mm` may cause undesired side effects, hurting the precision of the search, depending on the nature of your index content.

[[TheExtendedDisMaxQueryParser-TheboostParameter]]
=== The `boost` Parameter

A multivalued list of strings parsed as queries whose scores are multiplied by the score from the main query for all matching documents. This parameter is shorthand for wrapping the query produced by eDisMax using the `BoostQParserPlugin`.

[[TheExtendedDisMaxQueryParser-ThelowercaseOperatorsParameter]]
=== The `lowercaseOperators` Parameter

A Boolean parameter indicating if lowercase "and" and "or" should be treated the same as the operators "AND" and "OR".

[[TheExtendedDisMaxQueryParser-ThepsParameter]]
=== The `ps` Parameter

The default amount of slop on phrase queries built with the `pf`, `pf2` and/or `pf3` fields (affects boosting).

[[TheExtendedDisMaxQueryParser-Thepf2Parameter]]
=== The `pf2` Parameter

A multivalued list of fields with optional weights. Similar to `pf`, but based on pairs of word shingles.

[[TheExtendedDisMaxQueryParser-Theps2Parameter]]
=== The `ps2` Parameter

This is similar to `ps` but overrides the slop factor used for `pf2`. If not specified, `ps` is used.

[[TheExtendedDisMaxQueryParser-Thepf3Parameter]]
=== The `pf3` Parameter

A multivalued list of fields with optional weights, based on triplets of word shingles. Similar to `pf`, except that instead of building a phrase per field out of all the words in the input, it builds a set of phrases for each field out of each triplet of word shingles.

[[TheExtendedDisMaxQueryParser-Theps3Parameter]]
=== The `ps3` Parameter

This is similar to `ps` but overrides the slop factor used for `pf3`. If not specified, `ps` is used.

[[TheExtendedDisMaxQueryParser-ThestopwordsParameter]]
=== The `stopwords` Parameter

A Boolean parameter indicating if the `StopFilterFactory` configured in the query analyzer should be respected when parsing the query: if it is false, then the `StopFilterFactory` in the query analyzer is ignored.

[[TheExtendedDisMaxQueryParser-TheufParameter]]
=== The `uf` Parameter

Specifies which schema fields the end user is allowed to explicitly query. This parameter supports wildcards. The default is to allow all fields, equivalent to `uf=\*`. To allow only the title field, use `uf=title`. To allow the title field and all fields ending in '_s', use `uf=title,*_s`. To allow all fields except title, use `uf=*,-title`. To disallow all fielded searches, use `uf=-*`.

[[TheExtendedDisMaxQueryParser-Fieldaliasingusingper-fieldqfoverrides]]
=== Field aliasing using per-field `qf` overrides

Per-field overrides of the `qf` parameter may be specified to provide 1-to-many aliasing from field names specified in the query string to field names used in the underlying query. By default, no aliasing is used and field names specified in the query string are treated as literal field names in the index.
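As an illustrative sketch (the `who` alias is hypothetical; `last_name` and `first_name` stand in for real indexed fields), a per-field override maps a field name used in the query string onto several indexed fields:

[source,text]
----
q=who:smith
defType=edismax
f.who.qf=last_name first_name
----

Here a query on `who` is expanded into a dismax query across `last_name` and `first_name`, even though no `who` field exists in the index.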
[[TheExtendedDisMaxQueryParser-ExamplesofQueriesSubmittedtotheExtendedDisMaxQueryParser]]
== Examples of Queries Submitted to the Extended DisMax Query Parser

All of the sample URLs in this section assume you are running Solr's "```techproducts```" example:

[source,bash]
----
bin/solr -e techproducts
----

Boost the result of the query term "hello" based on the document's popularity:

[source,text]
----
http://localhost:8983/solr/techproducts/select?defType=edismax&q=hello&pf=text&qf=text&boost=popularity
----

Search for iPods OR video:

[source,text]
----
http://localhost:8983/solr/techproducts/select?defType=edismax&q=ipod+OR+video
----

Search across multiple fields, specifying (via boosts) how important each field is relative to the others:

[source,text]
----
http://localhost:8983/solr/techproducts/select?q=video&defType=edismax&qf=features^20.0+text^0.3
----

You can boost results that have a field that matches a specific value:

[source,text]
----
http://localhost:8983/solr/techproducts/select?q=video&defType=edismax&qf=features^20.0+text^0.3&bq=cat:electronics^5.0
----

Using the `mm` parameter, one- and two-word queries require that all of the optional clauses match, but for queries with three or more clauses one missing clause is allowed:

[source,text]
----
http://localhost:8983/solr/techproducts/select?q=belkin+ipod&defType=edismax&mm=2
http://localhost:8983/solr/techproducts/select?q=belkin+ipod+gibberish&defType=edismax&mm=2
http://localhost:8983/solr/techproducts/select?q=belkin+ipod+apple&defType=edismax&mm=2
----

In the example below, we see a per-field override of the `qf` parameter being used to alias "name" in the query string to the "```last_name```" and "```first_name```" fields:

[source,text]
----
defType=edismax
q=sysadmin name:Mike
qf=title text last_name first_name
f.name.qf=last_name first_name
----

[[TheExtendedDisMaxQueryParser-Usingnegativeboost]]
== Using negative boost

Negative query boosts have been supported at the "Query" object level for a long time (resulting in negative scores for matching documents). Now the QueryParsers have been updated to handle this too.

[[TheExtendedDisMaxQueryParser-Using_slop_]]
== Using 'slop'

`Dismax` and `Edismax` can run queries against all query fields, and also run a query in the form of a phrase against the phrase fields. (This will work only for boosting documents, not actually for matching.) However, that phrase query can have a 'slop', which is the distance between the terms of the query while still considering it a phrase match. For example:

[source,text]
----
q=foo bar
qf=field1^5 field2^10
pf=field1^50 field2^20
defType=dismax
----

With these parameters, the DisMax Query Parser generates a query that looks something like this:

[source,text]
----
(+(field1:foo^5 OR field2:foo^10) AND (field1:bar^5 OR field2:bar^10))
----

But it also generates another query that will only be used for boosting results:

[source,text]
----
field1:"foo bar"^50 OR field2:"foo bar"^20
----

Thus, any document that has the terms "foo" and "bar" will match; however, documents that also contain both terms as a phrase will score much higher because they're more relevant.
If you add the parameter `ps` (phrase slop), for example `ps=10`, the second query will instead be:

[source,text]
----
field1:"foo bar"~10^50 OR field2:"foo bar"~10^20
----

This means that if the terms "foo" and "bar" appear in the document with fewer than 10 terms between them, the phrase will match. For example, this document:

[source,text]
----
*Foo* term1 term2 term3 *bar*
----

will match the phrase query.

How does one use phrase slop? Usually it is configured in the request handler (in `solrconfig.xml`).

With query slop (`qs`) the concept is similar, but it applies to explicit phrase queries from the user. For example, if you want to search for a name, you could enter:

[source,text]
----
q="Hans Anderson"
----

A document that contains "Hans Anderson" will match, but a document that contains the middle name "Christian" or where the name is written with the last name first ("Anderson, Hans") won't. For those cases one could configure the `qs` parameter, so that even if the user searches for an explicit phrase query, some slop is applied.

Finally, in addition to the phrase fields (`pf`) parameter, `edismax` also supports the `pf2` and `pf3` parameters, for fields over which to create bigram and trigram phrase queries. The phrase slop for these parameters' queries can be specified using the `ps2` and `ps3` parameters, respectively. If you use `pf2`/`pf3` but not `ps2`/`ps3`, then the phrase slop for these parameters' queries will be taken from the `ps` parameter, if any.

[[TheExtendedDisMaxQueryParser-Usingthe_magicfields__val_and_query_]]
== Using the 'magic fields' `\_val_` and `\_query_`

The Solr Query Parser's use of `\_val_` and `\_query_` differs from the Lucene Query Parser in the following ways:

* If the magic field name `\_val_` is used in a term or phrase query, the value is parsed as a function.

* It provides a hook into http://wiki.apache.org/solr/FunctionQuery[`FunctionQuery`] syntax. Quotes are necessary to encapsulate the function when it includes parentheses. For example:
+
[source,text]
----
_val_:myfield
_val_:"recip(rord(myfield),1,2,3)"
----

* The Solr Query Parser offers nested query support for any type of query parser (via QParserPlugin). Quotes are often necessary to encapsulate the nested query if it contains reserved characters. For example:
+
[source,text]
----
_query_:"{!dismax qf=myfield}how now brown cow"
----

Although not technically a syntax difference, note that if you use the Solr {solr-javadocs}/solr-core/org/apache/solr/schema/TrieDateField.html[`TrieDateField`] type, any queries on those fields (typically range queries) should use either the complete ISO 8601 date syntax that field supports, or the {solr-javadocs}/solr-core/org/apache/solr/util/DateMathParser.html[DateMath syntax] to get relative dates. For example:

[source,text]
----
timestamp:[* TO NOW]
createdate:[1976-03-06T23:59:59.999Z TO *]
createdate:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
pubdate:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
createdate:[1976-03-06T23:59:59.999Z TO 1976-03-06T23:59:59.999Z+1YEAR]
createdate:[1976-03-06T23:59:59.999Z/YEAR TO 1976-03-06T23:59:59.999Z]
----

[IMPORTANT]
====

`TO` must be uppercase, or Solr will report a 'Range Group' error.
====

http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/ccbc93b8/solr/solr-ref-guide/src/the-query-elevation-component.adoc
----------------------------------------------------------------------
diff --git a/solr/solr-ref-guide/src/the-query-elevation-component.adoc b/solr/solr-ref-guide/src/the-query-elevation-component.adoc
new file mode 100644
index 0000000..b56b718
--- /dev/null
+++ b/solr/solr-ref-guide/src/the-query-elevation-component.adoc
@@ -0,0 +1,138 @@

= The Query Elevation Component
:page-shortname: the-query-elevation-component
:page-permalink: the-query-elevation-component.html

The https://wiki.apache.org/solr/QueryElevationComponent[Query Elevation Component] lets you configure the top results for a given query regardless of the normal Lucene scoring.

This is sometimes called "sponsored search," "editorial boosting," or "best bets." This component matches the user query text to a configured map of top results. The text can be any string or non-string IDs, as long as it's indexed. Although this component will work with any QueryParser, it makes the most sense to use it with <> or <>.

The Query Elevation Component is supported by distributed searching.

All of the sample configuration and queries used in this section assume you are running Solr's "```techproducts```" example:

[source,bash]
----
bin/solr -e techproducts
----

[[TheQueryElevationComponent-ConfiguringtheQueryElevationComponent]]
== Configuring the Query Elevation Component

You can configure the Query Elevation Component in the `solrconfig.xml` file. Search components like `QueryElevationComponent` may be added to any request handler; a dedicated request handler is used here for brevity.

[source,xml]
----
<searchComponent name="elevator" class="solr.QueryElevationComponent" >
  <!-- pick a fieldType to analyze queries -->
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>
----

Optionally, in the Query Elevation Component configuration you can also specify the following to distinguish editorial results from "normal" results:

[source,xml]
----
<str name="editorialMarkerFieldName">foo</str>
----

The Query Elevation Search Component takes the following arguments:

// TODO: Change column width to %autowidth.spread when https://github.com/asciidoctor/asciidoctor-pdf/issues/599 is fixed

[cols="30,70",options="header"]
|===
|Argument |Description
|`queryFieldType` |Specifies which fieldType should be used to analyze the incoming text. For example, it may be appropriate to use a fieldType with a LowerCaseFilter.
|`config-file` |Path to the file that defines query elevation. This file must exist in `<instanceDir>/conf/<config-file>` or `<dataDir>/<config-file>`. If the file exists in the `conf/` directory, it will be loaded once at startup. If it exists in the data directory, it will be reloaded for each IndexReader.
|`forceElevation` |By default, this component respects the requested `sort` parameter: if the request asks to sort by date, it will order the results by date. If `forceElevation=true`, results will first return the boosted docs, then order by date. The default is `false`.
|===

[[TheQueryElevationComponent-elevate.xml]]
=== `elevate.xml`

Elevated query results are configured in an external XML file specified in the `config-file` argument. An `elevate.xml` file might look like this:

[source,xml]
----
<elevate>
  <query text="foo bar">
    <doc id="1" />
    <doc id="2" />
    <doc id="3" />
  </query>
  <query text="ipod">
    <doc id="MA147LL/A" />
    <doc id="IW-02" exclude="true" />
  </query>
</elevate>
----

In this example, the query "foo bar" would first return documents 1, 2 and 3, then whatever normally appears for the same query. For the query "ipod", it would first return "MA147LL/A", and would make sure that "IW-02" is not in the result set.
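With this configuration in place, a request to the dedicated handler, such as the following against the techproducts example, should return the elevated document first:

`\http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text`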
[[TheQueryElevationComponent-UsingtheQueryElevationComponent]]
== Using the Query Elevation Component

[[TheQueryElevationComponent-TheenableElevationParameter]]
=== The `enableElevation` Parameter

For debugging it may be useful to see results with and without the elevated docs. To hide the elevated docs, use `enableElevation=false`:

`\http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&enableElevation=true`

`\http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&enableElevation=false`

[[TheQueryElevationComponent-TheforceElevationParameter]]
=== The `forceElevation` Parameter

You can force elevation during runtime by adding `forceElevation=true` to the query URL:

`\http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&enableElevation=true&forceElevation=true`

[[TheQueryElevationComponent-TheexclusiveParameter]]
=== The `exclusive` Parameter

You can force Solr to return only the results specified in the elevation file by adding `exclusive=true` to the URL:

`\http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&debugQuery=true&exclusive=true`

[[TheQueryElevationComponent-DocumentTransformersandthemarkExcludesParameter]]
=== Document Transformers and the `markExcludes` Parameter

The `[elevated]` <> can be used to annotate each document with information about whether or not it was elevated:

`\http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&fl=id,[elevated]`

Likewise, it can be helpful when troubleshooting to see all matching documents, including documents that the elevation configuration would normally exclude. This is possible by using the `markExcludes=true` parameter, and then using the `[excluded]` transformer:

`\http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&markExcludes=true&fl=id,[elevated],[excluded]`

[[TheQueryElevationComponent-TheelevateIdsandexcludeIdsParameters]]
=== The `elevateIds` and `excludeIds` Parameters

When the elevation component is in use, the pre-configured list of elevations for a query can be overridden at request time to use the unique keys specified in these request parameters.

For example, in the request below documents 3007WFP and 9885A004 will be elevated, and document IW-02 will be excluded, regardless of what elevations or exclusions are configured for the query "cable" in elevate.xml:

`\http://localhost:8983/solr/techproducts/elevate?q=cable&df=text&excludeIds=IW-02&elevateIds=3007WFP,9885A004`

If either one of these parameters is specified at request time, the entire elevation configuration for the query is ignored.

For example, in the request below documents IW-02 and F8V7067-APL-KIT will be elevated, and no documents will be excluded, regardless of what elevations or exclusions are configured for the query "ipod" in elevate.xml:

`\http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&elevateIds=IW-02,F8V7067-APL-KIT`

[[TheQueryElevationComponent-ThefqParameter]]
=== The `fq` Parameter

Query elevation respects the standard filter query (`fq`) parameter. That is, if the query contains the `fq` parameter, all results will be within that filter even if `elevate.xml` adds other documents to the result set.
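For instance (the filter below is illustrative), an elevated document that does not match the filter is not returned:

`\http://localhost:8983/solr/techproducts/elevate?q=ipod&df=text&fq=inStock:true`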