lucene-solr-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Solr Wiki] Update of "S14ESSAddendum" by DavidSmiley
Date Wed, 16 Dec 2009 05:31:20 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.

The "S14ESSAddendum" page has been changed by DavidSmiley.
http://wiki.apache.org/solr/S14ESSAddendum

--------------------------------------------------

New page:
The book [[https://www.packtpub.com/solr-1-4-enterprise-search-server/book|Solr 1.4 Enterprise
Search Server]] aimed to cover all the features in Solr 1.4 but some features were overlooked
at the time of writing or were implemented after the book was published.  This document is
a listing of the missed content organized by the chapter it would most likely have been added
to.  There are some other known "features" in Solr that are not in the book and aren't here
because they are either internal to Solr or have dubious purpose or value.

== Chapter 2: Schema and Text Analysis ==

=== Trie based field types ===
The schema.xml used in the book examples has a schema version 1.1 instead of 1.2 which is
Solr 1.4's new default. The distinction is fairly trivial.  The bigger difference is that
Solr 1.4 defines a set of "Trie" based field types which are used in preference to the "Sortable"
based ones.  For example, there is now a `TrieIntField` using a field type named `tint` which
is to be used in preference to `SortableIntField` with a field type named `sint`.  The trie
field types have improved performance characteristics, particularly for range queries, and
they are of course sortable.  However, the "Sortable" field variants still support one thing that
the trie based fields do not: the `sortMissingLast` and `sortMissingFirst` attributes.
 There is further documentation about these field types in the [[http://svn.apache.org/viewvc/lucene/solr/tags/release-1.4.0/example/solr/conf/schema.xml?revision=834197&view=markup|Solr
1.4 example schema.xml file]].
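
As a rough illustration, the following declarations are modeled on the example schema.xml linked
above.  The attribute values are assumptions to check against that file, not recommendations.
The `sint` type is shown with `sortMissingLast` to highlight the option the trie types lack.
{{{
    <!-- Trie based integer; a larger precisionStep indexes fewer extra terms per value -->
    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8"
               omitNorms="true" positionIncrementGap="0"/>

    <!-- Older sortable variant, kept when sortMissingLast/sortMissingFirst is needed -->
    <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true"
               omitNorms="true"/>
}}}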

=== Text Analysis ===
 * !ReversedWildcardFilter
There is support for leading wildcards when using [[http://lucene.apache.org/solr/api/org/apache/solr/analysis/ReversedWildcardFilterFactory.html|ReversedWildcardFilterFactory]].
 See that link for some configuration options.  For example, using this filter allows a query
`*book` to match the text `cookbook`.  It essentially works by reversing the words as indexed
variants.  Be aware that this can more than double the number of indexed terms for the field
and thus increase the disk usage proportionally.  For a configuration snippet using this feature,
consider this sample field type definition excerpted from the unit tests:
{{{
    <fieldtype name="srev" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
      </analyzer>

      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>

      </analyzer>
    </fieldtype>
}}}
An interesting under-the-hood detail is that this filter requires the Solr query parsing code
to check for the presence of this filter to change its behavior -- something not true for
any other filter.

 * ASCIIFoldingFilter
For mapping of non-ascii characters to reasonable ASCII equivalents use `ASCIIFoldingFilterFactory`
which is best documented [[http://lucene.apache.org/java/2_9_1/api/all/org/apache/lucene/analysis/ASCIIFoldingFilter.html|here]].
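
A minimal sketch of a field type using this filter might look like the following.  The type name
`text_folded` and the choice of tokenizer are hypothetical; only the filter factory name comes
from the text above.
{{{
    <fieldType name="text_folded" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Fold accented and other non-ASCII characters to ASCII equivalents -->
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
}}}
Since no separate index and query analyzers are declared, the same folding is applied at both
index and query time, so `resume` and `résumé` should end up as the same term.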

 * !WordDelimiterFilter
There are a couple of extra options for this filter not covered in the book.  One is an option
`stemEnglishPossessive` which is either 1 to enable (the default) or 0. When enabled it strips
off trailing `'s` on words. For example "O'Neil's" becomes "O", "Neil".  Another point is
that this filter supports the same `protected` attribute that the stemmer filters do so that
you can exclude certain input tokens listed in a configuration file from word delimiter processing.
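
As a sketch, both options could be set on the filter like this.  The file name `protwords.txt`
is an assumption (it is the file the example configuration uses for the stemmers), and the other
attribute values are only illustrative.
{{{
    <!-- Keep a trailing 's and skip any token listed in protwords.txt -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            stemEnglishPossessive="0"
            protected="protwords.txt"/>
}}}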

=== Misc ===

 * copyField maxChars
The copyField directive in the schema can contain an optional `maxChars` attribute which puts
a cap on the number of characters copied. This is useful for copying potentially large text
fields into a catch-all searched field.
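
For instance, assuming hypothetical fields named `body` and `text` and an arbitrary limit, the
directive might read:
{{{
    <!-- Copy at most the first 3000 characters of body into the catch-all field -->
    <copyField source="body" dest="text" maxChars="3000"/>
}}}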

 * !ExternalFileField
There is a field type you can use called `ExternalFileField` that only works when referenced
in function queries.  As its name suggests, its data is in an external file instead of in
the index.  It is only suitable for manipulating the boost of a document without requiring
re-indexing the document.  There are some [[http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html|rudimentary
javadocs]] but you'll want to search [[http://www.lucidimagination.com/search/?q=ExternalFileField|Solr's
mailing list]] for further info.
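
The following sketch is based on the field type shown in those javadocs.  The field name
`popularity_boost` is hypothetical, and the file naming and format described in the comment
should be verified against the javadocs before relying on them.
{{{
    <!-- Values are read from a file in the index data directory named
         external_popularity_boost, one key=value line per document,
         where the key is the value of keyField. -->
    <fieldType name="file" keyField="id" defVal="1" stored="false" indexed="false"
               class="solr.ExternalFileField" valType="float"/>

    <field name="popularity_boost" type="file"/>
}}}
The field would then be referenced by name inside a function query.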

== Chapter 3: Indexing Data ==

 * Duplicate detection
In some situations you may need to prevent the same document from being added more than once.
This is called ''deduplication''.  It is not about your unique key field; it is for when
some other text field(s) should be unique, perhaps from a crawled file.  This feature is
[[Deduplication|documented on Solr's wiki]].
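
To give a feel for the configuration, here is a sketch of an update processor chain modeled on
the commented-out example in Solr 1.4's solrconfig.xml.  The field names `signature`, `url` and
`content` are hypothetical, and the chain still has to be linked to an update handler as the
wiki page describes.
{{{
    <updateRequestProcessorChain name="dedupe">
      <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <!-- Field that stores the computed signature; must exist in the schema -->
        <str name="signatureField">signature</str>
        <!-- Replace an existing document that has the same signature -->
        <bool name="overwriteDupes">true</bool>
        <!-- Fields whose content is hashed to form the signature -->
        <str name="fields">url,content</str>
        <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
}}}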

=== Automatically Committing ===
The book discusses how to explicitly commit added data. Solr can also be configured to automatically
commit.  This feature is particularly useful when updating the index with changed data as
it occurs externally. 

 * autoCommit

In solrconfig.xml there is an <updateHandler> configuration element. Within it, the
following XML is commented out in the default configuration:
{{{
    <autoCommit> 
      <maxDocs>10000</maxDocs>
      <maxTime>1000</maxTime> 
    </autoCommit>
}}}
You can specify `maxDocs` and/or `maxTime` depending on your needs.  `maxDocs` simply sets
a threshold at which a commit happens if there are this many documents not yet committed.
 Most useful is `maxTime` (milliseconds) which essentially sets a count-down timer from the
first document added after the previous commit for a commit to occur automatically.  The only
problem with using these is that they can't be disabled, which is something you might want to
do for bulk index loads.  Instead, consider `commitWithin` described below. 

 * commitWithin

When submitting documents to Solr, you can include a "commitWithin" attribute placed on the
`<add/>` XML element.  When >= 0, this tells Solr to perform a commit no later than
this number of milliseconds relative to the time Solr finishes processing the data.  Essentially
it acts as an override to solrconfig.xml / updateHandler / autoCommit / maxTime.
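
A minimal update message using this attribute could look like the following.  The document
content and the 10 second value are made up for illustration.
{{{
    <!-- Ask Solr to commit within 10000 ms of processing this message -->
    <add commitWithin="10000">
      <doc>
        <field name="id">example-1</field>
        <field name="name">An example document</field>
      </doc>
    </add>
}}}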

=== Misc ===

I'd like to simply re-emphasize that the book covered the `DataImportHandler` fairly lightly.
 For the latest documentation, [[DataImportHandler|go to Solr's wiki]].

 * !ContentStreamDataSource
One unique way to use the `DataImportHandler` is using the `ContentStreamDataSource`.  It
is like the `URLDataSource` except that instead of the DIH going out to fetch the XML, XML
can be POST'ed to the DIH from some other system (i.e. push instead of pull).  Coupled together with
the DIH's XSLT support, this is fairly powerful.  The following is a snippet of `solrconfig.xml`
and then an entire DIH configuration file, referencing this `DataSource` type and using XSL.
{{{
  <requestHandler name="/update/musicbrainz" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">dih-musicbrainz-post-config.xml</str>
      <str name="optimize">false</str>
      <str name="clean">false</str>
      <str name="command">full-import</str>
    </lst>
  </requestHandler>
}}}
{{{
 <dataConfig>
  <dataSource type="ContentStreamDataSource" />
  <document>
    <entity name="mentity"
            xsl="xslt/musicbrains2solr.xsl"
            useSolrAddSchema="true"
            processor="XPathEntityProcessor">
    </entity>
  </document>
</dataConfig>
}}}

== Chapter 5: Enhanced Searching ==

 * QParserPlugin and !LocalParams syntax and subqueries
Another modification Solr makes to Lucene's query syntax is the use of {{{{!qparser name=value
name2=value2} yourquery}}}. That is, at the very beginning of a query you can use this syntax
to indicate a different query parser (optionally) and specify some so-called "local params"
name-value pairs too (again, optionally), used for certain advanced cases.  [[SolrQuerySyntax|Solr's
wiki]] has a bit more information on this. And in addition, there is a _query_ pseudo field
hack in the query syntax to support subqueries which is useful when used with the aforementioned
QParserPlugin syntax to change the query type.  Aside from Solr's wiki, you will also find
[[http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/|this blog post by
Yonik]] enlightening.

=== Function queries ===

The main reference for function queries is [[http://wiki.apache.org/solr/FunctionQuery|here
at Solr's wiki]].  The following are the ones not covered in the book:

 * sub(x,y)
Subtracts: x - y

 * query(subquery,default)
This one is a bit tough to understand. It yields the ''score'' for this document as found
from the given sub-query, defaulting to the 2nd argument if not found in that query.  There
are some interesting examples on the wiki.

 * ms(), ms(x), ms(x,y)
The `ms` function deals with times in milliseconds since the common 1970 epoch.  Each argument
is either a reference to a date field or a literal (ex: 2000-01-01T00:00:00Z ).  Without arguments
it returns the current time.  With one argument it returns the time referenced, probably a field
reference.  With two, it returns the difference `x-y`. This function is useful for
boosting more recent documents. There is excellent information on this subject [[SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents|at
the wiki]].
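
As a sketch of the recency boost discussed on that wiki page, a dismax request handler could
carry a boost function in its defaults.  The handler name `/recent` and the fields `text` and
`mydatefield` are hypothetical; the constant `3.16e-11` is roughly one divided by the number of
milliseconds in a year.
{{{
    <requestHandler name="/recent" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="defType">dismax</str>
        <str name="qf">text</str>
        <!-- About 1 for brand new documents, about 1/2 for documents one year old -->
        <str name="bf">recip(ms(NOW,mydatefield),3.16e-11,1,1)</str>
      </lst>
    </requestHandler>
}}}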

 * Function Range Queries
Function queries can also be used for filtering searches.  Using the `frange` QParserPlugin,
you specify a numeric range applied on the given function query.  This advanced technique
is best described in
[[http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/|Yonik's
blog post]] at Lucid Imagination.
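
One possible way to apply such a filter to every request of a handler is to append it as a filter
query in solrconfig.xml.  This is only a sketch: the handler name `/ranked` and the numeric
fields `user_ranking` and `editor_ranking` are hypothetical.
{{{
    <requestHandler name="/ranked" class="solr.SearchHandler">
      <lst name="appends">
        <!-- Keep only documents whose combined ranking is between 0 and 2.2 -->
        <str name="fq">{!frange l=0 u=2.2}sum(user_ranking,editor_ranking)</str>
      </lst>
    </requestHandler>
}}}
Here `l` and `u` are the lower and upper bounds of the range.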

== Chapter 6: Search Components ==

=== Clustering Component ===

This is a Solr "contrib" module and was incorporated in the Solr 1.4 distribution around the
time of the book's release.  This component will "cluster" the search results based on statistical
similarity of terms.  It uses the [[http://project.carrot2.org|Carrot2]] open-source project
as the implementation of the underlying algorithm.  Clustering is useful for large text-heavy
indexes, especially when there is little/no structural information for faceting.

More details: [[ClusteringComponent]]

== Chapter 7: Deployment == 

 * XInclude
The `solrconfig.xml` file can be broken up into pieces and then included using the [[http://www.w3.org/TR/xinclude/|XInclude]]
spec. An example of this is the following line:
{{{
<xi:include href="solr/conf/solrconfig_master.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/>
}}}
This is particularly useful when there are multiple Solr cores that require only slightly
different configurations. The common parts could be put into a file that is included into
each config.  There is [[SolrConfigXml#XInclude|more information about this]] at Solr's wiki.

== Chapter 8: Integrating Solr ==

 * !VelocityResponseWriter

Solr incorporates a contrib module called [[VelocityResponseWriter]] (AKA Solritas).  By using
a special request handler, you can rapidly construct user web front-ends using the [[http://velocity.apache.org/|Apache
Velocity]] templating system. It isn't expected that you would build sites with this, just
proof-of-concepts.

 * AJAX-Solr forks from SolrJs
[[http://wiki.github.com/evolvingweb/ajax-solr|AJAX Solr]] is another option for browser JavaScript
integration with Solr. Unlike SolrJs (from which it derives), AJAX-Solr is not tied to jQuery
or any other JavaScript framework for that matter. 

 * Native PHP support
PHP5 now has a [[http://us3.php.net/manual/en/book.solr.php|client API]] for interacting with
Solr.

== Chapter 9: Scaling Solr ==

 * partial optimize
If the index is so large that optimizes are taking longer than desired or using more disk
space during optimization than you can spare, consider adding the `maxSegments` parameter
to the optimize command.  In the XML message, this would be an attribute; the URL form and
SolrJ have the corresponding option too.  By default this parameter is 1 since an optimize
results in a single Lucene "segment".  By setting it larger than 1 but less than the `mergeFactor`,
you permit partial optimization to no more than this many segments.  Of course the index won't
be fully optimized and therefore searches will be slower than against a fully optimized index.
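
In the XML message form, a partial optimize could be requested as follows.  The value 4 is
arbitrary and would need to be chosen relative to your `mergeFactor`.
{{{
    <!-- Merge down to at most 4 segments instead of a single one -->
    <optimize maxSegments="4"/>
}}}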
