jackrabbit-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Jackrabbit Wiki] Trivial Update of "Search" by JohnDorion
Date Sun, 26 Jun 2011 09:49:12 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The "Search" page has been changed by JohnDorion:
http://wiki.apache.org/jackrabbit/Search?action=diff&rev1=31&rev2=32

  == Search ==
- 
  <<TableOfContents>>
  
  == Features ==
- 
  Node names and property values are indexed as soon as the data is saved or as soon as the
transaction is committed.
  
  Text extraction is done asynchronously in a background thread. That means changed or
added text is not searchable immediately, but only after a short delay. The exact behavior can be
configured using the extractor* settings.
  
  == Search Configuration ==
- 
- The search index in Jackrabbit is pluggable and has a default implementation based on Apache
Lucene. It is configured in the file workspace.xml once the workspace is created. For new
workspaces, the configuration in the file repository.xml is used as a template. 
+ The search index in Jackrabbit is pluggable and has a default implementation based on Apache
Lucene. It is configured in the file workspace.xml once the workspace is created. For new
workspaces, the configuration in the file repository.xml is used as a template.
  
  To disable the search index, comment out the index configuration (the {{{SearchIndex}}} element) in repository.xml
and in each workspace.xml file.
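
As a rough sketch, a disabled index configuration could look like the fragment below. The handler class name {{{org.apache.jackrabbit.core.query.lucene.SearchIndex}}} is an assumption (the usual default Lucene-based handler); keep whatever class your own file already names.

```xml
<!-- Search index disabled: the whole SearchIndex element is commented out.
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
</SearchIndex>
-->
```

Because repository.xml only serves as a template for new workspaces, the same change is needed in the workspace.xml of every workspace that already exists.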
  
  This default implementation has the following options:
+ ||'''Parameter''' ||'''Default Value''' ||'''Description''' ||'''Since''' ||
+ ||path ||''none'' ||The location of the index directory. This parameter is mandatory. A
reasonable value is: {{{${wsp.home}/index}}} ||1.0 ||
+ ||useCompoundFile ||true ||Advises lucene to use compound files for the index files. ||1.0
||
+ ||minMergeDocs ||100 ||Minimum number of nodes in an index until segments are merged ||1.0
||
+ ||volatileIdleTime ||3 ||Idle time in seconds until the volatile index part is moved to
a persistent index even though minMergeDocs is not reached. ||1.0 ||
+ ||maxMergeDocs ||100000, >=1.4: 2147483647 ||Maximum number of nodes in segments that
will be merged. The default value changed in Jackrabbit 1.4 to Integer.MAX_VALUE. ||1.0 ||
+ ||mergeFactor ||10 ||Determines how often segment indices are merged. ||1.0 ||
+ ||maxFieldLength ||10000 ||The number of words that are fulltext indexed at most per property.
||1.1 ||
+ ||bufferSize ||10 ||Maximum number of documents that are held in a pending queue until added
to the index ||1.0 ||
+ ||cacheSize ||1000 ||Size of the document number cache. This cache maps uuids to lucene
document numbers ||1.0 ||
+ ||forceConsistencyCheck ||false ||Runs a consistency check on every startup. If false, a
consistency check is only performed when the search index detects a prior forced shutdown.
||1.0 ||
+ ||autoRepair ||true ||Errors detected by a consistency check are automatically repaired.
If false, errors are only written to the log. ||1.0 ||
+ ||analyzer ||{{{org.apache.lucene.analysis.standard.StandardAnalyzer}}} ||Class name of
a lucene analyzer to use for fulltext indexing of text. ||1.0 ||
+ ||queryClass ||{{{org.apache.jackrabbit.core.query.QueryImpl}}} ||Class name that implements
the {{{javax.jcr.query.Query}}} interface. This class must also extend from the class: {{{org.apache.jackrabbit.core.query.AbstractQueryImpl}}}
||1.0 ||
+ ||respectDocumentOrder ||true, >=1.5: false ||If true and the query does not contain
an 'order by' clause, result nodes will be in document order. For better performance when
queries return a lot of nodes set to 'false' (In 1.5 'false' is now the default). ||1.0 ||
+ ||textFilterClasses ||{{{org.apache.jackrabbit.core.query.lucene.TextPlainTextFilter}}}
||Sets the list of text filters (and text extractors) to use for extracting text content from
binary properties. The list must be comma (or whitespace) separated, and contain fully qualified
class names of the {{{TextFilter}}} (and since 1.3 {{{TextExtractor}}} ) classes to be used.
The configured classes must all have a public default constructor. ||1.0 ||
+ ||resultFetchSize ||2147483647 ||The number of results the query handler should initially
fetch when a query is executed. Default value: Integer.MAX_VALUE (-> all) ||1.2.1 ||
+ ||extractorPoolSize ||0, >=1.5: twice #ofAvailProcessors ||Defines the maximum number
of background threads that are used to extract text from binary properties. If set to zero
no background threads are allocated and text extractors run in the current thread. ||1.3 ||
+ ||extractorTimeout ||100 ||A text extractor is executed using a background thread if it
doesn't finish within this timeout defined in milliseconds. This parameter has no effect if
extractorPoolSize is zero. ||1.3 ||
+ ||extractorBackLogSize ||100, >=1.6: 2147483647 ||The size of the extractor pool back
log. If all threads in the pool are busy, incoming work is put into a wait queue. If the
wait queue reaches the back log size, incoming extractor work is no longer queued
but is executed in the current thread. ||1.3 ||
+ ||excerptProviderClass ||1.3: {{{org.apache.jackrabbit.core.query.lucene.DefaultXMLExcerpt}}},
>=1.4: {{{org.apache.jackrabbit.core.query.lucene.DefaultHTMLExcerpt}}} ||The name of the
class that implements {{{org.apache.jackrabbit.core.query.lucene.ExcerptProvider}}} and should
be used for the rep:excerpt() function in a query. ||1.3 ||
+ ||supportHighlighting ||false ||If set to {{{true}}} additional information is stored in
the index to support highlighting using the rep:excerpt() function. ||1.3 ||
+ ||synonymProviderClass ||''none'' ||The name of a class that implements {{{org.apache.jackrabbit.core.query.lucene.SynonymProvider}}}.
The default value is null (-> not set). ||1.4 ||
+ ||synonymProviderConfigPath ||''none'' ||The path to the synonym provider configuration
file. This path is interpreted relative to the {{{path}}} parameter. If there is a {{{FileSystem}}}
element inside the {{{SearchIndex}}} element, then this path is interpreted relative to the
root path of the {{{FileSystem}}}. Whether this parameter is mandatory depends on the synonym
provider implementation. The default value is null (-> not set). ||1.4 ||
+ ||indexingConfiguration ||''none'' ||The path to the indexing configuration file. See also
IndexingConfiguration ||1.4 ||
+ ||indexingConfigurationClass ||{{{org.apache.jackrabbit.core.query.lucene.IndexingConfigurationImpl}}}
||The name of the class that implements {{{org.apache.jackrabbit.core.query.lucene.IndexingConfiguration}}}.
See also IndexingConfiguration. ||1.4 ||
+ ||enableConsistencyCheck ||false ||If set to {{{true}}} a consistency check is performed
depending on the parameter ''forceConsistencyCheck''. If set to {{{false}}} no consistency
check is performed on startup, even if a redo log had been applied. ||1.4 ||
+ ||spellCheckerClass ||''none'' ||The name of a class that implements {{{org.apache.jackrabbit.core.query.lucene.SpellChecker}}}.
See also SpellChecker ||1.4 ||
+ ||similarityClass ||Depends on what {{{Similarity.getDefault()}}} returns ||The name of
a class that extends {{{org.apache.lucene.search.Similarity}}}. ||1.5 ||
+ ||maxVolatileIndexSize ||1048576 ||The maximum volatile index size in bytes until it is
written to disk. The default value is 1MB. ||1.6 ||
+ ||initializeHierarchyCache ||true ||With the default value of {{{true}}} the hierarchy cache
is initialized on startup and control is only given back when the initialization has completed.
When set to {{{false}}} the cache is populated during regular use. ||1.6 ||
  
+ 
- || '''Parameter''' || '''Default Value''' || '''Description''' || '''Since''' ||
- || path || ''none'' || The location of the index directory. This parameter is mandatory.
A reasonable value is: {{{${wsp.home}/index}}} || 1.0 ||
- || useCompoundFile || true || Advises lucene to use compound files for the index files.
|| 1.0 ||
- || minMergeDocs || 100 || Minimum number of nodes in an index until segments are merged
|| 1.0 ||
- || volatileIdleTime || 3 || Idle time in seconds until the volatile index part is moved
to a persistent index even though minMergeDocs is not reached. || 1.0 ||
- || maxMergeDocs || 100000, >=1.4: 2147483647 || Maximum number of nodes in segments that
will be merged. The default value changed in Jackrabbit 1.4 to Integer.MAX_VALUE. || 1.0 ||
- || mergeFactor || 10 || Determines how often segment indices are merged. || 1.0 ||
- || maxFieldLength || 10000 || The number of words that are fulltext indexed at most per
property. || 1.1 ||
- || bufferSize || 10 || Maximum number of documents that are held in a pending queue until
added to the index || 1.0 ||
- || cacheSize || 1000 || Size of the document number cache. This cache maps uuids to lucene
document numbers || 1.0 ||
- || forceConsistencyCheck || false || Runs a consistency check on every startup. If false,
a consistency check is only performed when the search index detects a prior forced shutdown.
|| 1.0 ||
- || autoRepair || true || Errors detected by a consistency check are automatically repaired.
If false, errors are only written to the log. || 1.0 ||
- || analyzer || {{{org.apache.lucene.analysis.standard.StandardAnalyzer}}} || Class name
of a lucene analyzer to use for fulltext indexing of text. || 1.0 ||
- || queryClass || {{{org.apache.jackrabbit.core.query.QueryImpl}}} || Class name that implements
the {{{javax.jcr.query.Query}}} interface. This class must also extend from the class: {{{org.apache.jackrabbit.core.query.AbstractQueryImpl}}}
|| 1.0 ||
- || respectDocumentOrder || true, >=1.5: false || If true and the query does not contain
an 'order by' clause, result nodes will be in document order. For better performance when
queries return a lot of nodes set to 'false' (In 1.5 'false' is now the default). || 1.0 ||
- || textFilterClasses || {{{org.apache.jackrabbit.core.query.lucene.TextPlainTextFilter}}}
|| Sets the list of text filters (and text extractors) to use for extracting text content
from binary properties. The list must be comma (or whitespace) separated, and contain fully
qualified class names of the {{{TextFilter}}} (and since 1.3 {{{TextExtractor}}} ) classes
to be used. The configured classes must all have a public default constructor. || 1.0 ||
- || resultFetchSize || 2147483647 || The number of results the query handler should initially
fetch when a query is executed. Default value: Integer.MAX_VALUE (-> all) || 1.2.1 ||
- || extractorPoolSize || 0, >=1.5: twice #ofAvailProcessors || Defines the maximum number
of background threads that are used to extract text from binary properties. If set to zero
no background threads are allocated and text extractors run in the current thread. || 1.3
||
- || extractorTimeout || 100 || A text extractor is executed using a background thread if
it doesn't finish within this timeout defined in milliseconds. This parameter has no effect
if extractorPoolSize is zero. || 1.3 ||
- || extractorBackLogSize || 100, >=1.6: 2147483647 || The size of the extractor pool back
log. If all threads in the pool are busy, incoming work is put into a wait queue. If the
wait queue reaches the back log size, incoming extractor work is no longer queued
but is executed in the current thread. || 1.3 ||
- || excerptProviderClass || 1.3: {{{org.apache.jackrabbit.core.query.lucene.DefaultXMLExcerpt}}},
>=1.4: {{{org.apache.jackrabbit.core.query.lucene.DefaultHTMLExcerpt}}} || The name of
the class that implements {{{org.apache.jackrabbit.core.query.lucene.ExcerptProvider}}} and
should be used for the rep:excerpt() function in a query. || 1.3 ||
- || supportHighlighting || false || If set to {{{true}}} additional information is stored
in the index to support highlighting using the rep:excerpt() function. || 1.3 ||
- || synonymProviderClass || ''none'' || The name of a class that implements {{{org.apache.jackrabbit.core.query.lucene.SynonymProvider}}}.
The default value is null (-> not set). || 1.4 ||
- || synonymProviderConfigPath || ''none'' || The path to the synonym provider configuration
file. This path is interpreted relative to the {{{path}}} parameter. If there is a {{{FileSystem}}}
element inside the {{{SearchIndex}}} element, then this path is interpreted relative to the
root path of the {{{FileSystem}}}. Whether this parameter is mandatory depends on the synonym
provider implementation. The default value is null (-> not set). || 1.4 ||
- || indexingConfiguration || ''none'' || The path to the indexing configuration file. See
also [[IndexingConfiguration]] || 1.4 ||
- || indexingConfigurationClass || {{{org.apache.jackrabbit.core.query.lucene.IndexingConfigurationImpl}}}
|| The name of the class that implements {{{org.apache.jackrabbit.core.query.lucene.IndexingConfiguration}}}.
See also [[IndexingConfiguration]]. || 1.4 ||
- || enableConsistencyCheck || false || If set to {{{true}}} a consistency check is performed
depending on the parameter ''forceConsistencyCheck''. If set to {{{false}}} no consistency
check is performed on startup, even if a redo log had been applied. || 1.4 ||
- || spellCheckerClass || ''none'' || The name of a class that implements {{{org.apache.jackrabbit.core.query.lucene.SpellChecker}}}.
See also [[SpellChecker]] || 1.4 ||
- || similarityClass || Depends on what {{{Similarity.getDefault()}}} returns || The name
of a class that extends {{{org.apache.lucene.search.Similarity}}}. || 1.5 ||
- || maxVolatileIndexSize || 1048576 || The maximum volatile index size in bytes until it
is written to disk. The default value is 1MB. || 1.6 ||
- || initializeHierarchyCache || true || With the default value of {{{true}}} the hierarchy
cache is initialized on startup and control is only given back when the initialization has
completed. When set to {{{false}}} the cache is populated during regular use. || 1.6 ||
  
  
  '''Note''': all parameters (except path) have default values and can be omitted to use the
default.
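
To illustrate the table above, here is a minimal sketch of a {{{SearchIndex}}} element as it might appear in workspace.xml. It sets the mandatory {{{path}}} and overrides a few defaults. The handler class name {{{org.apache.jackrabbit.core.query.lucene.SearchIndex}}} is an assumption (the usual default handler), and the values are illustrative, not recommendations.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <!-- Mandatory: location of the index directory -->
  <param name="path" value="${wsp.home}/index"/>
  <!-- Store extra information so rep:excerpt() can highlight matches -->
  <param name="supportHighlighting" value="true"/>
  <!-- Skip document ordering for queries without an 'order by' clause -->
  <param name="respectDocumentOrder" value="false"/>
  <!-- Text extraction: two background threads, 100 ms timeout, back log of 100 -->
  <param name="extractorPoolSize" value="2"/>
  <param name="extractorTimeout" value="100"/>
  <param name="extractorBackLogSize" value="100"/>
</SearchIndex>
```

Any parameter left out of the element falls back to the default listed in the table.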
  
- 
  == Proprietary Features ==
- 
  Jackrabbit supports some advanced features, which are not specified in JSR 170:
  
-  * Extract text from binary content: [[http://jackrabbit.apache.org/jackrabbit-text-extractors.html|TextExtractor]];
[[TextExtractorExamples]]
+  * Extract text from binary content: [[http://jackrabbit.apache.org/jackrabbit-text-extractors.html|TextExtractor]];
TextExtractorExamples
-  * Get a text excerpt with highlighted words that matched the query: [[ExcerptProvider]]
+  * Get a text excerpt with highlighted words that matched the query: ExcerptProvider
-  * Search for a term and its synonyms: [[SynonymSearch]]
+  * Search for a term and its synonyms: SynonymSearch
-  * Search for similar nodes: [[SimilaritySearch]]
+  * Search for similar nodes: SimilaritySearch
-  * Define index aggregates, rules and scores: [[IndexingConfiguration]]
+  * Define index aggregates, rules and scores: IndexingConfiguration
-  * Check spelling of a fulltext query statement: [[SpellChecker]]
+  * Check spelling of a fulltext query statement: SpellChecker
  
  == Fulltext Indexing of Chinese, Japanese and Korean ==
- 
  To index documents written in one of those languages, use the analyzer {{{org.apache.lucene.analysis.cjk.CJKAnalyzer}}}.
Due to a limitation of PDFBox, some PDF files may not be indexed at all or indexed correctly.
If this is the case, a warning message is written to the log file ("Failed to extract PDF
text content").
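
A minimal sketch of selecting this analyzer through the {{{analyzer}}} parameter follows. The handler class name {{{org.apache.jackrabbit.core.query.lucene.SearchIndex}}} is again an assumption, and the CJK analyzer class must be available on the classpath.

```xml
<SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${wsp.home}/index"/>
  <!-- Replace the default StandardAnalyzer with the CJK analyzer -->
  <param name="analyzer" value="org.apache.lucene.analysis.cjk.CJKAnalyzer"/>
</SearchIndex>
```

Content that is already indexed was analyzed with the previous analyzer, so it is probably worth rebuilding the index after changing this setting.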
  
  == Rebuilding the Index ==
+ After a power outage or after killing the process, the index may become inconsistent. To
rebuild the index, stop Jackrabbit, delete the index directories, and start Jackrabbit. The
index will automatically be re-built. There is one index directory for each workspace at {{{<repositoryHome>/<workspaceName>/index}}},
plus one index directory for the version store at {{{<repositoryHome>/repository/index}}}.
- 
- After a power outage or after killing the process, the index may become inconsistent.
- To rebuild the index, stop Jackrabbit, delete the index directories, and start Jackrabbit.
The index will automatically be re-built.
- There is one index directory for each workspace at {{{<repositoryHome>/<workspaceName>/index}}},
plus one index directory for the version store at {{{<repositoryHome>/repository/index}}}.
  
  == Analyzing Query Performance ==
- 
  To get query statements and timings, set the following log level in log4j.xml:
  
  {{{
@@ -84, +75 @@

      <level value="debug"/>
  </logger>
  }}}
- 
  == SQL-2 ==
- 
- The default query language for JCR 2.0 is SQL-2. 
+ The default query language for JCR 2.0 is SQL-2.
  
   * [[http://www.h2database.com/jcr/grammar.html|Railroad diagrams]]
-  * [[http://svn.apache.org/viewvc/jackrabbit/trunk/jackrabbit-spi-commons/src/test/resources/org/apache/jackrabbit/spi/commons/query/sql2/test.sql2.txt?view=markup|Examples
(actually test cases)]]
+  * [[http://svn.apache.org/viewvc/jackrabbit/trunk/jackrabbit-spi-commons/src/test/resources/org/apache/jackrabbit/spi/commons/query/sql2/test.sql2.txt?view=markup|Examples
(actually test cases)]]
  
  == Further Development ==
+  * ReduceMemOfSharedFieldCache
  
-  * [[ReduceMemOfSharedFieldCache]]
- 
