From: Apache Wiki
To: Apache Wiki
Date: Mon, 28 May 2012 11:39:53 -0000
Message-ID: <20120528113953.34853.12546@eos.apache.org>
Subject: [Solr Wiki] Trivial Update of "LanguageAnalysis" by iorixxx
Auto-Submitted: auto-generated

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Solr Wiki" for change notification.
The "LanguageAnalysis" page has been changed by iorixxx:
http://wiki.apache.org/solr/LanguageAnalysis?action=diff&rev1=24&rev2=25

Comment:
Turkish stopwords URL was corrected

= Language Analysis =
-
This page describes some of the language-specific analysis components available in Solr. These components can be used to improve search results for specific languages.

- Please look at [[AnalyzersTokenizersTokenFilters|AnalyzersTokenizersTokenFilters]] for other analysis components you can use in combination with these components.
+ Please look at AnalyzersTokenizersTokenFilters for other analysis components you can use in combination with these components.

- NOTE: This page is mostly '''obsolete'''. The [[http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr/example/solr/conf/schema.xml|Solr Example]] now contains
+ NOTE: This page is mostly '''obsolete'''. The [[http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr/example/solr/conf/schema.xml|Solr Example]] now contains configurations for various languages as fieldTypes (text_XX). This is synchronized with the support from Lucene.
- configurations for various languages as fieldTypes (text_XX). This is synchronized with the support from Lucene.

<>

== By language ==
-
=== Arabic ===
Solr provides support for the [[http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf|Light-10]] stemming algorithm, and Lucene includes an example stopword list.

@@ -24, +21 @@
...
}}}
-
Example set of Arabic [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ar/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Armenian ===
@@ -38, +34 @@
...
}}}
-
Example set of Armenian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hy/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Basque ===
@@ -52, +47 @@
...
}}}
-
Example set of Basque [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/eu/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Brazilian Portuguese ===
@@ -62, +56 @@
...
- ...
+ ...
}}}
-
Example set of Brazilian Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/analysis/common/src/resources/org/apache/lucene/analysis/br/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Bulgarian ===
@@ -78, +71 @@
...
}}}
-
Example set of Bulgarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/bg/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Catalan ===
@@ -92, +84 @@
...
}}}
-
Example set of Catalan [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ca/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Chinese, Japanese, Korean ===
@@ -102, +93 @@
...
}}}
-
[[Solr3.1]] Alternatively, for Simplified Chinese, Solr provides support for Chinese word segmentation with {{{solr.SmartChineseWordTokenFilterFactory}}} in the analysis-extras contrib module. This component includes a large dictionary and segments Chinese text into words with a Hidden Markov Model.
To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib

To use the default setup, with fallback to the English Porter stemmer for English words, use:
+
{{{
}}}
-
Or to configure your own analysis setup, use the SmartChineseSentenceTokenizerFactory along with your custom filter setup. The sentence tokenizer tokenizes on sentence boundaries, and the SmartChineseWordTokenFilter breaks this further up into words.
+
{{{
@@ -119, +110 @@
}}}
-
- Note: Be sure to use [[AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory|PositionFilter]] at query-time (only) as these languages do not use spaces between words.
+ Note: Be sure to use [[AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory|PositionFilter]] at query-time (only) as these languages do not use spaces between words.

=== Czech ===
[[Solr3.1]]

@@ -133, +123 @@
...
}}}
-
Example set of Czech [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/cz/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Danish ===
@@ -145, +134 @@
...
}}}
-
Example set of Danish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/danish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

Note: See also {{{Decompounding}}} below.
@@ -159, +147 @@
...
}}}
-
Note: The standard {{{PorterStemFilterFactory}}} is significantly faster than {{{solr.SnowballPorterFilterFactory}}}.

- Larger example set English
- [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]
+ Larger example set of English [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/english_stop.txt|stopwords]]

=== Finnish ===
Solr includes two stemmers for Finnish: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer [[Solr3.1]] via {{{solr.FinnishLightStemFilterFactory}}}. Lucene includes an example stopword list.
@@ -190, +175 @@
...
}}}
-
Example set of Finnish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/finnish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

Note: See also {{{Decompounding}}} below.
@@ -208, +192 @@
...
}}}
-
Example set of French [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/french_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

Note: It's probably best to use the ElisionFilter before WordDelimiterFilter. This will prevent very slow phrase queries.
@@ -224, +207 @@
...
}}}
-
Example set of Galician [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/gl/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== German ===
@@ -240, +222 @@
...
}}}
-
Example set of German [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/german_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

Note: See also {{{Decompounding}}} below.
@@ -254, +235 @@
...
}}}
-
Example set of Greek [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/el/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

Note: Be sure to use the Greek-specific GreekLowerCaseFilterFactory.
+
=== Hebrew ===
-
{{{
...
...
}}}
Example set of Hebrew [[http://wiki.korotkin.co.il/Hebrew_stopwords|stopwords]] (Be sure to switch your browser encoding to UTF-8)
+
=== Hindi ===
[[Solr3.1]]

@@ -278, +259 @@
...
}}}
-
Example set of Hindi [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/hi/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Hungarian ===
-
Solr includes two stemmers for Hungarian: one via {{{solr.SnowballPorterFilterFactory}}}, and an alternative stemmer [[Solr3.1]] via {{{solr.HungarianLightStemFilterFactory}}}. Lucene includes an example stopword list.

{{{
@@ -291, +270 @@
...
}}}
-
Example set of Hungarian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/hungarian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

Note: See also {{{Decompounding}}} below.
@@ -309, +287 @@
...
}}}
-
Example set of Indonesian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/id/stopwords.txt|stopwords]]

=== Italian ===
@@ -321, +298 @@
...
}}}
-
Example set of Italian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/italian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Lao, Myanmar, Khmer ===
@@ -329, +305 @@

Lucene provides support for segmenting these languages into syllables with {{{solr.ICUTokenizerFactory}}} in the analysis-extras contrib module. To use this tokenizer, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib

- Note: Be sure to use PositionFilter at query-time (only) as these languages do not use spaces between words.
+ Note: Be sure to use PositionFilter at query-time (only) as these languages do not use spaces between words.

=== Norwegian ===
Solr includes support for stemming Norwegian via {{{solr.SnowballPorterFilterFactory}}}, and Lucene includes an example stopword list. Since [[Solr3.6]] you can also use {{{solr.NorwegianLightStemFilterFactory}}} for a lighter variant, or {{{solr.NorwegianMinimalStemFilterFactory}}}, which attempts to normalize plural endings only. These two are simple rule-based stemmers that do not handle exceptions or irregular forms.
@@ -340, +316 @@
...
}}}
-
Example set of Norwegian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/norwegian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

Note: See also {{{Decompounding}}} below.
@@ -354, +329 @@
...
}}}
-
Example set of Persian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/fa/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

Note: WordDelimiterFilter does not split on joiners by default.
You can solve this by using ArabicLetterTokenizerFactory, which does, or by using a custom WordDelimiterFilterFactory that supplies a customized charTypeTable to WordDelimiterFilter. In either case, consider using PositionFilter at query-time (only), as the QueryParser does not consider joiners and could create unwanted phrase queries.
@@ -362, +336 @@
=== Polish ===
[[Solr3.1]]

+ Lucene provides support for Polish stemming via {{{solr.StempelPolishStemFilterFactory}}} in the analysis-extras contrib module. This component includes an algorithmic stemmer with tables for Polish. To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib
- Lucene provides support for Polish stemming {{{solr.StempelPolishStemFilterFactory}}} in the analysis-extras contrib module. This component includes an algorithmic stemmer with tables for Polish.
- To use this filter, see solr/contrib/analysis-extras/README.txt for instructions on which jars you need to add to your SOLR_HOME/lib

{{{
...
@@ -371, +344 @@
...
}}}
-
Example set of Polish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/stempel/src/resources/org/apache/lucene/analysis/pl/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Portuguese ===
@@ -383, +355 @@
...
}}}
-
Example set of Portuguese [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/portuguese_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Romanian ===
@@ -395, +366 @@
...
}}}
-
Example set of Romanian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/ro/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Russian ===
@@ -407, +377 @@
...
}}}
-
Example set of Russian [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/russian_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Spanish ===
@@ -419, +388 @@
...
}}}
-
Example set of Spanish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/spanish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

=== Swedish ===
@@ -431, +399 @@
...
}}}
-
Example set of Swedish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/snowball/swedish_stop.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

Note: See also {{{Decompounding}}} below.
@@ -444, +411 @@
...
}}}
-
Note: Be sure to use PositionFilter at query-time (only) as this language does not use spaces between words.

=== Turkish ===
@@ -456, +422 @@
...
}}}
-
- Example set of Turkish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/tr/stopwords.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)
+ Example set of Turkish [[http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/conf/lang/stopwords_tr.txt|stopwords]] (Be sure to switch your browser encoding to UTF-8)

Note: Be sure to use the Turkish-specific TurkishLowerCaseFilterFactory [[Solr3.1]]

== My language is not listed!!! ==
-
Your language might work anyway. A first step is to start with the "textgen" type in the example schema. Remember, things like stemming and stopwords aren't mandatory for search to work; they are optional language-specific improvements.

If you have problems (your language is highly inflectional, etc.), you might want to try an n-gram approach as an alternative.
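As a rough illustration of the n-gram alternative, a fieldType along these lines could be tried. This is a sketch, not part of the example schema: the type name, the analyzer chain, and the gram sizes are all illustrative assumptions to tune for your language.

{{{
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- gram sizes are a guess; shorter grams match more loosely but index more terms -->
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="4"/>
  </analyzer>
</fieldType>
}}}

Using the same chain at both index and query time keeps the grams comparable on both sides.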

== Other Tips ==
=== Tokenization ===
-
In general, most languages don't require special tokenization (and will work just fine with Whitespace + WordDelimiterFilter), so you can safely tailor the English "text" example schema definition to fit.

=== Ignoring Case ===
-
- In most cases LowerCaseFilterFactory is sufficient.
- However, some languages have special casing properties, and these have their own filters:
+ In most cases LowerCaseFilterFactory is sufficient. However, some languages have special casing properties, and these have their own filters:

 * TurkishLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for the Turkish language. It includes special handling for [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|dotted and dotless I]].
 * GreekLowerCaseFilterFactory: Use this instead of LowerCaseFilterFactory for the Greek language. It removes Greek diacritics and has special handling for the Greek final sigma.

=== Ignoring Diacritics ===
-
Some languages use diacritics, but people are not always consistent about typing them in queries or documents.

If you are using a stemmer, most stemmers (especially Snowball stemmers) are a bit forgiving about diacritics, and these are handled on a language-specific basis.
@@ -493, +453 @@
For other languages, the ASCIIFoldingFilterFactory won't do the foldings that you need. One solution is to use {{{solr.analysis.ICUFoldingFilterFactory}}} [[Solr3.1]], which implements a [[http://unicode.org/reports/tr30/tr30-4.html|similar idea]] across all of Unicode.

=== Stopwords ===
-
Stopwords affect Solr in three ways: relevance, performance, and resource utilization.

From a relevance perspective, these extremely high-frequency terms tend to throw off the scoring algorithm, and you won't get very good results if you leave them in.
At the same time, if you remove them, you can return bad results when a stopword is actually important.
@@ -505, +464 @@
One tradeoff you can make if you have the disk space: you can use CommonGramsFilter/CommonGramsQueryFilter instead of StopFilter. This solves the relevance and performance problems, at the expense of even more resource utilization, because it forms bigrams of stopwords with their adjacent words.

=== Stemming ===
-
Stemming can help improve relevance, but it can also hurt.

There is no general rule for whether or not to stem: it depends not only on the language, but also on the properties of your documents and queries.

- Lucene/Solr provides different stemmers, and for some languages you may have multiple choices. Some are algorithm-based, others are dictionary-based.
+ Lucene/Solr provides different stemmers, and for some languages you may have multiple choices. Some are algorithm-based, others are dictionary-based.

The Snowball stemmers rely on algorithms and are considered fairly aggressive, but for many languages (see above) Solr provides alternatives that are less aggressive. In many situations a lighter approach yields better relevance: often "less is more". The light stemmers typically target the most common noun/adjective inflections, and perhaps a few derivational suffixes. The minimal stemmers are even more conservative and may only remove plural endings. The new Hunspell stemmers are both dictionary- and rule-based and may provide tighter stemming than Snowball for some languages.

@@ -522, +480 @@
[[Solr3.5]] The Hunspell stemmers are configured through the HunspellStemFilterFactory combined with a dictionary and an affix file. Hunspell supports 99 languages.

==== Notes about solr.PorterStemFilterFactory ====
-
Porter stemmer for the English language.
Standard Lucene implementation of the [[http://tartarus.org/~martin/PorterStemmer/|Porter Stemming Algorithm]], a normalization process that removes common endings from words.

- Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
+ . Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".

+ Note: This differs very slightly from the "Porter" algorithm available in `solr.SnowballPorterFilter`, in that it deviates slightly from the published algorithm. For more details, see the section "Points of difference from the published algorithm" described [[http://tartarus.org/~martin/PorterStemmer/|here]].
- Note: This differs very slightly from the "Porter" algorithm available in `solr.SnowballPorterFilter`, in that it deviates slightly from the published algorithm.
- For more details, see the section "Points of difference from the published algorithm" described [[http://tartarus.org/~martin/PorterStemmer/|here]].

Porter is approximately twice as fast as SnowballPorterFilterFactory.

@@ -540, +496 @@
KStem is considerably faster than SnowballPorterFilterFactory.

<>
+
==== Notes about solr.SnowballPorterFilterFactory ====
-
Creates `org.apache.lucene.analysis.SnowballPorterFilter`.

Creates an [[http://snowball.tartarus.org/texts/stemmersoverview.html|Snowball stemmer]] from the Java classes generated from a [[http://snowball.tartarus.org/|Snowball]] specification. The language attribute is used to specify the language of the stemmer.
+
{{{
@@ -553, +510 @@
}}}
-
Valid values for the language attribute (creates the Snowball stemmer class language + "Stemmer"):
+
 * [[http://snowball.tartarus.org/algorithms/armenian/stemmer.html|Armenian]] [[Lucene3.1]]
 * [[http://snowball.tartarus.org/algorithms/basque/stemmer.html|Basque]] [[Lucene3.1]]
 * [[http://snowball.tartarus.org/algorithms/catalan/stemmer.html|Catalan]] [[Lucene3.1]]
@@ -579, +536 @@
 * [[http://snowball.tartarus.org/algorithms/turkish/stemmer.html|Turkish]]

Gotchas:
+
 * Although the Lovins stemmer is described as faster than Porter/Porter2, in practice it is much slower in Solr, as it is implemented using reflection.
 * Neither the Lovins nor the Finnish stemmer produces correct output (as of Solr 1.4), due to a [[http://article.gmane.org/gmane.comp.search.snowball/1139|known bug in Snowball]].
 * The Turkish stemmer requires special lowercasing. You should use TurkishLowerCaseFilter instead of LowerCaseFilter with this language. See [[http://en.wikipedia.org/wiki/Dotted_and_dotless_I|background information]].
 * The stemmers are sensitive to diacritics. Think carefully before removing these with something like `ASCIIFoldingFilterFactory` before stemming, as this could cause unwanted results. For example, `résumé` will not be stemmed by the Porter stemmer, but `resume` will be stemmed to `resum`, causing it to match `resumed`, `resuming`, etc. The differences can be more profound for non-English stemmers.

<>
+
=== Customizing Stemming ===
-
Sometimes a stemmer might not do what you want out of the box. For example, you might be happy with the results on average, but have a few particular cases (such as product names or similar) where it annoys you or actually hurts your search results.

The components below allow you to fine-tune the stemming process by preventing words from being stemmed at all, or by overriding the stemming algorithm with custom mappings.
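As a rough sketch of the first approach, a {{{solr.KeywordMarkerFilterFactory}}} placed before the stemmer protects listed words from stemming. The fieldType name below is invented; the protwords.txt filename follows the stock example-schema convention.

{{{
<fieldType name="text_protected" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- tokens listed in protwords.txt are marked as keywords and skipped by the stemmer -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
}}}

The same idea works with any of the stemmers above; the marker filter just has to come earlier in the chain than the stemming filter.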
@@ -609, +567 @@
}}}
-
==== solr.StemmerOverrideFilterFactory ====
[[Solr3.1]]

@@ -628, +585 @@
}}}
-
<>
+
=== Decompounding ===
-
Decompounding can improve search results for some languages. At the same time, it can increase the time it takes to index and search, as well as increase the index size itself.

Solr provides dictionary-based decompounding support via solr.DictionaryCompoundWordTokenFilterFactory. This factory allows you to provide a dictionary, along with some settings (min/max subword size, etc.), to break compound words into pieces.
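A minimal sketch of such a configuration follows; the fieldType name and dictionary filename are placeholders, and you must supply your own list of subwords (one per line).

{{{
<fieldType name="text_decompound" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- dictionary.txt is a placeholder: a plain-text list of known subwords -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary.txt"
            minWordSize="5"
            minSubwordSize="2"
            maxSubwordSize="15"
            onlyLongestMatch="false"/>
  </analyzer>
</fieldType>
}}}

The filter keeps the original compound token and adds any dictionary subwords it finds inside it at the same position, so both the compound and its parts become searchable.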