lucene-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] [Moved] (LUCENE-4229) latin text analysis
Date Tue, 17 Jul 2012 07:35:35 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Uwe Schindler moved SOLR-3630 to LUCENE-4229:
---------------------------------------------

      Component/s:     (was: Schema and Analysis)
                   modules/analysis
    Lucene Fields: New
              Key: LUCENE-4229  (was: SOLR-3630)
          Project: Lucene - Java  (was: Solr)
    
> latin text analysis
> -------------------
>
>                 Key: LUCENE-4229
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4229
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Markus Klose
>            Priority: Minor
>         Attachments: SOLR-3630.patch, latin.analysis.jar, latinNumberTestData.zip, latinTestData.zip, latin_analysis.png
>
>
> Hi,
> a workmate and I played a bit with Latin text analysis and created two filters for the
> Solr trunk version.
> One filter converts Roman numerals to numbers, e.g. 'iv' -> '4', 'v' -> '5', 'vi' -> '6' ...
> The second filter is a stemmer for the most common suffixes.
> The following schema configuration could be a use case for Latin stemming.
> 	<fieldType name="text_latin" class="solr.TextField" positionIncrementGap="100">
> 		<analyzer>
> 			<tokenizer class="solr.StandardTokenizerFactory"/>
> 			<filter class="org.apache.solr.analysis.LatinNumberConvertFilterFactory" strictMode="true"/>
> 			<filter class="solr.KeywordMarkerFilterFactory" protected="latin_protwords.txt" />
> 			<filter class="org.apache.solr.analysis.LatinStemFilterFactory" />
> 		</analyzer>
> 	</fieldType>
> 	
> LatinNumberConvertFilterFactory has one property, "strictMode" (default: false). This
> boolean controls how a token's value is computed, because not every combination of
> numeral letters is a "valid" number. With strictMode="true" the output for "ic" is "ic";
> with strictMode="false" the output for "ic" is "99".
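
To illustrate the difference between the two modes, here is a minimal sketch in plain Java. It is not the code from the attached patch. The class and method names (LatinNumberSketch, convert, lenientValue, toRoman) are made up for this example, and treating "strict" as "the token must be the canonical spelling of its value" is an assumption; the description above only tells us how "ic" behaves in each mode.

```java
import java.util.Locale;

// Hypothetical sketch, not the filter from SOLR-3630.patch.
final class LatinNumberSketch {

    // Value of a single numeral letter, or -1 if the character is not one.
    static int valueOf(char c) {
        switch (c) {
            case 'i': return 1;
            case 'v': return 5;
            case 'x': return 10;
            case 'l': return 50;
            case 'c': return 100;
            case 'd': return 500;
            case 'm': return 1000;
            default:  return -1;
        }
    }

    // Lenient evaluation: a letter smaller than its right neighbour is
    // subtracted, every other letter is added. Returns -1 for non-numerals.
    static int lenientValue(String s) {
        if (s.isEmpty()) return -1;
        int total = 0;
        for (int i = 0; i < s.length(); i++) {
            int v = valueOf(s.charAt(i));
            if (v < 0) return -1;
            int next = (i + 1 < s.length()) ? valueOf(s.charAt(i + 1)) : 0;
            total += (v < next) ? -v : v;
        }
        return total;
    }

    // Canonical lowercase spelling of a positive number.
    static String toRoman(int n) {
        int[] values = {1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1};
        String[] symbols = {"m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"};
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            while (n >= values[i]) {
                sb.append(symbols[i]);
                n -= values[i];
            }
        }
        return sb.toString();
    }

    // Returns the decimal string, or the token unchanged if it is not
    // accepted as a number in the chosen mode.
    static String convert(String token, boolean strictMode) {
        String lower = token.toLowerCase(Locale.ROOT);
        int value = lenientValue(lower);
        if (value <= 0) return token;
        // Assumption: strict mode accepts only the canonical spelling.
        if (strictMode && !toRoman(value).equals(lower)) return token;
        return Integer.toString(value);
    }
}
```

With this sketch, convert("ic", true) leaves the token as "ic" because 99 is canonically spelled "xcix", while convert("ic", false) computes 100 - 1 and yields "99".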
> The LatinStemFilterFactory generates two output tokens for each input token: the first
> stemmed as a noun and the second stemmed as a verb.
> Both filters are aware of the KeywordMarkerFilterFactory.
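
The two-tokens-per-input idea can be sketched roughly as follows, again in plain Java without any Lucene classes. Everything here is an assumption made for illustration: the class name LatinStemSketch, the short suffix lists, the minimum stem length of two characters, and the choice to emit a protected word once and unchanged. The actual suffix tables and keyword handling are in the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the stemmer from SOLR-3630.patch.
final class LatinStemSketch {

    // Illustrative suffix lists only (assumption), longest suffixes first.
    static final List<String> NOUN_SUFFIXES = Arrays.asList(
            "ibus", "orum", "arum", "ae", "am", "um", "us", "is", "a", "i", "o");
    static final List<String> VERB_SUFFIXES = Arrays.asList(
            "mus", "tis", "nt", "re", "o", "s", "t");

    // Removes the first matching suffix, keeping a stem of at least two characters.
    static String strip(String token, List<String> suffixes) {
        for (String suffix : suffixes) {
            if (token.endsWith(suffix) && token.length() - suffix.length() >= 2) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Emits the noun stem followed by the verb stem. A protected word is
    // passed through once, unchanged (assumption about keyword handling).
    static List<String> stem(String token, Set<String> protectedWords) {
        List<String> out = new ArrayList<>();
        if (protectedWords.contains(token)) {
            out.add(token);
            return out;
        }
        out.add(strip(token, NOUN_SUFFIXES));
        out.add(strip(token, VERB_SUFFIXES));
        return out;
    }
}
```

For the input "amamus" and no protected words, this sketch emits "amam" (noun suffix "us" removed) followed by "ama" (verb suffix "mus" removed).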
> I have attached the svn patch for both filters. In addition, I attached two zip files that
> are needed by the filter tests (TestLatinNumberConvertFilter, TestLatinStemFilter). I am sorry
> for that, but I did not find an option to include them in the patch, if there is one.
> The image latin_analysis.png is an example of the analysis done with the configuration
> above. For this test we used the jar file latin.analysis.jar.
> Have fun with Latin text analysis.
> It would be great to get some feedback.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

