lucene-java-user mailing list archives

From Itamar Syn-Hershko <ita...@code972.com>
Subject Re: finding the analyzer for a language...
Date Sat, 25 Sep 2010 22:19:44 GMT
I may be missing the point here, but how do you define an analyzer <-> 
language match? What do you do in cases of mixed content, for example?

Itamar.

On 25/9/2010 10:27 PM, Shai Erera wrote:
>> Shai Erera brought a similar idea up before, to use Locale, but my concerns
>> are it would be limited by Java's Locale mechanism... but we can figure this
>> out.
>>
>   It really depends on how sophisticated you want such an AnalyzerFactory
> (that's what I call it in my code) to be. We can define it to be a factory
> for predefined languages (Locale-based) for the most common use cases. If
> you want tighter control over the Analyzer you create, you can still
> instantiate your own, or create a new one with a custom TokenFilter chain.
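>
> A rough sketch of what I mean (the factory class and method are made up, and
> I'm just using a couple of the contrib analyzers as examples):
>
>   import java.util.Locale;
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.de.GermanAnalyzer;
>   import org.apache.lucene.analysis.fr.FrenchAnalyzer;
>   import org.apache.lucene.analysis.standard.StandardAnalyzer;
>   import org.apache.lucene.util.Version;
>
>   public final class AnalyzerFactory {
>     private AnalyzerFactory() {}
>
>     /** Default Analyzer for the Locale's language, or StandardAnalyzer if we have none. */
>     public static Analyzer forLocale(Version matchVersion, Locale locale) {
>       String lang = locale.getLanguage();   // "de", "fr", "en", ...
>       if ("de".equals(lang)) return new GermanAnalyzer(matchVersion);
>       if ("fr".equals(lang)) return new FrenchAnalyzer(matchVersion);
>       // ... one case per analyzer we ship ...
>       return new StandardAnalyzer(matchVersion);
>     }
>   }
>
> Nothing clever, but it covers the common case.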
>
> As long as things are well documented, I don't see a reason why we cannot
> start simple, and only bring complication into the game if we find out that
> most users don't stick with 'simple' and prefer to specify more parameters
> (such as 'word' or 'ngram').
>
> I'm suggesting Locale because in most web applications that I know of, the
> Locale is defined on the request and is often used to parse the user's
> query, translate strings, etc.
>
> Anyway, it'd be great to have any such Factory, be it Locale based or not,
> because we have so many Analyzers
> already, and the way things stand today, any user, even the simplest one,
> who wishes to support multi-lingual search
> has to sift through all of them and decide what combination to use for each
> language. And if the user ends up picking
> default values, then a Factory would simplify matters for him.
>
> Shai
>
> On Sat, Sep 25, 2010 at 9:29 PM, Bill Janssen<janssen@parc.com>  wrote:
>
>> Robert Muir<rcmuir@gmail.com>  wrote:
>>
>>> On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen<janssen@parc.com>  wrote:
>>>
>>>> I thought that since I'm updating UpLib's Lucene code, I should tackle
>>>> the issue of document languages, as well.  Right now I'm using an
>>>> off-the-shelf language identifier, textcat, to figure out which language
>>>> a Web page or PDF is (mainly) written in.  I then want to analyze that
>>>> document with an appropriate analyzer.  I'd then like to map to the
>>>> correct Lucene analyzer for that language, falling back to
>>>> StandardAnalyzer if the installed Lucene library doesn't have an
>>>> analyzer for that language.
>>>>
>>>> It would be *very* handy if Analyzer had a static method
>>>>
>>>>   static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);
>>>>
>>> I agree (not sure if it should be in Analyzer itself, maybe we could make an
>>> Analyzer for this)...
>> Not sure I followed that...  I wanted to be able to retrieve an instance of
>> the Analyzer class that's "designed" to work with that language, if one
>> exists, otherwise null.  And to have you guys keep that list up-to-date,
>> instead of having to do it myself :-).  Seemed to me that's the standard
>> kind of thing you provide as a static method on the top-level class.
>>
>>> I mean it sounds like what you want is for it to work in a similar way to
>>> ResourceBundle's fallback mechanism?
>> I'm not sure that's appropriate.  I just want to retrieve an Analyzer
>> for that language, if such a thing exists.  If by "fallback", you mean
>> that "en-US" should just return EnglishAnalyzer if there's no analyzer
>> specifically for US usage -- yes, that's fine.  On the other hand, I
>> don't think there should be a fallback for languages which have no
>> macrolanguage Analyzer -- it should just return null or throw an
>> exception.  The programmer can then explicitly decide how to deal with
>> that response.
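>>
>> In other words, usage like this is what I'm after (the static method is of
>> course hypothetical at this point):
>>
>>   // "en-US" falls back to the macrolanguage analyzer; unknown languages give null
>>   Analyzer a = Analyzer.getAnalyzerForLanguage("en-US");   // -> EnglishAnalyzer
>>   if (a == null) {
>>       a = new StandardAnalyzer(Version.LUCENE_30);   // my app's chosen fallback
>>   }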
>>
>>> And I agree with your idea of RFC 3066/4646, e.g. you might want to specify
>>> subtags like "word" (SmartChineseAnalyzer) or "ngram" (CJKAnalyzer) for
>>> Chinese somehow?
>> Yes, good idea.  Might be interesting to see if those kinds of subtags
>> can be registered with IANA, too.
>>
>> Although, if one is smart enough about Lucene and one's application to
>> make these kinds of judgement calls, I think one is probably smart
>> enough to know which class to use without consulting a generic
>> mechanism.
>>
>>> Shai Erera brought a similar idea up before, to use Locale, but my concerns
>>> are it would be limited by Java's Locale mechanism... but we can figure this
>>> out.
>>>
>>> Maybe you want to create a JIRA issue to pursue this idea further? See
>>> http://wiki.apache.org/lucene-java/HowToContribute
>>>
>>>
>>>        
>>>> Right now I'm consulting a hand-compiled mapping of
>>>> langtag-to-Lucene-classname to figure out which Analyzer to use.
>>>> Wearisome, and it will be out-of-date for future releases of Lucene,
>>>> which will presumably support more languages.
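>>>>
>>>> (Abridged, the table is literally just something like this -- not the real
>>>> UpLib code, and the list is surely incomplete:)
>>>>
>>>>   private static final java.util.Map<String, String> LANG_TO_ANALYZER =
>>>>       new java.util.HashMap<String, String>();
>>>>   static {
>>>>       LANG_TO_ANALYZER.put("de", "org.apache.lucene.analysis.de.GermanAnalyzer");
>>>>       LANG_TO_ANALYZER.put("fr", "org.apache.lucene.analysis.fr.FrenchAnalyzer");
>>>>       LANG_TO_ANALYZER.put("ru", "org.apache.lucene.analysis.ru.RussianAnalyzer");
>>>>       // ... and so on, by hand, for every language I know Lucene ships an analyzer for ...
>>>>   }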
>>>>
>>> Yes, but it also brings up interesting backwards compatibility challenges.
>>> Because if we add more analyzers, say EsperantoAnalyzer, and you upgrade
>>> Lucene, then suddenly your Esperanto queries are analyzed differently
>>> (whereas they were handled by StandardAnalyzer before).
>> Yes, presumably the Version would need to be used with this, too.
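>> Something along the lines of (again, purely hypothetical):
>>
>>   static Analyzer getAnalyzerForLanguage(Version matchVersion,
>>                                          String rfc_4646_lang_tag);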
>>
>>> But this becomes less of a problem as we work on modularizing Lucene, so we
>>> can remove Version from analyzers,
>>
>> Oh goody, another API change to cope with in my code.
>>
>>> and so you can just use an old analyzers jar file (such as 4.1) but upgrade
>>> your Lucene core jar to, say, version 4.3.
>>>        
>>>> Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
>>>> to look "inside" it, and see what language it's for.  That's a problem
>>>> on the search side.  My QueryParser is a subclass of
>>>> MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
>>>> field "_query_language", i.e., "_query_language:de" to tell the query
>>>> parser to use a German analyzer on this query.  What I'd like to be able
>>>> to do is interrogate the current analyzer attached to the query parser
>>>> instance, and throw an exception if it's not for the specified language.
>>>> I can do this for non-Snowball analyzers, because of the brittle
>>>> hand-compiled mapping mentioned above.  But if it's a SnowballAnalyzer,
>>>> there's no way to tell what the language inside it is.  So it would be
>>>> nice if SnowballAnalyzer grew a method
>>>>
>>>>          
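>>>> (For reference, the _query_language hook mentioned above is roughly this,
>>>> heavily simplified and not the actual UpLib code:)
>>>>
>>>>   import org.apache.lucene.analysis.Analyzer;
>>>>   import org.apache.lucene.queryParser.MultiFieldQueryParser;
>>>>   import org.apache.lucene.queryParser.ParseException;
>>>>   import org.apache.lucene.search.Query;
>>>>   import org.apache.lucene.util.Version;
>>>>
>>>>   public class LangAwareQueryParser extends MultiFieldQueryParser {
>>>>       private String queryLanguage;   // e.g. "de", captured from the special field
>>>>
>>>>       public LangAwareQueryParser(Version v, String[] fields, Analyzer a) {
>>>>           super(v, fields, a);
>>>>       }
>>>>
>>>>       @Override
>>>>       protected Query getFieldQuery(String field, String queryText) throws ParseException {
>>>>           if ("_query_language".equals(field)) {
>>>>               queryLanguage = queryText;   // remember it; contributes no clause
>>>>               return null;                 // null clauses are simply dropped
>>>>           }
>>>>           return super.getFieldQuery(field, queryText);
>>>>       }
>>>>
>>>>       public String getQueryLanguage() { return queryLanguage; }
>>>>   }
>>>>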
>>> SnowballAnalyzer had more problems. It's actually deprecated in
>>> trunk/branch_3x, and instead there is an Analyzer for each language (English,
>>> Italian, etc.), which now has stopword lists and sometimes special behavior
>>> (e.g. Turkish lowercases differently).
>>>
>>> Put more simply, it's an implementation detail of ItalianAnalyzer that we
>>> implement the stemming with SnowballFilter. One day we might change it to
>>> use a less aggressive stemming algorithm (e.g. ItalianLightStemFilter) by
>>> default.
>>>        
>> Ah, good.  That will suit my purposes nicely.
>>
>>>> I'd really like to see the stopword work finished, so that a
>>>> SnowballAnalyzer for a particular language has a decent set of
>>>> stopwords.
>>>>
>>> See above, I think this is finished? The remaining work is actually Solr
>>> integration.
>>>        
>> Excellent.  I looked at the JIRA, but some discussions just seem to
>> peter out, and I'm having a hard time telling what the resolution is.
>>
>>      
>>> In trunk and branch_3x, all the analyzers have their own package, here's
>>> Italian:
>>>
>>> Source package: contains Analyzer that uses SnowballFilter(Italian) and
>>> loads Italian snowball stopwords by default. It also includes an
>>> alternative, less aggressive stemmer.
>>>
>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/it/
>>>
>>> The snowball stopwords were all added to the resources directory. This is
>>> where ItalianAnalyzer loads its set of stopwords from:
>>>
>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/
>>>
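>>> Using it is just, e.g. (with the branch_3x Version constant):
>>>
>>>   // picks up the default Italian snowball stopword set from the resources dir
>>>   Analyzer analyzer = new ItalianAnalyzer(Version.LUCENE_31);
>>>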
>> I see there's also an explicit EnglishAnalyzer -- never thought it made
>> sense to call that StandardAnalyzer.  Great work!
>>
>> Bill
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

