Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: neutral (nike.apache.org: local policy)
Message-ID: <4C9F0CFC.60102@code972.com>
Date: Sun, 26 Sep 2010 11:06:04 +0200
From: Itamar Syn-Hershko <itamar@code972.com>
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
 rv:1.9.1.11) Gecko/20100711 Lightning/1.0b1 Thunderbird/3.0.6
MIME-Version: 1.0
To: java-user@lucene.apache.org
Subject: Re: finding the analyzer for a language...
References: <35868.1285379887@parc.com>
	<AANLkTinB+5AZOkxb=s6JknThA+Ks7nbu_xiG++XQyD+e@mail.gmail.com>
	<22393.1285442995@parc.com>
	<AANLkTi=c0ydL==iU_Gk0xNQTswsY=2iSC1Khip_oNGFc@mail.gmail.com>
	<4C9E7580.5090605@code972.com>
 <AANLkTinA-3eJoMwRR=EWFVhSCEt26+_1OW8A1wLnrYQB@mail.gmail.com>
In-Reply-To: <AANLkTinA-3eJoMwRR=EWFVhSCEt26+_1OW8A1wLnrYQB@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit

Shai, I was referring to your #2, which you already indicated in your 
reply wasn't part of the discussion.

Itamar.

On 26/9/2010 10:10 AM, Shai Erera wrote:
> The mapping is simply about returning the right Analyzer for the given
> Locale. You decide up front (as the Factory developer) what Analyzer /
> Tokenizer + TokenFilters combination you want to return for each language,
> and then when that language is input, you return it. That's it.
>
> Can you define mixed content? There are two possibilities:
>
> 1) Indexing documents of different languages. In that case, you need to know
> what's the document language, and then you use IndexWriter.addDocument(doc,
> analyzer) method, instead of relying on the default analyzer you pass to
> IndexWriterConfig.
>
> 2) Indexing documents that include text in multiple languages -- this is a
> complicated case and you need auto-language identification at the Tokenizer
> level. This is not the case where a Factory would be useful.
>
> Shai
>
> On Sun, Sep 26, 2010 at 12:19 AM, Itamar Syn-Hershko<itamar@code972.com>wrote:
>
>    
>> I may be missing the point here, but how do you define an analyzer<->
>> language match? What do you do in cases of mixed content, for example?
>>
>> Itamar.
>>
>>
>> On 25/9/2010 10:27 PM, Shai Erera wrote:
>>
>>      
>>> Shai Erera brought a similar idea up before, to use Locale, but my
>>>        
>>>> concerns
>>>> are it would be limited by javas Locale mechanism... but we can figure
>>>> this
>>>> out.
>>>>
>>>>
>>>>
>>>>          
>>>   It really depends how sophisticated you want such an AnalyzerFactory
>>> (that's how I call it in my code) to be. We can
>>> define it to be a factory for predefined languages (Locale-based) for the
>>> most common use cases. If you want to
>>> have tighter control over the Analyzer you create, you can still
>>> instantiate
>>> your own, or create a new one with a custom
>>> TokenFilters chain.
>>>
>>> As long as things are well documented, I don't see a reason why we cannot
>>> start simple and only if we find out
>>> that most users don't use 'simple' and prefer to be allowed to specify
>>> more
>>> parameters (such as 'word' or 'ngram') we
>>> bring complication into the game.
>>>
>>> I'm offering Locale 'cause in most web applications that I know of, the
>>> Locale is defined on the request and is often
>>> used to parse the user's query, translating strings etc.
>>>
>>> Anyway, it'd be great to have any such Factory, be it Locale based or not,
>>> because we have so many Analyzers
>>> already, and the way things stand today, any user, even the simplest one,
>>> who wishes to support multi-lingual search
>>> has to sift through all of them and decide what combination to use for
>>> each
>>> language. And if the user ends up picking
>>> default values, then a Factory would simplify matters for him.
>>>
>>> Shai
>>>
>>> On Sat, Sep 25, 2010 at 9:29 PM, Bill Janssen<janssen@parc.com>   wrote:
>>>
>>>
>>>
>>>        
>>>> Robert Muir<rcmuir@gmail.com>   wrote:
>>>>
>>>>
>>>>
>>>>          
>>>>> On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen<janssen@parc.com>   wrote:
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>>> I thought that since I'm updating UpLib's Lucene code, I should tackle
>>>>>> the issue of document languages, as well.  Right now I'm using an
>>>>>> off-the-shelf language identifier, textcat, to figure out which
>>>>>>
>>>>>>
>>>>>>              
>>>>> language
>>>>>            
>>>>
>>>>          
>>>>> a Web page or PDF is (mainly) written in.  I then want to analyze that
>>>>>            
>>>>>> document with an appropriate analyzer.  I'd then like to map to the
>>>>>> correct Lucene analyzer for that language, falling back to
>>>>>> StandardAnalyzer if the installed Lucene library doesn't have an
>>>>>> analyzer for that language.
>>>>>>
>>>>>> It would be *very* handy if Analyzer had a static method
>>>>>>
>>>>>>   static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);
>>>>>>
>>>>>>
>>>>>>
>>>>>>              
>>>>> I agree (not sure if it should be in Analyzer itself, maybe we could
>>>>> make
>>>>>
>>>>>
>>>>>            
>>>> an
>>>>
>>>>
>>>>          
>>>>> Analyzer for this)...
>>>>>
>>>>>
>>>>>            
>>>> Not sure I followed that...  I wanted to be able to retrieve an instance
>>>> of an instantiated Analyzer class, the class that's "designed" to work
>>>> with that language, if one exists, otherwise null.  And to have you guys
>>>> keep that list up-to-date, instead of having to do it myself :-).
>>>> Seemed to me that's the standard kind of thing you make a static method
>>>> on the top-level class.
>>>>
>>>>
>>>>
>>>>          
>>>>> i mean it sounds like what you want, is for it to work in a similar way
>>>>>
>>>>>
>>>>>            
>>>> to
>>>>
>>>>
>>>>          
>>>>> ResourceBundle's fallback mechanism?
>>>>>
>>>>>
>>>>>            
>>>> I'm not sure that's appropriate.  I just want to retrieve an Analyzer
>>>> for that language, if such a thing exists.  If by "fallback", you mean
>>>> that "en-US" should just return EnglishAnalyzer if there's no analyzer
>>>> specifically for US usage -- yes, that's fine.  On the other hand, I
>>>> don't think there should be a fallback for languages which have no
>>>> macrolanguage Analyzer -- it should just return null or throw an
>>>> exception.  The programmer can then explicitly decide how do deal with
>>>> that response.
>>>>
>>>>
>>>>
>>>>          
>>>>> And I agree with your idea of rfc3066/4646, e.g. you might want to
>>>>>
>>>>>
>>>>>            
>>>> specify
>>>>
>>>>
>>>>          
>>>>> subtags like "word" (SmartChineseAnalyzer) or "ngram" (CJKAnalyzer) for
>>>>> chinese somehow?
>>>>>
>>>>>
>>>>>            
>>>> Yes, good idea.  Might be interesting to see if those kind of subtags
>>>> can be registered with IANA, too.
>>>>
>>>> Although, if one is smart enough about Lucene and one's application to
>>>> make these kinds of judgement calls, I think one is probably smart
>>>> enough to know which class to use without consulting a generic
>>>> mechanism.
>>>>
>>>>
>>>>
>>>>          
>>>>> Shai Erera brought a similar idea up before, to use Locale, but my
>>>>>
>>>>>
>>>>>            
>>>> concerns
>>>>
>>>>
>>>>          
>>>>> are it would be limited by javas Locale mechanism... but we can figure
>>>>>
>>>>>
>>>>>            
>>>> this
>>>>
>>>>
>>>>          
>>>>> out.
>>>>>
>>>>> Maybe you want to create a JIRA issue to pursue this idea further? See
>>>>> http://wiki.apache.org/lucene-java/HowToContribute
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>            
>>>>>> Right now I'm consulting a hand-compiled mapping of
>>>>>> langtag-to-Lucene-classname to figure out which Analyzer to use.
>>>>>> Wearisome, and it will be out-of-date for future releases of Lucenen
>>>>>> which will presumably support more languages.
>>>>>>
>>>>>>
>>>>>>
>>>>>>              
>>>>> yes, but it also brings up interesting backwards compatibility
>>>>>
>>>>>
>>>>>            
>>>> challenges.
>>>>
>>>>
>>>>          
>>>>> Because if we add more analyzers, say EsperantoAnalyzer, if you upgrade
>>>>> lucene then suddenly your Esperanto queries are analyzed differently
>>>>> (whereas they were dealt with by StandardAnalyzer before).
>>>>>
>>>>>
>>>>>            
>>>> Yes, presumably the Version would need to be used with this, too.
>>>>
>>>>
>>>>
>>>>          
>>>>> But this becomes less of a problem as we work on modularizing lucene, so
>>>>>
>>>>>
>>>>>            
>>>> we
>>>>
>>>>
>>>>          
>>>>> can remove Version from analyzers,
>>>>>
>>>>>
>>>>>            
>>>> Oh goody, another API change to cope with in my code.
>>>>
>>>>
>>>>
>>>>          
>>>>> and so you can just use an old analyzers
>>>>> jar file (such as 4.1) but upgrade your lucene core jar to say version
>>>>>
>>>>>
>>>>>            
>>>> 4.3.
>>>>
>>>>
>>>>          
>>>>>
>>>>>
>>>>>            
>>>>>> Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
>>>>>> to look "inside" it, and see what language it's for.  That's a problem
>>>>>> on the search side.  My QueryParser is a subclass of
>>>>>> MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
>>>>>> field "_query_language", i.e., "_query_language:de" to tell the query
>>>>>> parser to use a German analyzer on this query.  What I'd like to be
>>>>>>
>>>>>>
>>>>>>              
>>>>> able
>>>>>            
>>>>
>>>>          
>>>>> to do is interrogate the current analyzer attached to the query parser
>>>>>            
>>>>>> instance, and throw an exception if it's not for the specified
>>>>>>
>>>>>>
>>>>>>              
>>>>> language.
>>>>>            
>>>>
>>>>          
>>>>> I can do this for non-Snowball analyzers, because of the brittle
>>>>>            
>>>>>> hand-compiled mapping mentioned above.  But if it's a SnowballAnalyzer,
>>>>>> there's no way to tell what the language inside it is.  So it would be
>>>>>> nice if SnowballAnalyzer grew a method
>>>>>>
>>>>>>
>>>>>>
>>>>>>              
>>>>> SnowballAnalyzer had more problems. its actually deprecated in
>>>>> trunk/branch_3x and instead there is an Analyzer for each language
>>>>>
>>>>>
>>>>>            
>>>> (English,
>>>>
>>>>
>>>>          
>>>>> Italian, etc), which now has stopwords lists, and sometimes special
>>>>>
>>>>>
>>>>>            
>>>> behavior
>>>>
>>>>
>>>>          
>>>>> (e.g. Turkish lowercases differently).
>>>>>
>>>>> Put more simply, its an implementation detail for ItalianAnalyzer that
>>>>> we
>>>>> implement the stemming with SnowballFilter. One day we might change it
>>>>> to
>>>>> use a less aggressive stemming algorithm (e.g. ItalianLightStemFilter)
>>>>> by
>>>>> default.
>>>>>
>>>>>
>>>>>            
>>>> Ah, good.  That will suit my purposes nicely.
>>>>
>>>>
>>>>
>>>>          
>>>>> I'd really like to see the stopword work finished, so that a
>>>>>
>>>>>
>>>>>            
>>>>>> SnowballAnalyzer for a particular language has a decent set of
>>>>>> stopwords.
>>>>>>
>>>>>>
>>>>>>
>>>>>>              
>>>>> See above, I think this is finished? The remaining work is actually Solr
>>>>> integration.
>>>>>
>>>>>
>>>>>            
>>>> Excellent.  I looked at the JIRA, but some discussions just seem to
>>>> peter out, and I'm having a hard time telling what the resolution is.
>>>>
>>>>
>>>>
>>>>          
>>>>> In trunk and branch_3x, all the analyzers have their own package, here's
>>>>> Italian:
>>>>>
>>>>> Source package: contains Analyzer that uses SnowballFilter(Italian) and
>>>>> loads Italian snowball stopwords by default. It also includes an
>>>>> alternative, less aggressive stemmer.
>>>>>
>>>>>
>>>>>
>>>>>            
>>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/it/
>>>>
>>>>
>>>>          
>>>>> The snowball stopwords were all added to the resources directory. This
>>>>> is
>>>>> where ItalianAnalyzer loads its set of stopwords from:
>>>>>
>>>>>
>>>>>
>>>>>            
>>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/
>>>>
>>>>
>>>>          
>>>>> <
>>>>>
>>>>>
>>>>>            
>>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/
>>>>
>>>>
>>>>          
>>>>>
>>>>>            
>>>> I see there's also an explicit EnglishAnalyzer -- never thought it made
>>>> sense to call that StandardAnalyzer.  Great work!
>>>>
>>>> Bill
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>          
>>>
>>>        
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>      
>    

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org