Message-ID: <4C9F0CFC.60102@code972.com>
Date: Sun, 26 Sep 2010 11:06:04 +0200
From: Itamar Syn-Hershko
To: java-user@lucene.apache.org
Subject: Re: finding the analyzer for a language...
References: <35868.1285379887@parc.com> <22393.1285442995@parc.com> <4C9E7580.5090605@code972.com>

Shai,

I was referring to your #2, which you already indicated in your reply
wasn't part of the discussion.

Itamar.

On 26/9/2010 10:10 AM, Shai Erera wrote:
> The mapping is simply about returning the right Analyzer for the given
> Locale. You decide up front (as the Factory developer) what Analyzer /
> Tokenizer + TokenFilters combination you want to return for each language,
> and then when that language is input, you return it. That's it.
>
> Can you define mixed content? There are two possibilities:
>
> 1) Indexing documents of different languages. In that case, you need to know
> what's the document language, and then you use IndexWriter.addDocument(doc,
> analyzer) method, instead of relying on the default analyzer you pass to
> IndexWriterConfig.
>
> 2) Indexing documents that include text in multiple languages -- this is a
> complicated case and you need auto-language identification at the Tokenizer
> level. This is not the case where a Factory would be useful.
>
> Shai
>
> On Sun, Sep 26, 2010 at 12:19 AM, Itamar Syn-Hershko wrote:
>
>> I may be missing the point here, but how do you define an analyzer <->
>> language match? What do you do in cases of mixed content, for example?
>>
>> Itamar.
>>
>> On 25/9/2010 10:27 PM, Shai Erera wrote:
>>
>>>> Shai Erera brought a similar idea up before, to use Locale, but my
>>>> concerns are it would be limited by Java's Locale mechanism... but we
>>>> can figure this out.
>>>
>>> It really depends how sophisticated you want such an AnalyzerFactory
>>> (that's how I call it in my code) to be. We can define it to be a factory
>>> for predefined languages (Locale-based) for the most common use cases.
>>> If you want to have tighter control over the Analyzer you create, you can
>>> still instantiate your own, or create a new one with a custom TokenFilters
>>> chain.
>>>
>>> As long as things are well documented, I don't see a reason why we cannot
>>> start simple and only if we find out that most users don't use 'simple'
>>> and prefer to be allowed to specify more parameters (such as 'word' or
>>> 'ngram') we bring complication into the game.
>>>
>>> I'm offering Locale 'cause in most web applications that I know of, the
>>> Locale is defined on the request and is often used to parse the user's
>>> query, translating strings etc.
>>>
>>> Anyway, it'd be great to have any such Factory, be it Locale based or not,
>>> because we have so many Analyzers already, and the way things stand today,
>>> any user, even the simplest one, who wishes to support multi-lingual
>>> search has to sift through all of them and decide what combination to use
>>> for each language. And if the user ends up picking default values, then a
>>> Factory would simplify matters for him.
>>>
>>> Shai
>>>
>>> On Sat, Sep 25, 2010 at 9:29 PM, Bill Janssen wrote:
>>>
>>>> Robert Muir wrote:
>>>>
>>>>> On Fri, Sep 24, 2010 at 9:58 PM, Bill Janssen wrote:
>>>>>
>>>>>> I thought that since I'm updating UpLib's Lucene code, I should tackle
>>>>>> the issue of document languages, as well. Right now I'm using an
>>>>>> off-the-shelf language identifier, textcat, to figure out which
>>>>>> language a Web page or PDF is (mainly) written in. I then want to
>>>>>> analyze that document with an appropriate analyzer. I'd then like to
>>>>>> map to the correct Lucene analyzer for that language, falling back to
>>>>>> StandardAnalyzer if the installed Lucene library doesn't have an
>>>>>> analyzer for that language.
>>>>>>
>>>>>> It would be *very* handy if Analyzer had a static method
>>>>>>
>>>>>>   static Analyzer getAnalyzerForLanguage(String rfc_4646_lang_tag);
>>>>>
>>>>> I agree (not sure if it should be in Analyzer itself, maybe we could
>>>>> make an Analyzer for this)...
>>>>
>>>> Not sure I followed that... I wanted to be able to retrieve an instance
>>>> of an instantiated Analyzer class, the class that's "designed" to work
>>>> with that language, if one exists, otherwise null. And to have you guys
>>>> keep that list up-to-date, instead of having to do it myself :-).
>>>> Seemed to me that's the standard kind of thing you make a static method
>>>> on the top-level class.
>>>>
>>>>> I mean it sounds like what you want is for it to work in a similar way
>>>>> to ResourceBundle's fallback mechanism?
>>>>
>>>> I'm not sure that's appropriate. I just want to retrieve an Analyzer
>>>> for that language, if such a thing exists. If by "fallback", you mean
>>>> that "en-US" should just return EnglishAnalyzer if there's no analyzer
>>>> specifically for US usage -- yes, that's fine. On the other hand, I
>>>> don't think there should be a fallback for languages which have no
>>>> macrolanguage Analyzer -- it should just return null or throw an
>>>> exception. The programmer can then explicitly decide how to deal with
>>>> that response.
>>>>
>>>>> And I agree with your idea of rfc3066/4646, e.g. you might want to
>>>>> specify subtags like "word" (SmartChineseAnalyzer) or "ngram"
>>>>> (CJKAnalyzer) for Chinese somehow?
>>>>
>>>> Yes, good idea. Might be interesting to see if those kinds of subtags
>>>> can be registered with IANA, too.
>>>>
>>>> Although, if one is smart enough about Lucene and one's application to
>>>> make these kinds of judgement calls, I think one is probably smart
>>>> enough to know which class to use without consulting a generic
>>>> mechanism.
>>>>
>>>>> Shai Erera brought a similar idea up before, to use Locale, but my
>>>>> concerns are it would be limited by Java's Locale mechanism... but we
>>>>> can figure this out.
>>>>>
>>>>> Maybe you want to create a JIRA issue to pursue this idea further? See
>>>>> http://wiki.apache.org/lucene-java/HowToContribute
>>>>>
>>>>>> Right now I'm consulting a hand-compiled mapping of
>>>>>> langtag-to-Lucene-classname to figure out which Analyzer to use.
>>>>>> Wearisome, and it will be out-of-date for future releases of Lucene
>>>>>> which will presumably support more languages.
>>>>>
>>>>> Yes, but it also brings up interesting backwards compatibility
>>>>> challenges. Because if we add more analyzers, say EsperantoAnalyzer, if
>>>>> you upgrade Lucene then suddenly your Esperanto queries are analyzed
>>>>> differently (whereas they were dealt with by StandardAnalyzer before).
>>>>
>>>> Yes, presumably the Version would need to be used with this, too.
>>>>
>>>>> But this becomes less of a problem as we work on modularizing Lucene, so
>>>>> we can remove Version from analyzers,
>>>>
>>>> Oh goody, another API change to cope with in my code.
>>>>
>>>>> and so you can just use an old analyzers jar file (such as 4.1) but
>>>>> upgrade your Lucene core jar to say version 4.3.
>>>>>
>>>>>> Secondly, if I've got an instance of a SnowballAnalyzer, there's no way
>>>>>> to look "inside" it, and see what language it's for.
>>>>>> That's a problem on the search side. My QueryParser is a subclass of
>>>>>> MultiFieldQueryParser, and it looks for a "special" FieldQuery on the
>>>>>> field "_query_language", i.e., "_query_language:de" to tell the query
>>>>>> parser to use a German analyzer on this query. What I'd like to be able
>>>>>> to do is interrogate the current analyzer attached to the query parser
>>>>>> instance, and throw an exception if it's not for the specified
>>>>>> language. I can do this for non-Snowball analyzers, because of the
>>>>>> brittle hand-compiled mapping mentioned above. But if it's a
>>>>>> SnowballAnalyzer, there's no way to tell what the language inside it
>>>>>> is. So it would be nice if SnowballAnalyzer grew a method
>>>>>
>>>>> SnowballAnalyzer had more problems. It's actually deprecated in
>>>>> trunk/branch_3x and instead there is an Analyzer for each language
>>>>> (English, Italian, etc), which now has stopword lists, and sometimes
>>>>> special behavior (e.g. Turkish lowercases differently).
>>>>>
>>>>> Put more simply, it's an implementation detail for ItalianAnalyzer that
>>>>> we implement the stemming with SnowballFilter. One day we might change
>>>>> it to use a less aggressive stemming algorithm (e.g.
>>>>> ItalianLightStemFilter) by default.
>>>>
>>>> Ah, good. That will suit my purposes nicely.
>>>>
>>>>>> I'd really like to see the stopword work finished, so that a
>>>>>> SnowballAnalyzer for a particular language has a decent set of
>>>>>> stopwords.
>>>>>
>>>>> See above, I think this is finished? The remaining work is actually Solr
>>>>> integration.
>>>>
>>>> Excellent.
>>>> I looked at the JIRA, but some discussions just seem to peter out, and
>>>> I'm having a hard time telling what the resolution is.
>>>>
>>>>> In trunk and branch_3x, all the analyzers have their own package, here's
>>>>> Italian:
>>>>>
>>>>> Source package: contains Analyzer that uses SnowballFilter(Italian) and
>>>>> loads Italian snowball stopwords by default. It also includes an
>>>>> alternative, less aggressive stemmer.
>>>>>
>>>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/it/
>>>>>
>>>>> The snowball stopwords were all added to the resources directory. This
>>>>> is where ItalianAnalyzer loads its set of stopwords from:
>>>>>
>>>>> http://svn.apache.org/viewvc/lucene/dev/trunk/modules/analysis/common/src/resources/org/apache/lucene/analysis/
>>>>
>>>> I see there's also an explicit EnglishAnalyzer -- never thought it made
>>>> sense to call that StandardAnalyzer. Great work!
>>>>
>>>> Bill
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
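
For illustration only, the lookup discussed in this thread might be sketched
as below: a hand-maintained langtag-to-classname table, where a tag such as
"en-US" falls back to "en" by dropping trailing subtags, and a language with
no entry yields null so the caller decides what to do. This is not a Lucene
API. The names AnalyzerLookup and classNameFor are hypothetical, the two
table entries are just examples, and the code returns class names as strings
so that it runs without Lucene on the classpath.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/**
 * Hypothetical sketch of a language-tag to analyzer-class-name lookup.
 * Not part of Lucene; the table contents are illustrative assumptions.
 */
final class AnalyzerLookup {

    // Hand-maintained table, keyed by lower-case primary language subtag.
    private static final Map<String, String> BY_LANGUAGE = new HashMap<>();
    static {
        BY_LANGUAGE.put("en", "org.apache.lucene.analysis.en.EnglishAnalyzer");
        BY_LANGUAGE.put("it", "org.apache.lucene.analysis.it.ItalianAnalyzer");
    }

    private AnalyzerLookup() {}

    /**
     * Returns the analyzer class name registered for the tag, trying the
     * full tag first and then dropping trailing subtags ("en-US" to "en").
     * Returns null when nothing matches; there is no default analyzer.
     */
    static String classNameFor(String langTag) {
        if (langTag == null) {
            return null;
        }
        // Normalize case and accept "en_US" style separators as well.
        String tag = langTag.trim().toLowerCase(Locale.ROOT).replace('_', '-');
        while (!tag.isEmpty()) {
            String hit = BY_LANGUAGE.get(tag);
            if (hit != null) {
                return hit;
            }
            int cut = tag.lastIndexOf('-');
            if (cut < 0) {
                break; // no more subtags to drop
            }
            tag = tag.substring(0, cut);
        }
        return null;
    }
}
```

A caller that wants the StandardAnalyzer fallback described earlier in the
thread would apply it explicitly when classNameFor returns null, rather than
having the lookup choose a default silently.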