Subject: Re: Best practices for multiple languages?
From: Paul Libbrecht
In-Reply-To: <4D38045F.3060103@eolya.fr>
Date: Thu, 20 Jan 2011 22:56:20 +0100
Cc: Bill Janssen
Message-Id: <964C28DD-954B-4A5D-A7F6-3DB07F82A650@hoplahup.net>
References: <36966.1295461311@parc.com> <7C4FC074-AF5B-425B-9B69-1AE1CE818B6D@hoplahup.net> <39362.1295466994@parc.com> <43989.1295479776@parc.com> <4D38045F.3060103@eolya.fr>
To: java-user@lucene.apache.org

Isn't this approach somewhat bad for term frequency?
Words that appear in several languages would be a lot more frequent (hence less significant).

I still prefer the split-field method with proper query expansion.
This way, term frequency is evaluated on the corpus of a single language.

Dominique, in your case, at least if on the web, you have:
- the user's preferred language (if defined in a profile)
- the list of languages the browser says it accepts
And that can easily be limited to around 8, so that you cover any language the user expects to search in.

paul


On 20 Jan 2011, at 10:46, Dominique Bejean wrote:

> Hi,
>
> During a recent Solr project we needed to index documents in a lot of languages. The natural solution with Lucene and Solr is to define one field per language. Each field is configured in the schema.xml file to use language-specific processing (tokenizing, stop words, stemmer, ...).
> This is really not easy to manage if you have a lot of languages, and it means that 1) the search interface needs to know in which language you are searching and 2) the search interface can't search in all languages at the same time.
>
> So I decided that the only solution was to index all languages in only one field.
>
> Obviously, each language needs to be processed specifically. For this, I developed an analyzer that is in charge of redirecting content to the correct tokenizer, filters and stemmer according to its language. This analyzer is also used at query time. If the user specifies the language of the query, the query is processed by the appropriate tokenizer, filters and stemmer; otherwise the query is processed by a default tokenizer, filters and stemmer.
>
> With this solution:
>
> 1. I only need one field (or two if I want both stemmed and unstemmed processing)
> 2. The user can search in all documents regardless of their language
>
> I hope this helps.
>
> Dominique
> www.zoonix.fr
> www.crawl-anywhere.com
>
>
> On 20/01/11 00:29, Bill Janssen wrote:
>> Paul Libbrecht wrote:
>>
>>> I made several changes of this sort, and the precision and recall
>>> measures improved, in particular in the presence of language-indication
>>> failure, which happened to be very common in our authoring environment.
>> There are two kinds of failures: no language, or wrong language.
>>
>> For no language, I fall back to StandardAnalyzer, so I should have
>> results similar to yours. For wrong language, well, I'm using OTS
>> trigram-based language guessers, and they're pretty good these days.
>>
>>>>> Wouldn't it be better to prefer precise matches (a field that is
>>>>> analyzed with StandardAnalyzer for example) but also allow matches
>>>>> that are stemmed?
>> Yes, I think it might improve things, but again, by how much? Stemming is
>> better than no stemming, in terms of recall.
>> But this approach would also
>> improve precision.
>>
>> Bill
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
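
For illustration, the split-field method with query expansion described at the top of this message could look roughly like the sketch below. It is only a sketch under assumptions, not code from the thread: the per-language field names (text_fr, text_en, ...), the class and method names, and the cap of 8 languages as a constant are all hypothetical. It collects the profile language and the browser's Accept-Language entries, then expands one term into one clause per language field, so that each clause is scored against the statistics of a single language. For simplicity it keeps the header's own order instead of sorting by q-value, and it does not escape query-syntax characters in the term.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class SplitFieldQueryExpansion {

    // Hypothetical cap, following the "around 8" suggested in the message.
    static final int MAX_LANGUAGES = 8;

    /** Profile language first, then the Accept-Language entries, de-duplicated and capped. */
    static List<String> candidateLanguages(String profileLanguage, String acceptLanguageHeader) {
        Set<String> languages = new LinkedHashSet<>();
        if (profileLanguage != null && !profileLanguage.trim().isEmpty()) {
            languages.add(primarySubtag(profileLanguage.trim()));
        }
        if (acceptLanguageHeader != null) {
            for (String entry : acceptLanguageHeader.split(",")) {
                // Drop any quality value such as ";q=0.8" (header order is kept as-is).
                int semicolon = entry.indexOf(';');
                String tag = (semicolon < 0 ? entry : entry.substring(0, semicolon)).trim();
                if (!tag.isEmpty() && !tag.equals("*")) {
                    languages.add(primarySubtag(tag));
                }
            }
        }
        List<String> all = new ArrayList<>(languages);
        return new ArrayList<>(all.subList(0, Math.min(MAX_LANGUAGES, all.size())));
    }

    /** Reduces a tag such as "fr-FR" to its primary language subtag "fr". */
    static String primarySubtag(String tag) {
        int dash = tag.indexOf('-');
        return (dash < 0 ? tag : tag.substring(0, dash)).toLowerCase(Locale.ROOT);
    }

    /** Builds one clause per (hypothetical) language field, joined with OR. */
    static String expand(String term, List<String> languages) {
        List<String> clauses = new ArrayList<>();
        for (String language : languages) {
            clauses.add("text_" + language + ":" + term);
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        List<String> languages = candidateLanguages("fr", "fr-FR,fr;q=0.9,en;q=0.8,de;q=0.7");
        // Prints: text_fr:information OR text_en:information OR text_de:information
        System.out.println(expand("information", languages));
    }
}
```

The resulting string would then go through whatever query parser is in use, with each text_xx field configured with its own language-specific analyzer.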