Subject: Re: Indexing of multilingual labels
From: Paul Libbrecht
Date: Mon, 14 Mar 2011 14:49:26 +0100
To: java-user@lucene.apache.org
Cc: Erick Erickson
Message-Id: <4120CCC9-340D-46C9-A8C9-E48CC32C860A@hoplahup.net>

Stephane,

I think you are free to put whatever you want in the stored value of a field.
The simplest approach would be to keep the fields you want to use for display
stored and preformatted (XML-ished, OWL-ified, or JSON-ized), separate from the
indexed fields, where only the plain text matters.

Payloads seem to do much the same job as a separate stored, non-indexed field.

The best approach I have found so far is a multiplexing analyzer (which is
only called for indexed fields anyway) that recognizes the language from the
suffix of the field name.

As for the difference between one index with several fields and one field in
several indices, I think it is just a programming difference. tf and idf are
always computed at the term level, so the choice makes no difference there.

I tend to prefer multiple fields because it is easier to expand a query for,
say, Fourier, sent by a browser that says English but also accepts French and
German, into:
- a query for Fourier in the whitespace-tokenized track (always prefer that one)
- a query for fouri in the French field
- a query for fourier in the English and German fields

In my experience so far, many users appear or claim to speak many languages
(they do, a little bit).

Hope it helps.
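
The field-suffix analyzer and the query expansion described above could be sketched roughly as follows. This is a plain-Java illustration that does not call Lucene. The class name, the `label_<lang>` and `label_exact` field-naming convention, and the toy normalizers (which only lowercase and, for French, strip a trailing "er") are all assumptions made for the example; they stand in for real per-language analyzers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Illustrative sketch only: every name here is hypothetical, none of it is Lucene API. */
final class MultilingualFieldSketch {

    /** Toy stand-ins for per-language stemmers, keyed by field-name suffix. */
    static final Map<String, UnaryOperator<String>> NORMALIZERS = new LinkedHashMap<>();
    static {
        NORMALIZERS.put("en", t -> t.toLowerCase(Locale.ROOT));
        NORMALIZERS.put("de", t -> t.toLowerCase(Locale.ROOT));
        // Assumption: crude "stemming" that only mimics the fouri example.
        NORMALIZERS.put("fr", t -> stripSuffix(t.toLowerCase(Locale.ROOT), "er"));
    }

    /** Returns the language encoded in the field-name suffix, or null if there is none. */
    static String languageOf(String fieldName) {
        int sep = fieldName.lastIndexOf('_');
        if (sep < 0) {
            return null;
        }
        String suffix = fieldName.substring(sep + 1);
        return NORMALIZERS.containsKey(suffix) ? suffix : null;
    }

    /** Multiplexing step: whitespace-tokenize, then normalize according to the field's language. */
    static List<String> analyze(String fieldName, String text) {
        String lang = languageOf(fieldName);
        UnaryOperator<String> normalizer =
                lang == null ? UnaryOperator.identity() : NORMALIZERS.get(lang);
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(normalizer.apply(token));
            }
        }
        return tokens;
    }

    /** Expands one user term into field:term clauses, exact-match field first. */
    static List<String> expand(String baseField, String term, List<String> acceptedLanguages) {
        List<String> clauses = new ArrayList<>();
        clauses.add(baseField + "_exact:" + term);
        for (String lang : acceptedLanguages) {
            String field = baseField + "_" + lang;
            for (String token : analyze(field, term)) {
                clauses.add(field + ":" + token);
            }
        }
        return clauses;
    }

    static String stripSuffix(String s, String suffix) {
        return s.endsWith(suffix) ? s.substring(0, s.length() - suffix.length()) : s;
    }
}
```

With these toy normalizers, `MultilingualFieldSketch.expand("label", "Fourier", Arrays.asList("en", "fr", "de"))` produces the clauses `label_exact:Fourier`, `label_en:fourier`, `label_fr:fouri`, and `label_de:fourier`. In a real setup each clause would presumably become a term query against the corresponding field, with the exact-match clause weighted highest.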

paul

PS: not that my code is ideal, but here is what I have:
- i2geo, based on an ontology of concepts in OWL:
  http://i2geo.net/xwiki/bin/view/About/GeoSkills and
  http://svn.activemath.org/intergeo/Platform/SearchI2G/
- ActiveMath, fed by XML:
  http://www.activemath.org/javadoc/org/activemath/omdocjdom/index/package-summary.html

On 11 March 2011 at 16:35, Stephane Fellah wrote:

> Erick,
>
> I am trying to index multilingual taxonomies such as SKOS, WordNet, and
> EuroWordNet. Taxonomies are composed of concepts which have preferred and
> alternative labels in different languages. Some labels have the same lexical
> form in different languages. I want to be able to index these concepts in
> Lucene so that I can search concepts by their label in one or
> several languages. I also want to be able to display a concept's definition with
> all the alternative labels in different languages. My question is: could we
> use the payload mechanism to store the language assigned to the word (I read
> somewhere that Google was using payloads to store information such as font, for
> example, so why not language)? Wouldn't that be a better approach than using one
> field per language or one index per language?
>
> Regards
> Stephane
>
> On Fri, Mar 11, 2011 at 7:52 AM, Erick Erickson wrote:
>
>> It's not so much a matter of problems with indexing/searching
>> as it is with search behavior. The reason these strategies
>> are implemented is that using English stemming, say, on
>> other languages will produce "interesting" results.
>>
>> There's no a priori reason you can't index multiple languages
>> in the same field.
>>
>> So I don't see what you would accomplish by using payloads
>> to indicate which language the term is in. Could you expand
>> a bit on what you're trying to accomplish here? Maybe there
>> are better solutions....
>>
>> Best
>> Erick
>>
>>
>> On Thu, Mar 10, 2011 at 10:29 PM, Stephane Fellah wrote:
>>> I am trying to index in Lucene a field that could hold labels of concepts in
>>> different languages. Most of the approaches I have seen so far are:
>>>
>>> - Use a single index, where each document has one field per language it
>>>   uses, or
>>> - Use M indexes, M being the number of languages in the corpus.
>>>
>>> Lucene 2.9+ has a feature called Payload that allows attaching attributes to
>>> terms. Is anyone using this mechanism to store language (or other attributes
>>> such as datatypes) information? Does this approach work if labels are the same
>>> in different languages (does it break the inverted index)? How does performance
>>> compare to the two other approaches? Any pointer to source code showing
>>> how it is done would help.
>>>
>>> Thanks
>>>
>>> --
>>> Stephane Fellah, M.Sc, B.Sc
>>> Principal Engineer/Product Manager
>>> smartRealm LLC
>>> 201 Loudoun St. SW
>>> Leesburg, VA 20175
>>> Tel: 703 669 5514
>>> Cell: 571 502 8478
>>> Fax: 703 669 5515
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>