From: Benson Margulies
To: java-user@lucene.apache.org
Date: Sun, 6 Apr 2014 10:37:27 -0400
Subject: Re: Confuse with Kuromoji

On Sun, Apr 6, 2014 at 10:30 AM, Herb Roitblat wrote:

> Just curious, what are some of the things that people do to properly
> tokenize the queries with mixed language collections? What do you do
> with mixed language queries?

You can force the user to tell you the language, or you can run a
language detector (these are less accurate on short strings), or you
can process the query in _all_ of the languages and OR together the
results.

> On 4/6/2014 4:51 AM, Benson Margulies wrote:
>
>> You must know what language each text is in, and use an appropriate
>> analyzer. Some people do this by using a separate field per language
>> (text_eng, text_spa, text_jpn). Other people put some extra
>> information at the beginning of the field, and then make an analyzer
>> that peeks at it in order to dispatch to the correct tokenizer.
>>
>> On Sat, Apr 5, 2014 at 9:59 PM, wrote:
>>
>>> I am pretty new to Lucene, but I have no problem understanding what
>>> it is about.
>>> My big problem is trying to understand how Kuromoji works. I need to
>>> implement search functionality that initially supports English,
>>> Spanish, and Japanese. The first two don't seem to be a big deal, as
>>> I can just use analyzers-common to index content in both languages,
>>> but Japanese has its own analyzer. I couldn't find any clues about
>>> combining analyzers, so I still don't know whether I can combine all
>>> languages under the same index (which would be ideal, as I expect
>>> mixed searches in the context of my project) or whether I have to
>>> detect the language first and then index Japanese texts separately
>>> (which would be a big disadvantage when it comes to mixed searches
>>> and future localization expansion).
>>> I found out about Lucene through Kuromoji; it would be great to find
>>> a solution so that I can use all the greatness that Lucene offers.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
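
A minimal sketch of two of the options discussed in this thread, assuming the per-language field scheme mentioned above (text_eng, text_spa, text_jpn). The class MixedLanguageQuery and its methods are hypothetical and not part of Lucene; they only build query strings in the classic query parser's field:(terms) syntax. Either the query is sent to the one field for a known or detected language, or it is sent to every language field and the clauses are ORed together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; it is not a Lucene class.
final class MixedLanguageQuery {

    // Assumed language codes, matching the field names used in the thread.
    static final List<String> LANGUAGES = Arrays.asList("eng", "spa", "jpn");

    // Maps a language code to its per-language field, e.g. "jpn" -> "text_jpn".
    static String fieldFor(String language) {
        return "text_" + language;
    }

    // Language is known or detected: search only that language's field.
    // Note: userQuery is inserted as-is; special characters are not escaped.
    static String forLanguage(String language, String userQuery) {
        return fieldFor(language) + ":(" + userQuery + ")";
    }

    // Language is unknown: search every language field and OR the clauses.
    static String forAllLanguages(String userQuery) {
        List<String> clauses = new ArrayList<>();
        for (String language : LANGUAGES) {
            clauses.add(forLanguage(language, userQuery));
        }
        return String.join(" OR ", clauses);
    }

    public static void main(String[] args) {
        // Prints: text_eng:(tokyo sushi) OR text_spa:(tokyo sushi) OR text_jpn:(tokyo sushi)
        System.out.println(forAllLanguages("tokyo sushi"));
    }
}
```

For this to behave as described, each field would need to be analyzed with its own language's analyzer at both index and query time; Lucene's PerFieldAnalyzerWrapper is one way to arrange that, though the exact setup depends on the Lucene version in use.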