Date: Wed, 23 Sep 2015 08:04:52 -0700
Subject: Re: Can StandardTokenizerFactory works well for Chinese and English (Bilingual)?
From: Erick Erickson
To: solr-user@lucene.apache.org

In a word, no. The CJK languages in general don't necessarily tokenize on
whitespace, so a tokenizer that splits on whitespace by default simply
won't work.

Have you tried it? It seems a simple test would get you an answer faster.

Best,
Erick

On Wed, Sep 23, 2015 at 7:41 AM, Zheng Lin Edwin Yeo wrote:

> Hi,
>
> Would like to check, will StandardTokenizerFactory work well for indexing
> both English and Chinese (Bilingual) documents, or do we need tokenizers
> that are customised for Chinese (e.g. HMMChineseTokenizerFactory)?
>
>
> Regards,
> Edwin
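
The whitespace point in the reply can be checked in a few lines. What follows is a minimal sketch in plain Python, not Solr or Lucene code: the sample sentence and the helper name `whitespace_tokens` are made up for illustration, and the sketch only shows what naive whitespace splitting does to mixed Chinese and English text. It does not reproduce the behaviour of StandardTokenizerFactory itself; for that, running the same text through the Analysis screen of the Solr admin UI against the actual field type is probably the quickest test.

```python
def whitespace_tokens(text):
    """Split text on runs of whitespace (hypothetical helper, illustration only)."""
    return text.split()


# Made-up bilingual sample: the Chinese run contains no spaces between words.
sample = "Solr 是一个搜索平台 built on Lucene"

tokens = whitespace_tokens(sample)

for token in tokens:
    print(repr(token))
```

The English words come out as separate tokens, while the whole Chinese run stays together as one token, so a query for a single word inside that run would not match it.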