Subject: Re: FW: Difference Between Tokenizer and filter
To: solr-user@lucene.apache.org
From: Shawn Heisey
Date: Wed, 2 Mar 2016 10:34:05 -0700
Message-ID: <56D7240D.4020502@elyograg.org>
In-Reply-To: <8B9BE879D2A8964E896F0448525CAAEE018D2621@PRD-MSG-EXMB-9.ceb.com>

On 3/2/2016 9:55 AM, G, Rajesh wrote:
> Thanks for your email Koji.
> Can you please explain what is the role of tokenizer and filter so I can understand why I should not have two tokenizer in index and I should have at least one tokenizer in query?

You can't have two tokenizers. It's not allowed.

The only notable difference between a Tokenizer and a Filter is that a Tokenizer operates on an input that's a single string, turning it into a token stream, and a Filter uses a token stream for both input and output. A CharFilter uses a single string as both input and output.

An analysis chain in the Solr schema (whether it's index or query) is composed of zero or more CharFilter entries, exactly one Tokenizer entry, and zero or more Filter entries. Alternately, you can specify an Analyzer class, which is a lot like a Tokenizer. An Analyzer is effectively the same thing as a tokenizer combined with filters.

CharFilters run before the Tokenizer, and Filters run after the Tokenizer.

CharFilters, Tokenizers, Filters, and Analyzers are Lucene concepts.

> My understanding is tokenizer is used to say how the content should be indexed physically in file system. Filters are used to query result

The format of the index on disk is not controlled by the tokenizer, or anything else in the analysis chain. It is controlled by the Lucene codec. Only a very small part of the codec is configurable in Solr, but normally this does not need configuring. The codec defaults are appropriate for the majority of use cases.

Thanks,
Shawn
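
The ordering described above (CharFilters, then exactly one Tokenizer, then Filters) can be illustrated with a small Python sketch. This is not Lucene or Solr code. Every name in it (`analyze`, `strip_tags`, `whitespace_tokenizer`, `lowercase_filter`, `stop_filter`) is hypothetical and exists only to model the input and output types of each stage: a CharFilter maps a string to a string, a Tokenizer maps a string to a token stream, and a Filter maps a token stream to a token stream.

```python
import re
from typing import Callable, List

# Illustrative type aliases only; these are not Lucene classes.
CharFilter = Callable[[str], str]               # string in, string out
Tokenizer = Callable[[str], List[str]]          # string in, token stream out
TokenFilter = Callable[[List[str]], List[str]]  # token stream in, token stream out


def strip_tags(text: str) -> str:
    """Hypothetical CharFilter: drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]*>", "", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Hypothetical Tokenizer: split the single input string on whitespace."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Hypothetical Filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens: List[str]) -> List[str]:
    """Hypothetical Filter: remove a few common words."""
    stop_words = {"the", "a", "an"}
    return [t for t in tokens if t not in stop_words]


def analyze(
    text: str,
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> List[str]:
    """Run zero or more CharFilters, exactly one Tokenizer, zero or more Filters."""
    for char_filter in char_filters:
        text = char_filter(text)          # still a single string
    tokens = tokenizer(text)              # string becomes a token stream
    for token_filter in token_filters:
        tokens = token_filter(tokens)     # token stream stays a token stream
    return tokens


result = analyze(
    "<b>The Quick</b> brown Fox",
    [strip_tags],
    whitespace_tokenizer,
    [lowercase_filter, stop_filter],
)
print(result)
```

The signature of `analyze` takes a single tokenizer rather than a list, which mirrors why a chain cannot contain two: once the string has become a token stream, only stages that accept a token stream can follow.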