Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
Subject: Re: FW: Difference Between Tokenizer and filter
To: solr-user@lucene.apache.org
References: <8B9BE879D2A8964E896F0448525CAAEE018D155C@PRD-MSG-EXMB-9.ceb.com>
 <56D6FB3F.40507@rondhuit.com>
 <8B9BE879D2A8964E896F0448525CAAEE018D2621@PRD-MSG-EXMB-9.ceb.com>
From: Shawn Heisey <apache@elyograg.org>
Message-ID: <56D7240D.4020502@elyograg.org>
Date: Wed, 2 Mar 2016 10:34:05 -0700
In-Reply-To: <8B9BE879D2A8964E896F0448525CAAEE018D2621@PRD-MSG-EXMB-9.ceb.com>

On 3/2/2016 9:55 AM, G, Rajesh wrote:
> Thanks for your email Koji. Can you please explain what is the role of tokenizer and filter so I can understand why I should not have two tokenizer in index and I should have at least one tokenizer in query?

You can't have two tokenizers in one analysis chain.  It's not allowed.

The only notable difference between a Tokenizer and a Filter is that a
Tokenizer operates on an input that's a single string, turning it into a
token stream, and a Filter uses a token stream for both input and
output.  A CharFilter uses a single string as both input and output.

An analysis chain in the Solr schema (whether it's index or query) is
composed of zero or more CharFilter entries, exactly one Tokenizer
entry, and zero or more Filter entries.  Alternatively, you can specify
an Analyzer class instead of listing the individual entries.  An
Analyzer is effectively a Tokenizer combined with Filters, packaged as
a single class.

CharFilters run before the Tokenizer, and Filters run after the
Tokenizer.  CharFilters, Tokenizers, Filters, and Analyzers are Lucene
concepts.
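
As a rough illustration of that ordering, here is a toy model in plain
Python.  It is not the Lucene or Solr API, and every name in it
(strip_tags, whitespace_tokenizer, analyze, and so on) is made up for
this sketch.  It only shows the shapes involved: a CharFilter maps a
string to a string, the Tokenizer maps a string to a token stream, and
a Filter maps a token stream to a token stream.

```python
import re
from typing import Callable, List

# Hypothetical type aliases for this sketch only.
CharFilter = Callable[[str], str]               # string in, string out
Tokenizer = Callable[[str], List[str]]          # string in, tokens out
TokenFilter = Callable[[List[str]], List[str]]  # tokens in, tokens out


def strip_tags(text: str) -> str:
    """Toy CharFilter: replace anything that looks like a tag with a space."""
    return re.sub(r"<[^>]*>", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Toy Tokenizer: split the string on runs of whitespace."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Toy Filter: lowercase every token."""
    return [t.lower() for t in tokens]


def stop_filter(tokens: List[str]) -> List[str]:
    """Toy Filter: drop a few common words."""
    stop_words = {"the", "a", "an"}
    return [t for t in tokens if t not in stop_words]


def analyze(
    text: str,
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> List[str]:
    """Run zero or more CharFilters, exactly one Tokenizer, then zero or
    more Filters, in that order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


if __name__ == "__main__":
    print(analyze("<b>The</b> Quick fox",
                  [strip_tags],
                  whitespace_tokenizer,
                  [lowercase_filter, stop_filter]))
```

Note that analyze() takes a single tokenizer argument but lists of the
other two.  There is only one point in the chain where a string can
become a token stream, which is why a second tokenizer has nowhere to
go.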

> My understanding is tokenizer is used to say how the content should be indexed physically in file system. Filters are used to query result

The format of the index on disk is not controlled by the tokenizer, or
anything else in the analysis chain.  It is controlled by the Lucene
codec.  Only a very small part of the codec is configurable in Solr, but
normally this does not need configuring.  The codec defaults are
appropriate for the majority of use cases.

Thanks,
Shawn