lucene-solr-user mailing list archives

From "G, Rajesh" ...@cebglobal.com>
Subject RE: FW: Difference Between Tokenizer and filter
Date Thu, 03 Mar 2016 12:42:11 GMT
Thanks, Shawn. This helps.




-----Original Message-----
From: Shawn Heisey [mailto:apache@elyograg.org]
Sent: Wednesday, March 2, 2016 11:04 PM
To: solr-user@lucene.apache.org
Subject: Re: FW: Difference Between Tokenizer and filter

On 3/2/2016 9:55 AM, G, Rajesh wrote:
> Thanks for your email, Koji. Can you please explain what the role of the
> tokenizer and filter is, so I can understand why I should not have two
> tokenizers in index and should have at least one tokenizer in query?

You can't have two tokenizers.  It's not allowed.

The only notable difference between a Tokenizer and a Filter is that a Tokenizer takes a
single string as its input and turns it into a token stream, while a Filter takes a token
stream as both its input and its output.  A CharFilter takes a single string as both input and output.
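
For illustration, here is a minimal Lucene sketch of that difference (assuming a 5.x-era
lucene-core and lucene-analyzers-common on the classpath; the class name TokenizerVsFilter
is just for the example):

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.core.LowerCaseFilter;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class TokenizerVsFilter {
      public static void main(String[] args) throws Exception {
        // Tokenizer: a single string (via a Reader) in, a token stream out.
        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("The Quick BROWN Fox"));

        // Filter: a token stream in, a token stream out.
        TokenStream stream = new LowerCaseFilter(tokenizer);

        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
          System.out.println(term.toString());  // prints: the, quick, brown, fox
        }
        stream.end();
        stream.close();
      }
    }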

An analysis chain in the Solr schema (whether it's the index or the query chain) is composed of zero or
more CharFilter entries, exactly one Tokenizer entry, and zero or more Filter entries.  Alternatively,
you can specify an Analyzer class instead of that chain; an Analyzer is effectively a Tokenizer
combined with Filters, packaged as one unit.

CharFilters run before the Tokenizer, and Filters run after the Tokenizer.  CharFilters, Tokenizers,
Filters, and Analyzers are Lucene concepts.
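
In code, the same chain that a Solr <analyzer> element describes can be built with Lucene's
CustomAnalyzer -- a rough sketch, again assuming a 5.x-era lucene-analyzers-common (the factory
classes are the ones the solr.* names in the schema resolve to; ChainDemo is just an
illustrative name):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.charfilter.HTMLStripCharFilterFactory;
    import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
    import org.apache.lucene.analysis.core.StopFilterFactory;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;
    import org.apache.lucene.analysis.standard.StandardTokenizerFactory;

    public class ChainDemo {
      public static void main(String[] args) throws Exception {
        Analyzer analyzer = CustomAnalyzer.builder()
            // zero or more CharFilters, applied to the raw string first
            .addCharFilter(HTMLStripCharFilterFactory.class)
            // exactly one Tokenizer, which produces the token stream
            .withTokenizer(StandardTokenizerFactory.class)
            // zero or more Filters, each consuming and producing a token stream
            .addTokenFilter(LowerCaseFilterFactory.class)
            .addTokenFilter(StopFilterFactory.class)
            .build();

        // In Solr you declare the index and query chains separately inside a
        // <fieldType>, each with this same CharFilter -> Tokenizer -> Filter shape.
        analyzer.close();
      }
    }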

> My understanding is that the tokenizer is used to say how the content should
> be indexed physically in the file system, and filters are used to query results.

The format of the index on disk is not controlled by the tokenizer, or anything else in the
analysis chain.  It is controlled by the Lucene codec.  Only a very small part of the codec
is configurable in Solr, but normally this does not need configuring.  The codec defaults
are appropriate for the majority of use cases.
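
At the Lucene level the codec is picked on the IndexWriterConfig, completely separate from the
analyzer -- a rough sketch to show that separation (RAMDirectory and the default codec are just
for illustration; Solr's hook for this is the <codecFactory> element in solrconfig.xml):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class CodecSketch {
      public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory();

        // The analyzer controls how text is broken into tokens...
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());

        // ...while the codec controls how the resulting index is written to disk.
        // Normally you leave this at the default.
        cfg.setCodec(Codec.getDefault());

        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
          // documents added here are stored in whatever format the codec defines
        }
      }
    }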

Thanks,
Shawn
