lucene-solr-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: FW: Difference Between Tokenizer and filter
Date Thu, 03 Mar 2016 14:12:24 GMT
Try re-reading the doc on "Understanding Analyzers, Tokenizers, and
Filters" and then ask specific questions on specific statements made in the
doc:
https://cwiki.apache.org/confluence/display/solr/Understanding+Analyzers,+Tokenizers,+and+Filters

As for the on-disk format, a Solr user has absolutely zero reason to be
concerned about what format Lucene uses to store the index on disk. You are
certainly welcome to dive down to that level if you wish, but that is not
something worth discussing on this list. To a Solr user the index is simply
a list of terms at positions, both determined by the character filters,
tokenizer, and token filters of the analyzer. The format of that
information as stored in Lucene won't impact the behavior of your Solr app
in any way.
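
As a rough picture of "terms at positions", here is a toy sketch in plain Python. It is not Lucene or Solr code: the function names, the tag-stripping rule, and the stopword set are all made up for illustration, and real analysis components behave in more detail than this.

```python
import re

def char_filter(text):
    """CharFilter stage: string in, string out (here: strip HTML-like tags)."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer stage: string in, token stream out as (term, position) pairs."""
    return [(term, pos) for pos, term in enumerate(text.split(), start=1)]

def lowercase_filter(tokens):
    """Token filter stage: token stream in, token stream out."""
    return [(term.lower(), pos) for term, pos in tokens]

def stop_filter(tokens, stopwords=frozenset({"the", "a", "of"})):
    """Token filter that drops stopwords; surviving terms keep their positions."""
    return [(term, pos) for term, pos in tokens if term not in stopwords]

def analyze(text):
    """Hypothetical chain: one char filter, one tokenizer, two token filters."""
    tokens = tokenize(char_filter(text))
    for token_filter in (lowercase_filter, stop_filter):
        tokens = token_filter(tokens)
    return tokens

print(analyze("<b>The</b> Quick fox"))  # [('quick', 2), ('fox', 3)]
```

The output is all the "index" amounts to from the user's point of view: a term text and a position for each surviving token. How Lucene lays that out on disk does not enter into it.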

Again, to be clear, you need to be thoroughly familiar with that doc
section. It won't help you to try to guess questions to ask if you don't
have a basic understanding of what is stated on that doc page.

It might also help to visualize what the doc says by using the analysis
page of the Solr admin UI, which shows all the intermediate and final
results of the analysis process: the specific token/term text and position
at each step. But even that won't help if you are unable to grasp what is
stated on the basic doc page.
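
To give a feel for that kind of step-by-step output, here is a toy Python sketch, again not Solr code. The stage names, the synonym mapping, and the two chains are assumptions invented for the example. It records the result after every stage for an index-time chain and a separate query-time chain, then checks whether the query terms are found among the indexed terms.

```python
def run_chain(text, stages):
    """Apply each named stage in order, recording every intermediate result."""
    trace = [("input", text)]
    value = text
    for name, stage in stages:
        value = stage(value)
        trace.append((name, value))
    return trace

def whitespace_tokenizer(text):
    """String in, list of (term, position) pairs out."""
    return [(term, pos) for pos, term in enumerate(text.split(), start=1)]

def lowercase(tokens):
    return [(term.lower(), pos) for term, pos in tokens]

SYNONYMS = {"tv": "television"}  # made-up mapping, for illustration only

def synonyms(tokens):
    return [(SYNONYMS.get(term, term), pos) for term, pos in tokens]

# Hypothetical chains: the index side maps synonyms, the query side does not.
index_chain = [("tokenizer", whitespace_tokenizer),
               ("lowercase", lowercase),
               ("synonyms", synonyms)]
query_chain = [("tokenizer", whitespace_tokenizer),
               ("lowercase", lowercase)]

index_trace = run_chain("Cheap TV", index_chain)
query_trace = run_chain("television", query_chain)

for label, trace in (("index", index_trace), ("query", query_trace)):
    print(label)
    for name, value in trace:
        print(f"  {name:10} {value}")

# A query term can only match if the same term text came out of index analysis.
indexed_terms = {term for term, _ in index_trace[-1][1]}
query_terms = {term for term, _ in query_trace[-1][1]}
print(query_terms <= indexed_terms)  # True
```

The point of the sketch is only that each side runs its own chain over its own input, and matching happens on whatever terms come out the far end.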

-- Jack Krupansky

On Thu, Mar 3, 2016 at 8:51 AM, G, Rajesh <rg@cebglobal.com> wrote:

> Hi Shawn,
>
> One last question on analyzers. If the format of the index on disk is not
> controlled by the tokenizer, or anything else in the analysis chain, then
> what do type="index" and type="query" in an analyzer mean? Can you please
> help me understand?
>
>         <analyzer type="index">
>
>         </analyzer>
>         <analyzer type="query">
>
>         </analyzer>
>
>
> -----Original Message-----
> From: G, Rajesh
> Sent: Thursday, March 3, 2016 6:12 PM
> To: 'solr-user@lucene.apache.org' <solr-user@lucene.apache.org>
> Subject: RE: FW: Difference Between Tokenizer and filter
>
> Thanks Shawn. This helps
>
> -----Original Message-----
> From: Shawn Heisey [mailto:apache@elyograg.org]
> Sent: Wednesday, March 2, 2016 11:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: FW: Difference Between Tokenizer and filter
>
> On 3/2/2016 9:55 AM, G, Rajesh wrote:
> > Thanks for your email, Koji. Can you please explain the roles of the
> > tokenizer and filter, so I can understand why I should not have two
> > tokenizers in index and should have at least one tokenizer in query?
>
> You can't have two tokenizers.  It's not allowed.
>
> The only notable difference between a Tokenizer and a Filter is that a
> Tokenizer operates on an input that's a single string, turning it into a
> token stream, and a Filter uses a token stream for both input and output.
> A CharFilter uses a single string as both input and output.
>
> An analysis chain in the Solr schema (whether it's index or query) is
> composed of zero or more CharFilter entries, exactly one Tokenizer entry,
> and zero or more Filter entries.  Alternatively, you can specify an
> Analyzer class in place of that chain.  An Analyzer is effectively the
> same thing as a tokenizer combined with filters.
>
> CharFilters run before the Tokenizer, and Filters run after the
> Tokenizer.  CharFilters, Tokenizers, Filters, and Analyzers are Lucene
> concepts.
>
> > My understanding is that the tokenizer determines how the content is
> > physically indexed in the file system, and filters are used to query results.
>
> The format of the index on disk is not controlled by the tokenizer, or
> anything else in the analysis chain.  It is controlled by the Lucene
> codec.  Only a very small part of the codec is configurable in Solr, but
> normally this does not need configuring.  The codec defaults are
> appropriate for the majority of use cases.
>
> Thanks,
> Shawn
>
>
