lucene-dev mailing list archives

From "Steve Rowe (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-4619) Improve PreAnalyzedField query analysis
Date Thu, 14 Jan 2016 01:35:39 GMT

    [ https://issues.apache.org/jira/browse/SOLR-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097417#comment-15097417 ]

Steve Rowe edited comment on SOLR-4619 at 1/14/16 1:34 AM:
-----------------------------------------------------------

Patch that brings Andrzej's patch up to date with trunk, and adds tests for query-time functionality.

I had assumed that {{PreAnalyzedField}}-s would use the {{PreAnalyzedTokenizer}} at query
time, but that is not (currently) the case: instead {{FieldType.DefaultAnalyzer}} is used.
When no analyzer is specified, this patch changes the behavior to use {{PreAnalyzedTokenizer}} instead.

However, there is a chicken-and-egg interaction between {{PreAnalyzedTokenizer}} and {{QueryBuilder.createFieldQuery()}},
which aborts before performing any tokenization if the supplied analyzer's attribute factory
doesn't contain a {{TermToBytesRefAttribute}}.  But {{PreAnalyzedTokenizer}} doesn't have
any attributes defined until the input stream is consumed, in {{reset()}}. [~rcmuir] added
a comment as part of LUCENE-5388 to {{PreAnalyzedTokenizer}}'s ctor, where {{AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY}}
is set as the attribute factory rather than the default packed implementation: "we don't pack
attributes: since we are used for (de)serialization and dont want bloat."

This patch moves the {{stream.reset()}} call in {{QueryBuilder.createFieldQuery()}} in front
of the {{TermToBytesRefAttribute}} check, so that {{PreAnalyzedTokenizer}} (and other tokenizers
that don't have a pre-added set of attributes) has a chance to populate its attributes, and
also moves the {{addAttribute(PositionIncrementAttribute.class)}} call to after the {{TermToBytesRefAttribute}}
check, since that won't be needed if no tokenization will be performed.
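
To illustrate why the order matters, here is a minimal, self-contained sketch. It deliberately does not use the Lucene API: {{LazyStream}}, {{checkThenReset()}} and {{resetThenCheck()}} are hypothetical stand-ins, and attributes are modeled as plain strings. It only shows the ordering issue described above, not the actual patch.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of the ordering problem; NOT the real Lucene classes.
class AttributeOrderDemo {

    // Hypothetical stand-in for a token stream that, like PreAnalyzedTokenizer
    // as described above, only gains its attributes once reset() is called.
    static class LazyStream {
        private final Set<String> attributes = new HashSet<>();

        void reset() {
            // Attributes become known only when the input is consumed.
            attributes.add("TermToBytesRefAttribute");
        }

        boolean hasAttribute(String name) {
            return attributes.contains(name);
        }
    }

    // Old order: check for the term attribute first, reset afterwards.
    // For a lazy stream the check fails, so no tokenization is attempted.
    static String checkThenReset(LazyStream stream) {
        if (!stream.hasAttribute("TermToBytesRefAttribute")) {
            return null; // aborts before any tokenization
        }
        stream.reset();
        return "query";
    }

    // Patched order: reset first, so the stream can populate its attributes
    // before the check runs.
    static String resetThenCheck(LazyStream stream) {
        stream.reset();
        if (!stream.hasAttribute("TermToBytesRefAttribute")) {
            return null;
        }
        return "query";
    }

    public static void main(String[] args) {
        System.out.println(checkThenReset(new LazyStream())); // prints "null"
        System.out.println(resetThenCheck(new LazyStream())); // prints "query"
    }
}
```

In this model the only difference between the two methods is where {{reset()}} sits relative to the attribute check, which is the change the patch makes to {{QueryBuilder.createFieldQuery()}}.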

An alternative approach to fixing the chicken-and-egg problem might be to have {{PreAnalyzedTokenizer}}
always include a dummy {{TermToBytesRefAttribute}} implementation, and then remove it when
{{reset()}} is called, but that seems hackish.

I haven't yet run the full test suite with this patch, but the included query-time {{PreAnalyzedField}}
tests pass.

I welcome feedback.


was (Author: steve_rowe):
Patch that brings Andrzej's patch up to date with trunk, and adds tests for query-time functionality.

I had assumed that {{PreAnalyzedField}}-s would use the {{PreAnalyzedTokenizer}} at query
time, but that is not (currently) the case: instead {{FieldType.DefaultAnalyzer}} is used.
 This patch changes the behavior when no analyzer is specified to instead use {{PreAnalyzedTokenizer}}.

However, there is a chicken-and-egg interaction between {{PreAnalyzedTokenizer}} and {{QueryBuilder.createFieldQuery()}},
which aborts before performing any tokenization if the supplied analyzer's attribute factory
doesn't contain a {{TermToBytesRefAttribute}}.  But {{PreAnalyzedTokenizer}} doesn't have
any attributes defined until the input stream is consumed, in {{reset()}}. [~rcmuir] added
a comment as part of LUCENE-5388 to {{PreAnalyzedTokenizer}}'s ctor, where {{AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY}}
is set as the attribute factory rather than the default packed implementation: "we don't pack
attributes: since we are used for (de)serialization and dont want bloat."

This patch moves the {{stream.reset()}} call in {{QueryBuilder.createFieldQuery()}} in front
of the {{TermToBytesRefAttribute}} check, so that {{PreAnalyzedTokenizer}} (and other tokenizers
that don't have a pre-added set of attributes) and also moves the {{addAttribute(PositionIncrementAttribute.class)}}
call to after the the {{TermToBytesRefAttribute}} check.

An alternate approach to fix the chicken-and-egg problem might be to have {{PreAnalyzedTokenizer}}
always include a dummy {{TermToBytesRefAttribute}} implementation, and then remove it when
{{reset()}} is called, but that seems hackish.

I haven't run the full tests yet with this patch, but the included query-time {{PreAnalyzedField}}
tests success.

I welcome feedback.

> Improve PreAnalyzedField query analysis
> ---------------------------------------
>
>                 Key: SOLR-4619
>                 URL: https://issues.apache.org/jira/browse/SOLR-4619
>             Project: Solr
>          Issue Type: Bug
>          Components: Schema and Analysis
>    Affects Versions: 4.0, 4.1, 4.2, 4.2.1, Trunk
>            Reporter: Andrzej Bialecki 
>            Assignee: Andrzej Bialecki 
>             Fix For: Trunk
>
>         Attachments: SOLR-4619.patch, SOLR-4619.patch
>
>
> PreAnalyzed field extends plain FieldType and mistakenly uses the DefaultAnalyzer as the
> query analyzer, and doesn't allow for customization via <analyzer> schema elements.
> Instead it should extend TextField and support all query analysis supported by that type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
