lucene-dev mailing list archives

From "Ted Sullivan (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-7136) Add an AutoPhrasing TokenFilter
Date Thu, 26 Feb 2015 05:02:05 GMT

    [ https://issues.apache.org/jira/browse/SOLR-7136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337873#comment-14337873 ]

Ted Sullivan edited comment on SOLR-7136 at 2/26/15 5:01 AM:
-------------------------------------------------------------

[~otis] I am looking at this now - I have not tested/compared these solutions yet. I will
definitely do that. The strategy of [SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379]
is a clever one - it forces tokenization of phrases that are in synonyms.txt by detecting
internal whitespace and then either forces PhraseQuery logic or automatic quoting when building
the Lucene Query (using TypeAttributes). In that sense, the two ideas are similar.
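
A rough sketch of the whitespace-detection and auto-quoting idea, in plain Java with no Lucene dependency. This is not the code from the SOLR-5379 patches; the class and method names are hypothetical, and a real implementation would work on the token stream and TypeAttributes rather than on raw strings.

```java
import java.util.ArrayList;
import java.util.List;

public class AutoQuoteSketch {

    /** Keeps only the synonym entries that contain internal whitespace. */
    public static List<String> multiWordEntries(List<String> synonymEntries) {
        List<String> phrases = new ArrayList<>();
        for (String entry : synonymEntries) {
            String trimmed = entry.trim();
            if (trimmed.matches(".*\\s+.*")) {
                phrases.add(trimmed);
            }
        }
        return phrases;
    }

    /**
     * Wraps each known phrase found in the query in double quotes.
     * Naive on purpose: it ignores word boundaries and phrases that
     * are already quoted.
     */
    public static String autoQuote(String query, List<String> phrases) {
        String result = query;
        for (String phrase : phrases) {
            result = result.replace(phrase, "\"" + phrase + "\"");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> entries = List.of("big apple", "nyc", "new york");
        List<String> phrases = multiWordEntries(entries);
        // prints: hotels in "new york"
        System.out.println(autoQuote("hotels in new york", phrases));
    }
}
```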

The autophrasing token filter solves a slightly different problem than does [SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379]
in that it does not require that a term be listed as a synonym of something else to get the
correct semantic tokenization. Simply reducing false positives due to partial hits on a phrase
can be a large improvement in precision and not everything has an obvious synonym. Therefore,
it has value even if [SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379] is also committed
(or patched). Another difference is that one of the [SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379]
patches uses PhraseQuery, whereas the solution of combining autophrasing with synonym
mapping does not. How much of a performance difference this might entail I can't say - probably
not a great deal unless we are talking about very large queries. The auto quoting parser patches
work in a similar fashion to the AutoPhrasingQParserPlugin as a workaround to [LUCENE-2605|https://issues.apache.org/jira/browse/LUCENE-2605].
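
To illustrate what auto-phrasing does to a token stream, here is a minimal plain-Java sketch. It is not the TokenFilter from the attached patch: the class name, the greedy longest-match strategy, and the '_' separator are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class AutoPhraseSketch {

    /**
     * Greedily merges the longest run of tokens that matches a known phrase
     * into one token, joined with '_' (the separator is an assumption).
     */
    public static List<String> autoPhrase(List<String> tokens, Set<String> phrases, int maxPhraseLength) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int matched = 0;
            // Try the longest candidate first so a longer phrase beats its prefix.
            for (int len = Math.min(maxPhraseLength, tokens.size() - i); len >= 2; len--) {
                String candidate = String.join(" ", tokens.subList(i, i + len));
                if (phrases.contains(candidate)) {
                    matched = len;
                    break;
                }
            }
            if (matched > 0) {
                out.add(String.join("_", tokens.subList(i, i + matched)));
                i += matched;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> phrases = Set.of("income tax", "tax refund");
        // prints: [income_tax, refund]
        System.out.println(autoPhrase(List.of("income", "tax", "refund"), phrases, 2));
    }
}
```

Because "income tax" becomes a single token, a query for "tax" alone no longer produces a partial hit on it, which is the precision gain described above. No synonym entry is needed for this to work.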

The autophrasing multi-term synonym solution does have the disadvantage of requiring coupling
between the autophrases.txt and synonyms.txt, which the other solution does not. But that
said, the other solution does not deal with multi-word terms that do not have synonyms (I
suppose that you could create a dummy synonym but that would be difficult to maintain).
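
The coupling between the two files could be checked mechanically. The sketch below assumes (hypothetically) that multi-word terms appear in the synonym rules in their joined form with a '_' separator, and that the auto-phrase list holds the same phrases with spaces; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PhraseCouplingCheck {

    /**
     * Returns the joined multi-word terms used in the synonym rules that have
     * no matching entry in the auto-phrase list, i.e. terms the auto-phrasing
     * step would never emit.
     */
    public static List<String> missingAutoPhrases(Set<String> autoPhrases, List<String> synonymTerms) {
        List<String> missing = new ArrayList<>();
        for (String term : synonymTerms) {
            if (!term.contains("_")) {
                continue; // single-word terms need no auto-phrase entry
            }
            String asPhrase = term.replace('_', ' ');
            if (!autoPhrases.contains(asPhrase)) {
                missing.add(term);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> autoPhrases = Set.of("new york", "big apple");
        List<String> synonymTerms = List.of("new_york", "big_apple", "nyc", "empire_city");
        // prints: [empire_city]
        System.out.println(missingAutoPhrases(autoPhrases, synonymTerms));
    }
}
```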

To answer your question about a 'superset' - yes, if you consider the two solutions for multi-term
synonym mapping to be equivalent. All in all, I would say that both solutions are valuable
and would add useful functionality to the available Solr toolset. Dealing with multi-word
terms is a problem that many Solr deployments have and it is one that remains unresolved.

I think that the query parser solution in [SOLR-5379|https://issues.apache.org/jira/browse/SOLR-5379]
is better as it solves the problem in a general way. Getting non-synonymous phrases into it
may require some tweaking to get the TypeAttribute to match up. I wouldn't use typeAttribute.type().equals("SYNONYM");
maybe typeAttribute.type() should be "PHRASE".



> Add an AutoPhrasing TokenFilter
> -------------------------------
>
>                 Key: SOLR-7136
>                 URL: https://issues.apache.org/jira/browse/SOLR-7136
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Ted Sullivan
>         Attachments: SOLR-7136.patch, SOLR-7136.patch
>
>
> Adds an 'autophrasing' token filter which is designed to enable noun phrases that represent
> a single entity to be tokenized in a singular fashion. Adds support for ManagedResources and
> Query parser auto-phrasing support given LUCENE-2605.
> The rationale for this Token Filter and its use in solving the long-standing multi-term
> synonym problem in Lucene/Solr has been documented online.
> http://lucidworks.com/blog/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/
> https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

