lucene-dev mailing list archives

From "David Smiley (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8332) New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
Date Mon, 28 May 2018 19:51:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492904#comment-16492904 ]

David Smiley commented on LUCENE-8332:
--------------------------------------

Oh, I wanted to mention one thing, perhaps just here, though I could also put it in the docs.

An alternative approach to this tagger might be to use the SynonymGraphFilter (with other
steps/configuration), which has a lot of similarities with the Tagger's algorithm.  I've
heard of others who have done this (Dice.com?), and before I created the tagger I thought
about this approach too.  There are some issues/barriers to "just" using the synonym filter:
* If the filter finds multiple overlapping matches, it returns only one, with no control
  over which.  (Compare to the STT's (SolrTextTagger's) "overlaps" param, which offers
  several choices and is pluggable.)
* The filter doesn't hold any metadata; it's just a set of names.  You could, though, use
  synonyms to map to an ID that you then look up in something else (e.g. some DB or Solr
  index).
* The synonym filter must reconstruct its FST on startup each time; customizations are
  necessary to load an existing one from disk.
* You have to arrange for any text processing/analysis (e.g. tokenization rules or phonetic
  filters) of the dictionary to create synonym entries.  With the STT this is all
  configurable in a standard way, like any text field.
* And of course you'd have to glue it all together somehow.
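
To make the first point concrete, here is a minimal Java sketch of what a selectable
overlap policy could look like.  It is only an illustration and is not the tagger's actual
code: the Tag class, the Policy enum, and resolve() are hypothetical names, matches are
assumed to be token-position spans [start, end), and only two policies are modeled (keep
every match, or drop matches nested inside a longer one).  It uses no Lucene or Solr classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of a selectable overlap policy; not SolrTextTagger code. */
final class OverlapSketch {

    /** A dictionary match over token positions [start, end), plus an ID for metadata lookup. */
    static final class Tag {
        final int start;
        final int end;
        final String id;

        Tag(int start, int end, String id) {
            this.start = start;
            this.end = end;
            this.id = id;
        }

        /** True if this tag lies entirely inside {@code other} and covers a different span. */
        boolean isStrictlyWithin(Tag other) {
            return other.start <= start && end <= other.end
                    && (other.start != start || other.end != end);
        }
    }

    /** Assumed policy names for this sketch. */
    enum Policy { ALL, NO_SUB }

    /** Applies the chosen policy to the raw matches and returns the surviving tags in order. */
    static List<Tag> resolve(List<Tag> tags, Policy policy) {
        if (policy == Policy.ALL) {
            return new ArrayList<>(tags); // keep every match, overlapping or not
        }
        // NO_SUB: drop any tag that is nested inside another, longer tag.
        List<Tag> kept = new ArrayList<>();
        for (Tag candidate : tags) {
            boolean nested = false;
            for (Tag other : tags) {
                if (candidate.isStrictlyWithin(other)) {
                    nested = true;
                    break;
                }
            }
            if (!nested) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    /** Returns the IDs of the given tags, in order. */
    static List<String> ids(List<Tag> tags) {
        List<String> out = new ArrayList<>();
        for (Tag tag : tags) {
            out.add(tag.id);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up matches over the tokens: new(0) york(1) city(2) hall(3)
        List<Tag> tags = Arrays.asList(
                new Tag(0, 2, "new-york"),
                new Tag(1, 2, "york"),
                new Tag(0, 3, "new-york-city"),
                new Tag(2, 4, "city-hall"));
        System.out.println(ids(resolve(tags, Policy.ALL)));
        System.out.println(ids(resolve(tags, Policy.NO_SUB)));
    }
}
```

The ID on each tag stands in for the second point: a match carries only a key, and any
metadata would have to be looked up elsewhere (some DB or Solr index).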

> New ConcatenateGraphTokenStream (move/rename CompletionTokenStream)
> -------------------------------------------------------------------
>
>                 Key: LUCENE-8332
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8332
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Let's move the CompletionTokenStream from the suggest module into the analysis module
> and rename it ConcatenateGraphTokenStream. See comments in LUCENE-8323 leading to this
> idea. Such a TokenStream (or TokenFilter?) has several uses:
>  * for the suggest module
>  * by the SolrTextTagger for NER/ERD use cases – SOLR-12376
>  * for doing complete match search efficiently
> It will need a factory – a TokenFilterFactory, even though we don't have a
> TokenFilter-based subclass of TokenStream.
> There appears to be no back-compat concern with it suddenly disappearing from the suggest
> module: it's marked experimental, and it only seems to be public now, perhaps due to some
> technicality (it has package-level constructors).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)