lucene-dev mailing list archives

From "David Smiley (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8323) New ConcatenateFilter, a TokenFilter to concat/join tokens
Date Thu, 24 May 2018 19:14:00 GMT


David Smiley commented on LUCENE-8323:

Thanks for the review Adrien.

bq. Did you use those license headers on purpose for new files? They don't look like the usual
ones that we use.

It's deliberate; this comes from another project, remember.  (see my 1st comment).

I agree with all your other points but it may be moot now based on my discovery of CompletionTokenStream...

bq. You can also check the CompletionTokenStream in the suggest package. It does exactly what
you want and it's already a TokenStream so maybe it can be renamed and moved to the analysis
module ?

Wow thanks Jim; this is exactly what I'm looking for!

+1 to move/rename CompletionTokenStream for broader use.

I think it should be made a TokenFilter so that it can be used easily with, say, CustomAnalyzer.
 I did this as a quick hack and it's mostly okay.  I had to debug some tokenstream
lifecycle issues, though that wasn't so much because it's a Filter as because it's now getting
tested in a more hardened way thanks to BaseTokenStreamTestCase.

What name?  CompletionGraphTokenFilter maybe, but the word "Completion" is tied too much to
its original use-case.  Maybe ConcatenateGraphTokenFilter or, shorter, ConcatGraphTokenFilter?
 FiniteStringsGraphTokenFilter is another idea, though its name seems very non-obvious to
all but internal Lucene devs.

I think we should add "@see" references between GraphTokenStreamFiniteStrings and CompletionTokenStream
as these things do very similar things.  It appears TokenStreamToAutomaton (used by CompletionTokenStream)
and is a duplicated algorithm... they could be reused
maybe.  But I didn't look closely to see.

I just did a quick hack experiment of using CompletionTokenStream in place of ConcatenateFilter
with the SolrTextTagger tests and it basically works.  I mentioned above some lifecycle stuff
I debugged.  I needed to make the separator customizable (e.g. to be a space).  One weird
thing is that the first position increment of CompletionTokenStream is 0, which IndexWriter
is unhappy with, so I set it to 1.  Interestingly, BaseTokenStreamTestCase didn't complain
about this, yet real-world use complained right away.  Maybe BaseTokenStreamTestCase needs
to explicitly test this?
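
To picture why a first position increment of 0 is a problem, here is a simplified, plain-Java model of turning position increments into absolute positions. It is not Lucene code: the class name, the starting value of -1, and the exception message are assumptions made for illustration only.

```java
final class PositionSketch {
    // Converts per-token position increments into absolute positions.
    // Assumption for this sketch: the position starts at -1, so the first
    // token needs an increment of at least 1 to land on position 0.
    static int[] positions(int[] increments) {
        int position = -1;
        int[] result = new int[increments.length];
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            if (position < 0) {
                // Reached when the first increment is 0 (or negative).
                throw new IllegalArgumentException(
                    "first position increment must be > 0");
            }
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        // Increments 1, 1, 0: the third token stacks on the second.
        System.out.println(java.util.Arrays.toString(positions(new int[] {1, 1, 0})));
    }
}
```

In this model an increment of 0 is fine for later tokens (they stack on the previous position) but invalid for the first one, which matches the symptom described above.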

I'll throw up a patch once I get confirmation on a name.

> New ConcatenateFilter, a TokenFilter to concat/join tokens
> ----------------------------------------------------------
>                 Key: LUCENE-8323
>                 URL:
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: David Smiley
>            Assignee: David Smiley
>            Priority: Major
>         Attachments: LUCENE-8323.patch
> Here I introduce the ConcatenateFilter (with Factory) to concatenate/join tokens with
a provided separator to produce one final token.  It's similar to FingerprintFilter but doesn't
deduplicate or sort.  It's useful for doing exact-ish search on short text (think names or
titles) with simple analysis.  At this task, it's faster than a PhraseQuery equivalent, and
solves the issue of matching completely and not a portion of the tokens.  It's also useful
for using Lucene to hold a dictionary of short names/phrases for entity-extraction (aka text
tagging).  The OpenSextant SolrTextTagger uses it for this purpose, which is where I'm taking
it from.
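
The difference from a fingerprint-style filter described above can be sketched in plain Java. This is only a model of the token-joining behavior, not the attached patch or any Lucene API; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.TreeSet;

final class ConcatSketch {
    // Concatenate-style: join tokens in their original order, keeping duplicates.
    static String concatenate(List<String> tokens, String separator) {
        return String.join(separator, tokens);
    }

    // Fingerprint-style: deduplicate and sort the tokens before joining.
    static String fingerprint(List<String> tokens, String separator) {
        return String.join(separator, new TreeSet<>(tokens));
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("new", "york", "new", "jersey");
        System.out.println(concatenate(tokens, " ")); // new york new jersey
        System.out.println(fingerprint(tokens, " ")); // jersey new york
    }
}
```

Because the concatenated form preserves order and repetition, matching the single joined token behaves like an exact match on the whole analyzed text rather than on a subset of its tokens.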

This message was sent by Atlassian JIRA
