lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Searching number of tokens in text field
Date Mon, 30 Dec 2019 13:07:15 GMT
This comes up occasionally; it’d be a neat thing to add to Solr if you’re motivated.
It gets tricky though.

- Part of the config would have to be the name of the length field to put the result into;
that part’s easy.

- The trickier part is “when should the count be incremented?”. For instance, say you
add 15 synonyms for a particular word. Would that add 1 or 16 to the count? What about
WordDelimiterGraphFilterFactory (WDGFF), which can output N tokens in place of one? Do
stopwords count? What about shingles? CJK languages? The list goes on.

If you tackle this, I suggest you open a JIRA for discussion, probably a Lucene JIRA, 'cause
the folks who deal with Lucene would have the best feedback. I'd probably also ignore most of
the possible interactions with other filters and document that most users should just put it
immediately after the tokenizer and leave it at that ;)

I can think of a few other options, but about the only thing that I think makes sense is something
like “countTokensInTheSamePosition=true|false” (there’s _GOT_ to be a better name for
that!), defaulting to false so you could control whether synonym expansion and WDGFF insertions
incremented the count or not. And I suspect that if you put such a filter after WDGFF, you’d
also want to document that it should go after FlattenGraphFilterFactory, but trust any feedback
on a Lucene JIRA over my suspicion...
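
To make the "1 or 16" question concrete, here is a minimal stand-alone Java sketch of what
such a flag might control. It deliberately does not use the Lucene API: the Token class, the
countTokens and wordWithSynonyms helpers, and the class name are all hypothetical stand-ins.
It assumes the usual convention that a token stacked on the previous token's position (a
synonym, say) carries a position increment of 0, and it ignores stopword gaps, shingles, and
graph filters entirely.

```java
import java.util.ArrayList;
import java.util.List;

/** Stand-alone illustration only; not a Lucene TokenFilter. */
final class TokenCountSketch {

    /** Hypothetical stand-in for an analyzed token: a term plus its position increment. */
    static final class Token {
        final String term;
        final int positionIncrement; // 0 = stacked on the previous token's position

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Counts tokens. When countTokensInTheSamePosition is false, tokens with a
     * position increment of 0 (e.g. injected synonyms) are not counted.
     */
    static int countTokens(List<Token> tokens, boolean countTokensInTheSamePosition) {
        int count = 0;
        for (Token t : tokens) {
            if (countTokensInTheSamePosition || t.positionIncrement > 0) {
                count++;
            }
        }
        return count;
    }

    /** Builds one original word followed by the given number of stacked synonyms. */
    static List<Token> wordWithSynonyms(String word, int synonyms) {
        List<Token> out = new ArrayList<>();
        out.add(new Token(word, 1)); // the original word advances the position
        for (int i = 1; i <= synonyms; i++) {
            out.add(new Token(word + "_syn" + i, 0)); // synonyms share that position
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> tokens = wordWithSynonyms("fast", 15);
        System.out.println(countTokens(tokens, false)); // prints 1
        System.out.println(countTokens(tokens, true));  // prints 16
    }
}
```

With the flag at false, the word plus its 15 synonyms counts once; with it at true, all 16
tokens count. A real filter would presumably read the position increment from the token
stream's attributes rather than from a list like this.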

Best,
Erick

> On Dec 29, 2019, at 7:57 PM, Matt Davis <kryptonics411@gmail.com> wrote:
> 
> That is a clever idea.  I would still prefer something cleaner but this
> could work.  Thanks!
> 
> On Sat, Dec 28, 2019 at 10:11 PM Michael Sokolov <msokolov@gmail.com> wrote:
> 
>> I don't know of any pre-existing thing that does exactly this, but how
>> about a token filter that counts tokens (or positions maybe), and then
>> appends some special token encoding the length?
>> 
>> On Sat, Dec 28, 2019, 9:36 AM Matt Davis <kryptonics411@gmail.com> wrote:
>> 
>>> Hello,
>>> 
>>> I was wondering if it is possible to search for the number of tokens in a
>>> text field.  For example, find book titles with 3 or more words.  I don't
>>> mind adding a field that is the number of tokens to the search index, but I
>>> would like to avoid analyzing the text two times.  Can Lucene search for
>>> the number of tokens in a text field?  Or can I get the number of tokens
>>> after analysis and add it to the Lucene document before/during indexing?
>>> Or do I need to analyze the text myself and add the field to the document
>>> (analyze the text twice, once myself, once in the IndexWriter)?
>>> 
>>> Thanks,
>>> Matt Davis
>>> 
>> 



