lucene-java-user mailing list archives

From Matt Davis <kryptonics...@gmail.com>
Subject Re: Searching number of tokens in text field
Date Thu, 02 Jan 2020 15:58:35 GMT
Thanks Mike, that is very helpful.  Am I reading the code correctly that the
lossy norm encoding is done in the Similarity?  How do you set the number
of bytes used for the norms?

Thanks,
Matt

On Thu, Jan 2, 2020 at 10:31 AM Michael McCandless <
lucene@mikemccandless.com> wrote:

> Norms encode the number of tokens in the field, but in a lossy manner (1
> byte by default), so you could probably create a custom query that filtered
> based on that, if you could tolerate the loss in precision?  Or maybe
> change your norms storage to more precision?
>
> You could use NormsFieldExistsQuery as a starting point for the sources for
> your custom query.  Or maybe there's already a more similar Query based on
> norms?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Dec 30, 2019 at 8:07 AM Erick Erickson <erickerickson@gmail.com>
> wrote:
>
> > This comes up occasionally; it’d be a neat thing to add to Solr if you’re
> > motivated. It gets tricky though.
> >
> > - part of the config would have to be the name of the length field to put
> > the result into, that part’s easy.
> >
> > - The trickier part is “when should the count be incremented?”. For
> > instance, say you add 15 synonyms for a particular word. Would that add 1
> > or 16 to the count? What about WordDelimiterGraphFilterFactory, which can
> > output N tokens in place of one? Do stopwords count? What about shingles?
> > CJK languages? The list goes on.
> >
> > If you tackle this I suggest you open a JIRA for discussion, probably a
> > Lucene JIRA ‘cause the folks who deal with Lucene would have the best
> > feedback. And probably ignore most of the possible interactions with other
> > filters and document that most users should just put it immediately after
> > the tokenizer and leave it at that ;)
> >
> > I can think of a few other options, but about the only thing that I think
> > makes sense is something like “countTokensInTheSamePosition=true|false”
> > (there’s _GOT_ to be a better name for that!), defaulting to false so you
> > could control whether synonym expansion and WDGFF insertions incremented
> > the count or not. And I suspect that if you put such a filter after WDGFF,
> > you’d also want to document that it should go after
> > FlattenGraphFilterFactory, but trust any feedback on a Lucene JIRA over my
> > suspicion...
> >
> > Best,
> > Erick
> >
> > > On Dec 29, 2019, at 7:57 PM, Matt Davis <kryptonics411@gmail.com> wrote:
> > >
> > > That is a clever idea.  I would still prefer something cleaner but this
> > > could work.  Thanks!
> > >
> > > On Sat, Dec 28, 2019 at 10:11 PM Michael Sokolov <msokolov@gmail.com> wrote:
> > >
> > >> I don't know of any pre-existing thing that does exactly this, but how
> > >> about a token filter that counts tokens (or positions maybe), and then
> > >> appends some special token encoding the length?
> > >>
> > >> On Sat, Dec 28, 2019, 9:36 AM Matt Davis <kryptonics411@gmail.com> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I was wondering if it is possible to search for the number of tokens in a
> > >>> text field.  For example, find book titles with 3 or more words.  I don't
> > >>> mind adding a field that is the number of tokens to the search index, but I
> > >>> would like to avoid analyzing the text two times.  Can Lucene search for
> > >>> the number of tokens in a text field?  Or can I get the number of tokens
> > >>> after analysis and add it to the Lucene document before/during indexing?
> > >>> Or do I need to analyze the text myself and add the field to the document
> > >>> (analyze the text twice, once myself, once in the IndexWriter)?
> > >>>
> > >>> Thanks,
> > >>> Matt Davis
> > >>>
> > >>
>
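
To make the "lossy, 1 byte" point about norms concrete, here is a minimal sketch of what squeezing a token count into a single byte can look like. This is not Lucene's actual encoding (as far as I know that lives in `SmallFloat` and is called from the similarity's `computeNorm`, but please verify against your Lucene version). The class `LossyLength` and its scheme are hypothetical: lengths below 16 are kept exactly, larger ones keep only their top 4 significant bits plus a shift.

```java
// Hypothetical illustration only; NOT the encoding Lucene uses for norms.
final class LossyLength {

    // Pack a non-negative length into a code that fits in one unsigned byte.
    static int encode(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be >= 0");
        }
        int numBits = 32 - Integer.numberOfLeadingZeros(length);
        if (numBits <= 4) {
            return length; // 0..15 are stored exactly
        }
        int shift = numBits - 4;                   // how many low bits we drop
        int mantissa = (length >>> shift) & 0x07;  // top 4 bits, minus the implicit leading 1
        return ((shift + 1) << 3) | mantissa;      // shift + 1 so that 0 means "exact small value"
    }

    // Recover the (rounded-down) length from a code produced by encode().
    static int decode(int code) {
        int shiftPlusOne = code >>> 3;
        if (shiftPlusOne == 0) {
            return code; // 0..7 were stored as-is
        }
        int shift = shiftPlusOne - 1;
        return ((code & 0x07) | 0x08) << shift;    // restore the implicit leading 1
    }

    public static void main(String[] args) {
        int[] lengths = {3, 15, 17, 100, 103};
        for (int len : lengths) {
            // e.g. 3 -> 3, 17 -> 16, 100 -> 96, 103 -> 96
            System.out.println(len + " -> " + decode(encode(len)));
        }
    }
}
```

Under this (assumed) scheme a filter like "3 or more words" would be unaffected, because small counts round-trip exactly; the loss only shows up for longer fields, where 100 and 103 tokens become indistinguishable.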

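The token-counting filter idea from earlier in the thread, together with the "countTokensInTheSamePosition" question, can be sketched without Lucene on the classpath. Everything here is hypothetical: `Tok` stands in for an analyzed token with its position increment, and the `__len_N` marker is just one made-up way to encode the length as a special token. A real implementation would be a Lucene `TokenFilter`, which this sketch does not attempt to reproduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the idea; not a Lucene TokenFilter.
final class TokenLengthSketch {

    /** Stand-in for one analyzed token. */
    static final class Tok {
        final String text;
        final int posInc; // 0 means "stacked on the previous token", e.g. a synonym

        Tok(String text, int posInc) {
            this.text = text;
            this.posInc = posInc;
        }
    }

    // Count tokens; when the flag is false, tokens that share a position
    // with the previous one (posInc == 0) are not counted.
    static int count(List<Tok> tokens, boolean countTokensInTheSamePosition) {
        int n = 0;
        for (Tok t : tokens) {
            if (countTokensInTheSamePosition || t.posInc > 0) {
                n++;
            }
        }
        return n;
    }

    // Return a copy of the stream with a special length token appended.
    static List<Tok> withLengthToken(List<Tok> tokens, boolean countTokensInTheSamePosition) {
        List<Tok> out = new ArrayList<>(tokens);
        out.add(new Tok("__len_" + count(tokens, countTokensInTheSamePosition), 1));
        return out;
    }

    // "the quick fox", with "fast" injected as a synonym of "quick".
    static List<Tok> example() {
        return Arrays.asList(
                new Tok("the", 1),
                new Tok("quick", 1),
                new Tok("fast", 0),
                new Tok("fox", 1));
    }

    public static void main(String[] args) {
        List<Tok> toks = example();
        System.out.println(count(toks, true));   // 4: the synonym is counted
        System.out.println(count(toks, false));  // 3: one per position
        List<Tok> marked = withLengthToken(toks, false);
        System.out.println(marked.get(marked.size() - 1).text); // __len_3
    }
}
```

Note that this sketch counts one per occupied position and ignores gaps left by removed stopwords (position increments greater than 1); whether those should count is one of the open questions raised above. A plain string marker like `__len_3` also only supports exact matches, so a "3 or more" search would need a different encoding or a separate numeric field.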