MIME-Version: 1.0
References: <CAKAo3tQMUMun4ouAKCKHBLsimewEpu3+7g-rJy8+VzEvJZUq6Q@mail.gmail.com>
 <CAGUSZHC4bw7bNDQnyGGZHbWC8d1qHfYh0BJQRorfdgQU7DZJcA@mail.gmail.com>
 <CAKAo3tTZF0WQfuaJrA48dRi0VXJFkbZ503fatGm7_pmLwoZ5cA@mail.gmail.com>
 <2D247EE3-B9B7-4C8D-A54D-DE10D1825D1F@gmail.com> <CAL8PwkYey9ZZBiaw0+3ncqxaffzLsqR95o1_Tg4om1WW3uxquw@mail.gmail.com>
In-Reply-To: <CAL8PwkYey9ZZBiaw0+3ncqxaffzLsqR95o1_Tg4om1WW3uxquw@mail.gmail.com>
From: Matt Davis <kryptonics411@gmail.com>
Date: Thu, 2 Jan 2020 10:58:35 -0500
Message-ID: <CAKAo3tSMUNA13r7uNwt1W+DkkUmM86LVtV9Jv3YEAtMo3Qnmiw@mail.gmail.com>
Subject: Re: Searching number of tokens in text field
To: java-user@lucene.apache.org
Content-Type: multipart/alternative; boundary="000000000000451f1d059b2a4993"

--000000000000451f1d059b2a4993
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Thanks Mike, that is very helpful.  Am I reading the code correctly that the
lossy encoding of the norms is done in the Similarity?  How do you set the
number of bytes used for the norms?
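
To make the question concrete, this is the kind of lossy one-byte length
encoding I have in mind.  It is only an illustrative sketch: the class and
method names are made up, and the bit layout (5-bit shift, 3-bit mantissa) is
my own assumption, not Lucene's actual encoding.

```java
// Hypothetical sketch of a lossy one-byte length encoding (NOT Lucene's code).
class LossyLength {

    // Encode a non-negative token count into one byte:
    // high 5 bits = number of dropped low bits, low 3 bits = top bits of the value.
    static byte encode(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be >= 0");
        }
        int numBits = 32 - Integer.numberOfLeadingZeros(length);
        if (numBits <= 3) {
            return (byte) length;            // 0..7 are stored exactly
        }
        int shift = numBits - 3;             // low bits that get thrown away
        int mantissa = length >>> shift;     // the 3 most significant bits
        return (byte) ((shift << 3) | mantissa);
    }

    // Decode back to the largest representable value <= the original length.
    static int decode(byte b) {
        int v = b & 0xFF;
        return (v & 0x07) << (v >>> 3);
    }

    public static void main(String[] args) {
        for (int len : new int[] {3, 9, 100, 1000}) {
            System.out.println(len + " -> " + decode(encode(len)));
        }
    }
}
```

With this layout a field of 100 tokens decodes as 96, so a filter like "3 or
more tokens" would be exact but "100 or more tokens" would not be.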

Thanks,
Matt

On Thu, Jan 2, 2020 at 10:31 AM Michael McCandless <lucene@mikemccandless.com> wrote:

> Norms encode the number of tokens in the field, but in a lossy manner (1
> byte by default), so you could probably create a custom query that filtered
> based on that, if you could tolerate the loss in precision?  Or maybe
> change your norms storage to more precision?
>
> You could use NormsFieldExistsQuery as a starting point for the sources for
> your custom query.  Or maybe there's already a more similar Query based on
> norms?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Dec 30, 2019 at 8:07 AM Erick Erickson <erickerickson@gmail.com> wrote:
>
> > This comes up occasionally, it’d be a neat thing to add to Solr if you’re
> > motivated. It gets tricky though.
> >
> > - part of the config would have to be the name of the length field to put
> > the result into, that part’s easy.
> >
> > - The trickier part is “when should the count be incremented?”. For
> > instance, say you add 15 synonyms for a particular word. Would that add 1
> > or 16 to the count? What about WordDelimiterGraphFilterFactory, which can
> > output N tokens in place of one? Do stopwords count? What about shingles?
> > CJK languages? The list goes on.
> >
> > If you tackle this I suggest you open a JIRA for discussion, probably a
> > Lucene JIRA ‘cause the folks who deal with Lucene would have the best
> > feedback. And probably ignore most of the possible interactions with other
> > filters and document that most users should just put it immediately after
> > the tokenizer and leave it at that ;)
> >
> > I can think of a few other options, but about the only thing that I think
> > makes sense is something like “countTokensInTheSamePosition=true|false”
> > (there’s _GOT_ to be a better name for that!), defaulting to false so you
> > could control whether synonym expansion and WDGFF insertions incremented
> > the count or not. And I suspect that if you put such a filter after WDGFF,
> > you’d also want to document that it should go after
> > FlattenGraphFilterFactory, but trust any feedback on a Lucene JIRA over my
> > suspicion...
> >
> > Best,
> > Erick
> >
> > > On Dec 29, 2019, at 7:57 PM, Matt Davis <kryptonics411@gmail.com> wrote:
> > >
> > > That is a clever idea.  I would still prefer something cleaner but this
> > > could work.  Thanks!
> > >
> > > On Sat, Dec 28, 2019 at 10:11 PM Michael Sokolov <msokolov@gmail.com> wrote:
> > >
> > >> I don't know of any pre-existing thing that does exactly this, but how
> > >> about a token filter that counts tokens (or positions maybe), and then
> > >> appends some special token encoding the length?
> > >>
> > >> On Sat, Dec 28, 2019, 9:36 AM Matt Davis <kryptonics411@gmail.com> wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I was wondering if it is possible to search for the number of tokens in a
> > >>> text field.  For example, find book titles with 3 or more words.  I don't
> > >>> mind adding a field that is the number of tokens to the search index, but I
> > >>> would like to avoid analyzing the text two times.  Can Lucene search for
> > >>> the number of tokens in a text field?  Or can I get the number of tokens
> > >>> after analysis and add it to the Lucene document before/during indexing?
> > >>> Or do I need to analyze the text myself and add the field to the document
> > >>> (analyzing the text twice, once myself, once in the IndexWriter)?
> > >>>
> > >>> Thanks,
> > >>> Matt Davis
> > >>>
> > >>
>
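
For reference, here is a rough sketch of how I read the two suggestions quoted
above: count the tokens, optionally skipping tokens stacked on the same
position (synonyms and the like), then append a special token encoding the
length.  This is plain Java, not a real Lucene TokenFilter; the Token class,
the method names, and the "__len_" prefix are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an analyzed token stream (NOT the Lucene API).
class TokenCountSketch {

    static final class Token {
        final String text;
        final int positionIncrement; // 0 means "same position as the previous token"

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }
    }

    // Count tokens; when the flag is false, tokens that share a position
    // with the previous token (e.g. injected synonyms) are not counted.
    static int count(List<Token> tokens, boolean countTokensInTheSamePosition) {
        int n = 0;
        for (Token t : tokens) {
            if (countTokensInTheSamePosition || t.positionIncrement > 0) {
                n++;
            }
        }
        return n;
    }

    // Return a copy of the stream with a trailing marker token such as "__len_2".
    static List<Token> appendLengthToken(List<Token> tokens, boolean countTokensInTheSamePosition) {
        List<Token> out = new ArrayList<>(tokens);
        out.add(new Token("__len_" + count(tokens, countTokensInTheSamePosition), 1));
        return out;
    }

    public static void main(String[] args) {
        List<Token> title = Arrays.asList(
                new Token("fast", 1),
                new Token("quick", 0),  // synonym stacked on "fast"
                new Token("rapid", 0),  // synonym stacked on "fast"
                new Token("car", 1));
        System.out.println(count(title, false));
        System.out.println(count(title, true));
    }
}
```

The marker token could then be matched with an ordinary term query, which
avoids analyzing the text twice, at the cost of one extra term per document.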

--000000000000451f1d059b2a4993--