From: Matt Davis
Date: Thu, 2 Jan 2020 10:58:35 -0500
Subject: Re: Searching number of tokens in text field
To: java-user@lucene.apache.org

Thanks Mike that is very helpful. Am I reading the code correctly that
the norm lossy encoding is done in the similarity? How do you set the
number of bytes used for the norms?

Thanks,
Matt

On Thu, Jan 2, 2020 at 10:31 AM Michael McCandless
<lucene@mikemccandless.com> wrote:

> Norms encode the number of tokens in the field, but in a lossy manner
> (1 byte by default), so you could probably create a custom query that
> filtered based on that, if you could tolerate the loss in precision?
> Or maybe change your norms storage to more precision?
>
> You could use NormsFieldExistsQuery as a starting point for the
> sources for your custom query. Or maybe there's already a more
> similar Query based on norms?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Mon, Dec 30, 2019 at 8:07 AM Erick Erickson wrote:
>
> > This comes up occasionally, it’d be a neat thing to add to Solr if
> > you’re motivated. It gets tricky though.
> >
> > - part of the config would have to be the name of the length field
> >   to put the result into, that part’s easy.
> >
> > - The trickier part is “when should the count be incremented?”. For
> >   instance, say you add 15 synonyms for a particular word. Would
> >   that add 1 or 16 to the count? What about
> >   WordDelimiterGraphFilterFactory, that can output N tokens in place
> >   of one. Do stopwords count? What about shingles? CJK languages?
> >   The list goes on.
> >
> > If you tackle this I suggest you open a JIRA for discussion,
> > probably a Lucene JIRA ‘cause the folks who deal with Lucene would
> > have the best feedback.
> > And probably ignore most of the possible interactions with other
> > filters and document that most users should just put it
> > immediately after the tokenizer and leave it at that ;)
> >
> > I can think of a few other options, but about the only thing that I
> > think makes sense is something like
> > “countTokensInTheSamePosition=true|false” (there’s _GOT_ to be a
> > better name for that!), defaulting to false so you could control
> > whether synonym expansion and WDGFF insertions incremented the count
> > or not. And I suspect that if you put such a filter after WDGFF,
> > you’d also want to document that it should go after
> > FlattenGraphFilterFactory, but trust any feedback on a Lucene JIRA
> > over my suspicion...
> >
> > Best,
> > Erick
> >
> > > On Dec 29, 2019, at 7:57 PM, Matt Davis wrote:
> > >
> > > That is a clever idea. I would still prefer something cleaner but
> > > this could work. Thanks!
> > >
> > > On Sat, Dec 28, 2019 at 10:11 PM Michael Sokolov wrote:
> > >
> > >> I don't know of any pre-existing thing that does exactly this,
> > >> but how about a token filter that counts tokens (or positions
> > >> maybe), and then appends some special token encoding the length?
> > >>
> > >> On Sat, Dec 28, 2019, 9:36 AM Matt Davis wrote:
> > >>
> > >>> Hello,
> > >>>
> > >>> I was wondering if it is possible to search for the number of
> > >>> tokens in a text field. For example find book titles with 3 or
> > >>> more words. I don't mind adding a field that is the number of
> > >>> tokens to the search index but I would like to avoid analyzing
> > >>> the text two times. Can Lucene search for the number of tokens
> > >>> in a text field? Or can I get the number of tokens after
> > >>> analysis and add it to the Lucene document before/during
> > >>> indexing?
> > >>> Or do I need to analyze the text myself and add the field to the
> > >>> document (analyze the text twice, once myself, once in the
> > >>> IndexWriter)?
> > >>>
> > >>> Thanks,
> > >>> Matt Davis
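
To make the "1 or 16?" question from the thread concrete, here is a
minimal plain-Java sketch of the two counting policies. It does not use
Lucene at all: the Token class, the countTokens method and the sample
title are hypothetical stand-ins invented for illustration. The only
name taken from the thread is the countTokensInTheSamePosition flag. A
real implementation would presumably be a Lucene TokenFilter reading
PositionIncrementAttribute, but that part is not shown here.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Plain-Java illustration (no Lucene dependency) of counting tokens versus
 * counting positions. All names here are hypothetical.
 */
final class TokenCountSketch {

    /** Hypothetical stand-in for one token produced by an analysis chain. */
    static final class Token {
        final String term;
        // 0 means "stacked on the previous token's position" (e.g. a synonym);
        // 1 or more means the token starts a new position.
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Counts tokens under one of two policies.
     * If countTokensInTheSamePosition is true, every emitted token counts.
     * If false, only tokens that advance the position count, so stacked
     * synonyms collapse into one. A gap left by a removed stopword
     * (increment greater than 1) still counts as a single token here;
     * that is an assumption of this sketch, not a decision from the thread.
     */
    static int countTokens(List<Token> tokens, boolean countTokensInTheSamePosition) {
        int count = 0;
        for (Token t : tokens) {
            if (countTokensInTheSamePosition || t.positionIncrement > 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // "fast car" with two synonyms stacked on "fast".
        List<Token> title = Arrays.asList(
                new Token("fast", 1),
                new Token("quick", 0),
                new Token("rapid", 0),
                new Token("car", 1));

        System.out.println(countTokens(title, false)); // prints 2
        System.out.println(countTokens(title, true));  // prints 4
    }
}
```

With the flag defaulting to false, as suggested in the thread, a
two-word title still counts as two words no matter how many synonyms
are expanded at index time.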