From: Erick Erickson
To: solr-user@lucene.apache.org
Subject: Re: Enabling/disabling docValues
Date: Mon, 10 Jun 2019 10:54:56 -0700

bq. Does lucene look at %docs in each state, or the first doc or something else?

Frankly I don't care since, no matter what, the results of faceting on mixed definitions are not useful.

tl;dr:

"When I use a word," Humpty Dumpty said in rather a scornful tone, "it means just what I choose it to mean -- neither more nor less."

So "undefined" in this case means "I don't see any value at all in chasing that info down" ;).

Changing from regular text to SortableText means that the results will be inaccurate no matter what. For example, I have a doc with the value "my dog has fleas". When NOT using SortableText, there are multiple tokens, so the facet counts would be:

my (1)
dog (1)
has (1)
fleas (1)

But for SortableText they will be:

my dog has fleas (1)

Consider doc1 with "my dog has fleas" and doc2 with "my cat has fleas". doc1 was indexed before switching to SortableText and doc2 after. Presumably the output you want is:

my dog has fleas (1)
my cat has fleas (1)

But you can't get that output. There are three cases:

1> Lucene treats all documents as SortableText, faceting on the docValues parts. No facets on doc1.

my cat has fleas (1)

2> Lucene treats all documents as tokenized, faceting on each individual token.
Faceting is performed on the tokenized content of both; the docValues in doc2 are ignored.

my (2)
dog (1)
has (2)
fleas (2)
cat (1)

3> Lucene does the best it can, faceting on the tokens for docs without SortableText and on docValues if the doc was indexed with SortableText. doc1 is faceted on its tokens, doc2 on its docValues.

my (1)
dog (1)
has (1)
fleas (1)
my cat has fleas (1)

Since none of those cases is what I want, there's no point I can see in chasing down what actually happens...

Best,
Erick

P.S. I _think_ Lucene tries to use the definition from the first segment, but since the lists of segments to be merged are built without looking at the field definitions at all, whether the first segment in the list has SortableText or not will not be predictable in a general way, even within a single run.

> On Jun 9, 2019, at 6:53 PM, John Davis wrote:
>
> Understood, however code is rarely random/undefined. Does lucene look at %
> docs in each state, or the first doc or something else?
>
> On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson wrote:
>
>> It's basically undefined. When segments are merged that have dissimilar
>> definitions like this what can Lucene do? Consider:
>>
>> Faceting on a text (not sortable) means that each individual token in the
>> index is uninverted on the Java heap and the facets are computed for each
>> individual term.
>>
>> Faceting on a SortableText field just has a single term per document, and
>> that in the docValues structures as opposed to the inverted index.
>>
>> Now you change the value and start indexing. At some point a segment
>> containing no docValues is merged with a segment containing docValues for
>> the field. The resulting mixed segment is in this state. If you facet on
>> the field, should the docs without docValues have each individual term
>> counted? Or just the SortableText values in the docValues structure?
>> Neither one is right.
>>
>> Also remember that Lucene has no notion of schema. That's entirely imposed
>> on Lucene by Solr carefully constructing low-level analysis chains.
>>
>> So I'd _strongly_ recommend you re-index your corpus to a new collection
>> with the current definition, then perhaps use CREATEALIAS to seamlessly
>> switch.
>>
>> Best,
>> Erick
>>
>>> On Jun 9, 2019, at 12:50 PM, John Davis wrote:
>>>
>>> Hi there,
>>> We recently changed a field from TextField + no docValues to
>>> SortableTextField which has docValues enabled by default. Once I did this I
>>> do not see any facet values for the field. I know that once all the docs
>>> are re-indexed facets should work again, however can someone clarify the
>>> current logic of lucene/solr how facets will be computed when schema is
>>> changed from no docValues to docValues and vice-versa?
>>>
>>> 1. Until ALL the docs are re-indexed, no facets will be returned?
>>> 2. Once certain fraction of docs are re-indexed, those facets will be
>>> returned?
>>> 3. Something else?
>>>
>>>
>>> Varun
>>
>>
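
For anyone who wants to see the three cases from the reply side by side, here is a minimal Python sketch that reproduces the facet counts for the doc1/doc2 example. It is only a toy model and not how Lucene or Solr computes facets: the function names and the "sortable" flag are hypothetical, and a plain whitespace split stands in for the real analysis chain.

```python
from collections import Counter

# Toy corpus from the example in the reply. "sortable" is a hypothetical flag
# marking whether the doc was indexed after the switch to SortableText.
DOCS = [
    {"id": "doc1", "text": "my dog has fleas", "sortable": False},
    {"id": "doc2", "text": "my cat has fleas", "sortable": True},
]


def tokens(text):
    # Assumption: whitespace splitting stands in for the field's analyzer.
    # A set is used so each term counts once per document.
    return set(text.split())


def facet_docvalues_only(docs):
    """Case 1: only docs with a docValues entry contribute, one value per doc."""
    return Counter(d["text"] for d in docs if d["sortable"])


def facet_tokens_only(docs):
    """Case 2: every doc is faceted on its tokens; docValues are ignored."""
    return Counter(tok for d in docs for tok in tokens(d["text"]))


def facet_mixed(docs):
    """Case 3: tokens for pre-switch docs, the whole value for post-switch docs."""
    counts = Counter()
    for d in docs:
        if d["sortable"]:
            counts[d["text"]] += 1
        else:
            counts.update(tokens(d["text"]))
    return counts


if __name__ == "__main__":
    for name, fn in [
        ("case 1", facet_docvalues_only),
        ("case 2", facet_tokens_only),
        ("case 3", facet_mixed),
    ]:
        print(name, dict(fn(DOCS)))
```

None of the three functions can return both "my dog has fleas (1)" and "my cat has fleas (1)", because doc1 never had the whole-value form written for it. That is the reason re-indexing into a new collection is the recommendation in the thread.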