From: Erick Erickson
Subject: Re: Empty rows from /export?
Date: Fri, 31 May 2019 18:08:38 -0400
To: solr-user@lucene.apache.org

docValues are indeed realized in Lucene. It’s just that Lucene has no notion of “schema”. So when you define the schema, Solr carefully constructs the appropriate low-level Lucene calls to take care of all of the options you’ve specified in the schema, things like stored, indexed, docValues etc., when a doc is indexed.

Now we get to optimize. All Solr does is tell Lucene to mash together all the segments and Lucene does its tricks. Lucene assumes it “knows” everything it needs to know by what’s already in the segments it’s merging without reference to Solr’s schema. Therein lies the rub. If one segment has docValues for a field and another segment doesn’t, the result is “interesting”. In general, Lucene can’t reconstruct the original data.

From Robert Muir:

“I think the key issue here is Lucene is an index not a database. Because it is a lossy index and does not retain all of the user's data, its not possible to safely migrate some things automagically. In the norms case IndexWriter needs to re-analyze the text ("re-index") and compute stats to get back the value, so it can be re-encoded. The function is y = f(x) and if x is not available its not possible, so lucene can't do it.”

DocValues is a special case because all the data necessary to build docValues is already in the index, i.e. the indexed data (assuming you originally put it in with indexed=true).
But it requires extra effort, thus the UninvertDocValuesMergePolicyFactory.

>> I was curious if it was safe to change the id field to docValues without reindexing

I’d be very reluctant. It’s not something that’s explicitly tested or supported, so there are likely edge cases.

Best,
Erick

> On May 31, 2019, at 2:02 PM, David Hastings wrote:
>
>> Ah. So docValues are managed by Solr outside of Lucene. Interesting.
>
> i was under the impression docValues are in lucene, and he is just saying that an optimize is not a re-index, its just taking the actual files that already exist in your index and arranging them and removing deletions, an optimize doesnt re-read the schema and re-index content
>
> On Fri, May 31, 2019 at 1:59 PM Walter Underwood wrote:
>
>> Ah. So docValues are managed by Solr outside of Lucene. Interesting.
>>
>> That actually answers a question I had not asked yet. I was curious if it was safe to change the id field to docValues without reindexing if we never sorted on it. It looks like fetching the value won’t work until everything is reindexed.
>>
>> It seems like this would be a useful thing to have supported, migrating a field to docValues.
>>
>> wunder
>> Walter Underwood
>> wunder@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On May 31, 2019, at 5:00 AM, Erick Erickson wrote:
>>>
>>> bq. but I optimized all the cores, which should rewrite every segment as docValues.
>>>
>>> Not true. Optimize is a Lucene level force merge. Dealing with segments, i.e. merging and the like, is a low-level Lucene operation and Lucene has no notion of a schema. So a change you made to the schema is irrelevant to merging.
>>>
>>> You have to have something at the Solr level that does some magic for this to work. Take a look at UninvertDocValuesMergePolicyFactory if you have Solr 7.0 or later.
>>> WARNING: I haven’t used that personally, and I do not know what the behavior would be on an index that is “mixed”, i.e. one that already has segments with some docs having DV entries and some not.
>>>
>>> Best,
>>> Erick
>>>
>>>> On May 31, 2019, at 12:35 AM, Walter Underwood wrote:
>>>>
>>>> That field was changed to docValues, but I optimized all the cores, which should rewrite every segment as docValues.
>>>>
>>>> wunder
>>>> Walter Underwood
>>>> wunder@wunderwood.org
>>>> http://observer.wunderwood.org/ (my blog)
>>>>
>>>>> On May 30, 2019, at 7:37 PM, Erick Erickson wrote:
>>>>>
>>>>> This is odd. The only reason I know of that would happen is if there were no docValues for that field in those documents. By any chance were docValues added to an existing index without totally reindexing into a new collection?
>>>>>
>>>>> What happens if you just query the collection rather than the individual core? I’m thinking using a streaming expression as a check…..
>>>>>
>>>>>> On May 30, 2019, at 6:41 PM, Walter Underwood wrote:
>>>>>>
>>>>>> 3/4 of the documents I’m getting back from /export are empty. This collection has four shards, so I’m querying the leader core on each shard with /export. The results start like this:
>>>>>>
>>>>>> {"numFound":912370,"docs":[{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},{},
>>>>>>
>>>>>> The final 1/4 of the results have UUIDs (the ID type). The id field is stored as docValues. This is the URL.
>>>>>>
>>>>>> http://hostname:8983/solr/decks_shard1_replica1/export?q=id:*&distrib=false&shards=shard1&fl=id&sort=id+asc
>>>>>>
>>>>>> Running 6.6.2, Solr Cloud. The total number of non-null ids from all four shards is a bit less than 1/4 of the document count.
>>>>>>
>>>>>> Any ideas about what is going on?
>>>>>>
>>>>>> wunder
>>>>>> Walter Underwood
>>>>>> wunder@wunderwood.org
>>>>>> http://observer.wunderwood.org/ (my blog)
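
A minimal sketch for measuring the symptom reported at the start of this thread: it counts how many rows in an /export JSON response are empty. It assumes the docs sit under "response" -> "docs", and falls back to a top-level "docs" list like the fragment quoted in the original report. The function name count_empty_docs and the sample payload are hypothetical and not part of Solr.

```python
import json
from typing import Tuple


def count_empty_docs(payload: str) -> Tuple[int, int]:
    """Return (total_docs, empty_docs) for an /export-style JSON body."""
    data = json.loads(payload)
    # Assumption: docs live under "response" -> "docs"; otherwise fall back
    # to a top-level "docs" list, as in the fragment quoted in the thread.
    body = data.get("response", data)
    docs = body.get("docs", [])
    # An empty row is a doc with no fields at all, i.e. {}.
    empty = sum(1 for doc in docs if not doc)
    return len(docs), empty


if __name__ == "__main__":
    # Hypothetical sample: three empty rows and one row with an id.
    sample = (
        '{"response": {"numFound": 4, '
        '"docs": [{}, {}, {}, {"id": "hypothetical-uuid-1"}]}}'
    )
    total, empty = count_empty_docs(sample)
    print(f"{empty} of {total} docs are empty")
```

Feeding it the saved response from each shard's core would show whether the empty rows are concentrated in particular cores, which is what you would expect if only some segments carry docValues for the field.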