From: Erick Erickson
Date: Tue, 2 Jan 2018 06:44:04 -0800
Subject: Re: Comparing two indexes for equality - Finding non stored fieldNames per document
To: java-user
archived-at: Tue, 02 Jan 2018 14:44:51 -0000

Luke has some capabilities to look at the index at a low level; perhaps
that could give you some pointers. I think you can pull the older
branch from here: https://github.com/DmitryKey/luke
or: https://code.google.com/archive/p/luke/

NOTE: This is not a part of Lucene, but an independent project, so it
won't have the same labels.

Best,
Erick

On Tue, Jan 2, 2018 at 2:06 AM, Dawid Weiss wrote:
> Ok. I think you should look at the Java API -- this will give you more
> clarity about what is actually stored in the index
> and how to extract it. The thing (I think) you're missing is that an
> inverted index points in the "other" direction (from a given value to
> all documents that contained it). So unless you "store" that value
> with the document as a stored field, you'll have to "uninvert" the
> index yourself.
>
> Dawid
>
> On Tue, Jan 2, 2018 at 10:05 AM, Chetan Mehrotra
> wrote:
>>> Only stored fields are kept for each document. If you need to dump
>>> internal data structures (terms, positions, offsets, payloads, you
>>> name it) you'll need to dive into the API and traverse all segments,
>>> then dump the above (and note that document IDs are per-segment and
>>> will have to be somehow consolidated back to your document IDs).
>>
>> Okie. So this would require a deeper understanding of the index format.
>> I will have a look. To start with, I was just looking for a way to dump
>> the indexed field names per document and nothing more:
>>
>> /foo/bar|status, lastModified
>> /foo/baz|status, type
>>
>> where the path is a stored field (primary key) and the rest are
>> sorted field names.
>> Then such a file can be generated for both indexes
>> and a diff can be done after sorting.
>>
>>> I don't quite understand the motive here -- the indexes should behave
>>> identically regardless of the order of input documents; what's the
>>> point of dumping all this information?
>>
>> This is because of the way the indexing logic is given access to the
>> Node hierarchy. I will try to provide a brief explanation.
>>
>> Jackrabbit Oak provides hierarchical storage in tree form, where
>> subtrees can be of a specific type.
>>
>> /content/dam/assets/december/banner.png
>>   - jcr:primaryType = "app:Asset"
>>   + jcr:content
>>     - jcr:primaryType = "app:AssetContent"
>>     + metadata
>>       - status = "published"
>>       - jcr:lastModified = "2009-10-9T21:52:31"
>>       - app:tags = ["properties:orientation/landscape",
>>                     "marketing:interest/product"]
>>       - comment = "Image for december launch"
>>       - jcr:title = "December Banner"
>>       + xmpMM:History
>>         + 1
>>           - softwareAgent = "Adobe Photoshop"
>>           - author = "David"
>>     + renditions (nt:folder)
>>       + original (nt:file)
>>         + jcr:content
>>           - jcr:data = ...
>>
>> To access this content Oak provides a NodeStore/NodeState api [1]
>> which provides a way to access the children. The default indexing logic
>> uses this api to read the content to be indexed and uses index rules
>> which allow indexing content via a relative path. For example, it would
>> create a Lucene field "status" which maps to
>> jcr:content/metadata/@status (for an index rule for nodes of type
>> app:Asset).
>>
>> This mode of access proved to be slow over remote storage like Mongo,
>> especially for the full reindexing case. So we implemented a newer
>> approach where all content is dumped to a flat file (1 node per line) ->
>> sorted file, and a NodeState impl is then built over this flat file.
>> This changes the way relative paths work, and thus there may be some
>> bugs in the newer implementation.
>>
>> Hence we need to validate that indexing using the new api produces the
>> same index as using the stable api. For instance, both indexes would
>> have a document for "/content/dam/assets/december/banner.png", but if
>> the newer impl had a bug then it might not have indexed the "status"
>> field.
>>
>> So I am looking for a way to map all fieldNames for a given
>> document. The actual indexed content would be the same if both indexes
>> have the "status" field indexed, so we only need to validate field
>> names per document. Something like
>>
>> Thanks for reading all this if you have read this far :)
>>
>> Chetan Mehrotra
>> [1] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/spi/state/NodeState.java
>>
>>
>> On Tue, Jan 2, 2018 at 2:10 PM, Dawid Weiss wrote:
>>> Only stored fields are kept for each document. If you need to dump
>>> internal data structures (terms, positions, offsets, payloads, you
>>> name it) you'll need to dive into the API and traverse all segments,
>>> then dump the above (and note that document IDs are per-segment and
>>> will have to be somehow consolidated back to your document IDs).
>>>
>>> I don't quite understand the motive here -- the indexes should behave
>>> identically regardless of the order of input documents; what's the
>>> point of dumping all this information?
>>>
>>> Dawid
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
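
To make the "uninvert, dump, diff" idea discussed in this thread a bit more concrete, here is a minimal sketch in plain Java. It deliberately does not use the Lucene API: the Map from field name to document paths is a hypothetical stand-in for whatever you would collect by walking each segment's fields and postings and resolving the stored path for every matching document. The class and method names (FieldNameDump, uninvert, dump, diff) are made up for illustration and do not come from Lucene, Luke or Oak.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

class FieldNameDump {

    // Turn "field -> documents that have a term in it" (the inverted direction)
    // into "document path -> sorted field names" (the per-document direction).
    static SortedMap<String, SortedSet<String>> uninvert(Map<String, Set<String>> fieldToPaths) {
        SortedMap<String, SortedSet<String>> pathToFields = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : fieldToPaths.entrySet()) {
            for (String path : entry.getValue()) {
                pathToFields.computeIfAbsent(path, k -> new TreeSet<>()).add(entry.getKey());
            }
        }
        return pathToFields;
    }

    // One line per document, "path|field1, field2", sorted by path.
    static List<String> dump(SortedMap<String, SortedSet<String>> pathToFields) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, SortedSet<String>> entry : pathToFields.entrySet()) {
            lines.add(entry.getKey() + "|" + String.join(", ", entry.getValue()));
        }
        return lines;
    }

    // Lines only in the stable dump are prefixed "- ", lines only in the candidate "+ ".
    static List<String> diff(List<String> stable, List<String> candidate) {
        Set<String> s = new TreeSet<>(stable);
        Set<String> c = new TreeSet<>(candidate);
        List<String> out = new ArrayList<>();
        for (String line : s) if (!c.contains(line)) out.add("- " + line);
        for (String line : c) if (!s.contains(line)) out.add("+ " + line);
        return out;
    }

    private static Set<String> paths(String... p) {
        return new HashSet<>(Arrays.asList(p));
    }

    public static void main(String[] args) {
        // Toy data: the "stable" index has status on both documents.
        Map<String, Set<String>> stable = new HashMap<>();
        stable.put("status", paths("/foo/bar", "/foo/baz"));
        stable.put("lastModified", paths("/foo/bar"));
        stable.put("type", paths("/foo/baz"));

        // The "candidate" index simulates a bug: status was not indexed for /foo/bar.
        Map<String, Set<String>> candidate = new HashMap<>(stable);
        candidate.put("status", paths("/foo/baz"));

        for (String line : diff(dump(uninvert(stable)), dump(uninvert(candidate)))) {
            System.out.println(line);
        }
        // prints:
        // - /foo/bar|lastModified, status
        // + /foo/bar|lastModified
    }
}
```

Keying the dump by the stored path rather than by document ID sidesteps the per-segment document ID issue mentioned above, since the two indexes may assign IDs differently. Field names here are sorted with plain String ordering, so the order within a line may differ from the hand-written example earlier in the thread.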