lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Comparing two indexes for equality - Finding non stored fieldNames per document
Date Tue, 02 Jan 2018 14:44:04 GMT
Luke has some capabilities to look at the index at a low level;
perhaps that could give you some pointers. I think you can pull
the older branch from here:
https://github.com/DmitryKey/luke

or:
https://code.google.com/archive/p/luke/

NOTE: This is not a part of Lucene, but an independent project
so it won't have the same labels.

Best,
Erick
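
For readers following the thread: the "uninverting" that Dawid
describes in the quoted message below can be illustrated with a toy
model. This is only a sketch under assumptions. The nested map stands
in for an inverted index (field name -> term -> document ids), and the
names UninvertSketch, fieldNamesPerDoc and dumpLines are hypothetical.
It does not use Lucene's API; against a real index the same walk would
have to be done per segment, with segment-local document ids mapped
back to the stored path.

```java
import java.util.*;

class UninvertSketch {
    // Walk field -> term -> postings and collect, for every document,
    // the set of field names that have at least one term pointing at it.
    // Assumes every doc id in the postings has an entry in pathByDocId.
    static SortedMap<String, SortedSet<String>> fieldNamesPerDoc(
            Map<String, Map<String, List<Integer>>> inverted,
            Map<Integer, String> pathByDocId) {
        SortedMap<String, SortedSet<String>> byPath = new TreeMap<>();
        // Start with every document so that documents without any
        // indexed field still show up in the dump.
        for (String path : pathByDocId.values()) {
            byPath.put(path, new TreeSet<>());
        }
        for (Map.Entry<String, Map<String, List<Integer>>> field : inverted.entrySet()) {
            for (List<Integer> postings : field.getValue().values()) {
                for (int docId : postings) {
                    byPath.get(pathByDocId.get(docId)).add(field.getKey());
                }
            }
        }
        return byPath;
    }

    // Render one "path|field1, field2" line per document, sorted by path.
    static List<String> dumpLines(SortedMap<String, SortedSet<String>> byPath) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, SortedSet<String>> e : byPath.entrySet()) {
            lines.add(e.getKey() + "|" + String.join(", ", e.getValue()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample data mirroring the /foo/bar and /foo/baz example.
        Map<String, Map<String, List<Integer>>> inverted = new HashMap<>();
        inverted.put("status", Collections.singletonMap("published", Arrays.asList(0, 1)));
        inverted.put("lastModified", Collections.singletonMap("2009", Arrays.asList(0)));
        inverted.put("type", Collections.singletonMap("asset", Arrays.asList(1)));
        Map<Integer, String> paths = new HashMap<>();
        paths.put(0, "/foo/bar");
        paths.put(1, "/foo/baz");
        dumpLines(fieldNamesPerDoc(inverted, paths)).forEach(System.out::println);
    }
}
```

The sorted maps and sets keep the output stable, so two dumps can be
compared line by line regardless of the order in which documents were
indexed.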

On Tue, Jan 2, 2018 at 2:06 AM, Dawid Weiss <dawid.weiss@gmail.com> wrote:
> Ok. I think you should look at the Java API -- this will give you more
> clarity of what is actually stored in the index
> and how to extract it. The thing (I think) you're missing is that an
> inverted index points in the "other" direction (from a given value to
> all documents that contained it). So unless you "store" that value
> with the document as a stored field, you'll have to "uninvert" the
> index yourself.
>
> Dawid
>
> On Tue, Jan 2, 2018 at 10:05 AM, Chetan Mehrotra
> <chetan.mehrotra@gmail.com> wrote:
>>> Only stored fields are kept for each document. If you need to dump
>>> internal data structures (terms, positions, offsets, payloads, you
>>> name it) you'll need to dive into the API and traverse all segments,
>>> then dump the above (and note that document IDs are per-segment and
>>> will have to be somehow consolidated back to your document IDs).
>>
>> OK. So this would require a deeper understanding of the index format.
>> I will have a look. To start with, I was just looking for a way to
>> dump the indexed field names per document and nothing more:
>>
>> /foo/bar|status, lastModified
>> /foo/baz|status, type
>>
>> Here the path is a stored field (the primary key) and the rest are
>> the sorted field names. Such a file can then be generated for both
>> indexes, and a diff can be done after sorting.
>>
>>> I don't quite understand the motive here -- the indexes should behave
>>> identically regardless of the order of input documents; what's the
>>> point of dumping all this information?
>>
>> This is because of the way the indexing logic is given access to the
>> Node hierarchy. I will try to provide a brief explanation.
>>
>> Jackrabbit Oak provides hierarchical storage in tree form, where
>> subtrees can be of a specific type.
>>
>> /content/dam/assets/december/banner.png
>>   - jcr:primaryType = "app:Asset"
>>   + jcr:content
>>     - jcr:primaryType = "app:AssetContent"
>>     + metadata
>>       - status = "published"
>>       - jcr:lastModified = "2009-10-9T21:52:31"
>>       - app:tags = ["properties:orientation/landscape",
>>                     "marketing:interest/product"]
>>       - comment = "Image for december launch"
>>       - jcr:title = "December Banner"
>>       + xmpMM:History
>>         + 1
>>           - softwareAgent = "Adobe Photoshop"
>>           - author = "David"
>>     + renditions (nt:folder)
>>       + original (nt:file)
>>         + jcr:content
>>           - jcr:data = ...
>>
>> To access this content, Oak provides a NodeStore/NodeState API [1],
>> which provides a way to access the children. The default indexing
>> logic uses this API to read the content to be indexed, together with
>> index rules that allow content to be indexed via a relative path. For
>> example, it would create a Lucene field "status" which maps to
>> jcr:content/metadata/@status (for an index rule for nodes of type
>> app:Asset).
>>
>> This mode of access proved to be slow over remote storage like Mongo,
>> especially for the full reindexing case. So we implemented a newer
>> approach where all content is dumped to a flat file (1 node per
>> line), the file is sorted, and a NodeState impl is then provided over
>> this flat file. This changes how relative paths work, and thus there
>> may be some bugs in the newer implementation.
>>
>> Hence we need to validate that indexing with the new API produces the
>> same index as indexing with the stable API. For example, both indexes
>> would have a document for "/content/dam/assets/december/banner.png",
>> but if the newer impl had a bug it may not have indexed the "status"
>> field.
>>
>> So I am looking for a way to map all field names for a given
>> document. The actual indexed content would be the same if both
>> indexes have the "status" field indexed, so we only need to validate
>> field names per document, something like the path|fieldNames listing
>> shown earlier.
>>
>> Thanks for reading all this if you have read so far :)
>>
>> Chetan Mehrotra
>> [1] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/spi/state/NodeState.java
>>
>>
>> On Tue, Jan 2, 2018 at 2:10 PM, Dawid Weiss <dawid.weiss@gmail.com> wrote:
>>> Only stored fields are kept for each document. If you need to dump
>>> internal data structures (terms, positions, offsets, payloads, you
>>> name it) you'll need to dive into the API and traverse all segments,
>>> then dump the above (and note that document IDs are per-segment and
>>> will have to be somehow consolidated back to your document IDs).
>>>
>>> I don't quite understand the motive here -- the indexes should behave
>>> identically regardless of the order of input documents; what's the
>>> point of dumping all this information?
>>>
>>> Dawid
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
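
The comparison step Chetan outlines (one "path|field names" line per
document for each index, then a diff) could be sketched as below. This
is an illustration, not code from the thread: the class FieldDumpDiff,
its methods and the report wording are hypothetical, and it assumes
that paths contain no '|' character and that field names contain no
commas.

```java
import java.util.*;

class FieldDumpDiff {
    // Parse "path|field1, field2" lines into path -> sorted field names.
    static SortedMap<String, SortedSet<String>> parse(List<String> lines) {
        SortedMap<String, SortedSet<String>> out = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) continue;
            int bar = line.indexOf('|');
            if (bar < 0) throw new IllegalArgumentException("no '|' in: " + line);
            SortedSet<String> fields = new TreeSet<>();
            for (String f : line.substring(bar + 1).split(",")) {
                if (!f.trim().isEmpty()) fields.add(f.trim());
            }
            out.put(line.substring(0, bar), fields);
        }
        return out;
    }

    // Report documents and field names present in one dump but not the other.
    // "stable" is the dump from the reference index, "candidate" the new one.
    static List<String> diff(List<String> stable, List<String> candidate) {
        SortedMap<String, SortedSet<String>> a = parse(stable);
        SortedMap<String, SortedSet<String>> b = parse(candidate);
        SortedSet<String> paths = new TreeSet<>(a.keySet());
        paths.addAll(b.keySet());
        List<String> report = new ArrayList<>();
        for (String p : paths) {
            if (!b.containsKey(p)) { report.add(p + ": document missing from new index"); continue; }
            if (!a.containsKey(p)) { report.add(p + ": document only in new index"); continue; }
            SortedSet<String> missing = new TreeSet<>(a.get(p));
            missing.removeAll(b.get(p));
            SortedSet<String> extra = new TreeSet<>(b.get(p));
            extra.removeAll(a.get(p));
            if (!missing.isEmpty()) report.add(p + ": missing fields " + missing);
            if (!extra.isEmpty()) report.add(p + ": extra fields " + extra);
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> stable = Arrays.asList("/foo/bar|lastModified, status", "/foo/baz|status, type");
        List<String> candidate = Arrays.asList("/foo/bar|lastModified", "/foo/baz|status, type");
        diff(stable, candidate).forEach(System.out::println);
    }
}
```

Because parsing normalizes the order of both documents and field
names, the two dump files do not need to be sorted beforehand. For
dumps too large to hold in memory, sorting the files externally and
running a plain line diff would be the alternative.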


