lucene-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-4658) Per-segment tracking of external/side-car data
Date Sun, 17 Mar 2013 17:21:14 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-4658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13604671#comment-13604671 ]

Shai Erera commented on LUCENE-4658:
------------------------------------

bq. ExternalField maybe?

I think of ExternalField as something that resides outside the index, while CustomField is
part of the index. Therefore I prefer custom vs external, but that's just naming.

First, this issue may not be used by facets at all. And I agree with Robert that there's no
point in making two implementations of a custom data format. Today we have payloads and
BDV (BinaryDocValues) as enablers to encode arbitrary data into a byte[] (BDV is faster).
I think that should be enough, as long as what you want is per-document custom data.

But if you want to encode per-segment global data (e.g. a taxonomy, a graph), then BDV and
payloads are not the right API, as they are per-document. Rather, I think it would be good if
we had this CustomDataFormat, which is completely opaque to Lucene yet gives the app a lot
of flexibility. A CustomField passed on Documents takes an Object (at least in my scenarios,
these per-document data make up the larger per-segment data structure). CustomDataFormat
encodes them however it needs, and is also responsible for merging across segments. IR
(IndexReader) gives you a CustomData back. That's it. Your app can then cast and work with
that data however it wants. We can have getCustomData take a field, in case you want to
encode two such structures, but we don't need to at first.
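
To make the proposed flow concrete, here is a minimal sketch in plain Java. The names
CustomData and CustomDataFormat come from the comment above; everything else (the method
signatures, LabelSet, LabelSetFormat, and the use of in-memory objects instead of index
files) is an assumption made for illustration. None of these types exist in Lucene.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these types are part of Lucene.
class CustomDataSketch {

  /** Opaque per-segment data; Lucene would never look inside it. */
  interface CustomData {}

  /** App-defined format: builds one segment's data and merges segments. */
  interface CustomDataFormat {
    /** Assumed hook: called at flush with the Objects carried by each CustomField. */
    CustomData encode(List<Object> perDocumentValues);

    /** Assumed hook: called when segments are merged. */
    CustomData merge(List<CustomData> segments);
  }

  /** Toy per-segment structure: the set of labels seen in the segment. */
  static final class LabelSet implements CustomData {
    final Set<String> labels;

    LabelSet(Set<String> labels) {
      this.labels = labels;
    }
  }

  /** Toy format that folds per-document labels into one per-segment set. */
  static final class LabelSetFormat implements CustomDataFormat {
    @Override
    public CustomData encode(List<Object> perDocumentValues) {
      Set<String> labels = new TreeSet<>();
      for (Object value : perDocumentValues) {
        labels.add((String) value); // the format knows what the app put in
      }
      return new LabelSet(labels);
    }

    @Override
    public CustomData merge(List<CustomData> segments) {
      Set<String> merged = new TreeSet<>();
      for (CustomData segment : segments) {
        merged.addAll(((LabelSet) segment).labels);
      }
      return new LabelSet(merged);
    }
  }

  public static void main(String[] args) {
    CustomDataFormat format = new LabelSetFormat();
    // Two "segments", each built from the values of its documents.
    CustomData seg1 = format.encode(Arrays.<Object>asList("sports", "news"));
    CustomData seg2 = format.encode(Arrays.<Object>asList("news", "tech"));
    CustomData merged = format.merge(Arrays.asList(seg1, seg2));
    // The app, not the library, knows the concrete type and casts.
    LabelSet labels = (LabelSet) merged;
    System.out.println(labels.labels);
  }
}
```

The point of the sketch is the division of labor: the library side only ever sees the
opaque CustomData type and calls the format's hooks, while the app casts to its own type.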

If for some reason the app needs custom data per-document and can work with neither payloads
nor BDV, then it needs a CustomData type that exposes a per-document API. In either
case, Lucene should not care what's in that data, except in the indexing chain (to call the
right format's API) and during merge (to invoke CustomDataFormat.merge()).
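
As a rough illustration of the per-document case, the sketch below shows an app-defined
subtype of CustomData that adds a docID-based accessor. PerDocumentCustomData,
ArrayPerDocData, and the static getCustomData() stand-in are hypothetical names invented
here, not Lucene APIs; the reader call is simulated with an in-memory array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: these types are illustrative and not part of Lucene.
class PerDocCustomDataSketch {

  /** Opaque type, as the reader would hand it back. */
  interface CustomData {}

  /** App-defined subtype that adds a per-document accessor. */
  interface PerDocumentCustomData extends CustomData {
    byte[] get(int docID);
  }

  /** Trivial in-memory implementation backed by an array indexed by docID. */
  static final class ArrayPerDocData implements PerDocumentCustomData {
    private final byte[][] values;

    ArrayPerDocData(byte[][] values) {
      this.values = values;
    }

    @Override
    public byte[] get(int docID) {
      return values[docID];
    }
  }

  /** Stand-in for the reader call; it exposes only the opaque type. */
  static CustomData getCustomData() {
    return new ArrayPerDocData(new byte[][] {
        "red".getBytes(StandardCharsets.UTF_8),
        "blue".getBytes(StandardCharsets.UTF_8)
    });
  }

  public static void main(String[] args) {
    CustomData opaque = getCustomData();
    // Only the app knows that this data offers per-document access.
    PerDocumentCustomData perDoc = (PerDocumentCustomData) opaque;
    System.out.println(new String(perDoc.get(1), StandardCharsets.UTF_8));
  }
}
```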

I hope that's enough?
                
> Per-segment tracking of external/side-car data
> ----------------------------------------------
>
>                 Key: LUCENE-4658
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4658
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>         Attachments: LUCENE-4658.patch, LUCENE-4658.patch
>
>
> Spinoff from David's idea on LUCENE-4258
> (https://issues.apache.org/jira/browse/LUCENE-4258?focusedCommentId=13534352&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13534352)
> I made a prototype patch that allows custom per-segment "side-car
> data".  It adds an abstract ExternalSegmentData class.  The idea is
> the app implements this, and IndexWriter will pass each Document
> through to it, and call on it to do flushing/merging.  I added a
> setter to IndexWriterConfig to enable it, but I think this would
> really belong in Codec ...
> I haven't tackled the read-side yet, though this is already usable
> without that (ie, the app can just open its own files, read them,
> etc.).
> The random test case passes.
> I think for example this might make it easier for Solr/ElasticSearch
> to implement things like ExternalFileField.


