lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
Date Tue, 01 Mar 2011 18:42:37 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13001011#comment-13001011 ]

Michael McCandless commented on LUCENE-2881:
--------------------------------------------


This is a big patch!  Some comments...:

  * Do we even use FieldInfos.add(Document)?  Maybe we can remove
    it...

  * FieldNumberBiMap.addOrGet doesn't need to take the FieldInfoBiMap?
    (It's an unused param).

  * Why create the FieldInfoBiMap class?  Ie, why not "merge" that
    back into FieldInfos itself (like it used to be)?  (I understand
    why we need FieldNumberBiMap -- so we can share a single instance
    across FieldInfos).

  * We mix up autoboxing then unboxing, eg
    FieldInfos.addOrUpdateInternal takes int preferredFieldNumber,
    which is boxed when calling localFieldInfos.nextFieldNumber, then
    manually unboxed on calling globalFieldNumbers.addOrGet.

  * On working with a pre-4.0 index that has non-congruent assignments
    across segments... I fear we may not necessarily ever "stabilize" on
    a fixed global name/number bimap, because we re-compute this map
    on every IW init?  Ie, when you first open 4.0 IW on pre-4.0
    index, you'll compute a certain global map, and write new segs
    with those bindings.  But say some fields differed in their
    assignment... but then some of those conflicting segs are merged.
    Later, when you open the IW again, you'll get a different global
    map?  And write new segments conflicting with the previous new
    segments you had written?

  * The fact that SegmentInfo.clearFilesCache is now a public API and
    the consumer is responsible for knowing when to call it
    is... spooky.  Previously this cache was a fully private thing (ie
    invalidated whenever a change was made to the SegmentInfo).  But I
    don't see any way around it; since we now embed a FieldInfos and
    that FieldInfos (hasVectors) could change... maybe, whenever we
    call this method, add a comment explaining why?

  * It makes me nervous that the API that's allowed to pick a new
    field number (FieldInfos.addInternal) is the same API that's used
    when reading a FieldInfos from _X.fnm (when we better not pick a
    different field number!).  In theory of course those
    field numbers will never conflict and we'll always get our
    preferred field number... but still.  Maybe add an assert in
    FieldInfos.read() verifying we always get that field number?

  * The call to localFieldInfos.setIfNotSet in FieldInfos.addInternal
    makes me nervous... is it actually possible for it to already be
    set, to a conflicting binding?  Shouldn't it always match? (Ie,
    above, in addOrUpdateInternal, we just consulted the global map to
    get the binding).  Can't we assert the global binding is either
    not present (and we add it) or if it is present it "matches"?

  * Why do we have FieldInfos.clearVectors?  Nobody should call
    that...?

  * It's not great that we open then close the CFS reader inside
    SegmentInfo, just to read the FieldInfos.  Ie, this means on
    opening an SR we will open this CFS reader twice... it also means
    that opening a SegmentInfos is quite a bit more costly than it
    used to be.  EG creating an IW must now go and open/close a CFS reader
    per-segment... not sure what we can really do about that
    though... maybe, we should store the FieldInfos inside the
    segments file?  Hmmm....

  * Shouldn't IndexWriter.getFieldInfos(SegmentInfo) use the
    SegmentInfo's fieldInfos rather than loading it again from the
    directory...?
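
To make the two "nervous" bullets above concrete, here is a rough sketch
of the check I have in mind.  The class and method names
(FieldNumberSketch, addFromSegmentFile) are made up for illustration and
are not the patch's actual code; only addOrGet borrows its name from the
patch.  It throws instead of using a Java assert, purely so the check
fires even when assertions are disabled:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a global field name <-> number bimap; not the patch's class. */
class FieldNumberSketch {
  private final Map<String, Integer> nameToNumber = new HashMap<>();
  private final Map<Integer, String> numberToName = new HashMap<>();
  private int nextFree = 0;

  /**
   * Returns the number already bound to name; otherwise binds the preferred
   * number if it is free, or the next free number if it is not.
   */
  synchronized int addOrGet(String name, int preferred) {
    Integer existing = nameToNumber.get(name);
    if (existing != null) {
      return existing.intValue();
    }
    int number = preferred;
    if (number < 0 || numberToName.containsKey(number)) {
      while (numberToName.containsKey(nextFree)) {
        nextFree++;
      }
      number = nextFree;
    }
    nameToNumber.put(name, number);
    numberToName.put(number, name);
    return number;
  }

  /**
   * Read path: the number stored in the segment must be honored, so any
   * other binding is treated as a hard error.
   */
  int addFromSegmentFile(String name, int storedNumber) {
    int bound = addOrGet(name, storedNumber);
    if (bound != storedNumber) {
      throw new IllegalStateException(
          "field \"" + name + "\" stored as " + storedNumber + " but bound to " + bound);
    }
    return bound;
  }
}
```

The point is only that the write path may tolerate getting a different
number back, while the read path must not.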

Indentation is also off in various places, for us lonely people who
still use Emacs ;)
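
On the pre-4.0 concern above, this toy model shows the kind of drift I'm
worried about.  It assumes (and this is only an assumption about how the
map gets recomputed, not a claim about the patch) that the global map is
rebuilt from whatever segments exist at IW init, with the first binding
seen winning and conflicts bumped to the next unused number.  Dropping
seg0 from the list stands in for a merge; all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of recomputing a global name -> number map on every writer init. */
class GlobalMapDrift {

  /** First binding seen wins; a conflicting preferred number is bumped until unused. */
  static Map<String, Integer> computeGlobal(List<Map<String, Integer>> segments) {
    Map<String, Integer> global = new LinkedHashMap<>();
    for (Map<String, Integer> segment : segments) {
      for (Map.Entry<String, Integer> e : segment.entrySet()) {
        if (global.containsKey(e.getKey())) {
          continue;
        }
        int number = e.getValue();
        while (global.containsValue(number)) {
          number++;
        }
        global.put(e.getKey(), number);
      }
    }
    return global;
  }

  /** Returns the global map from the first open, then the one from the second open. */
  static List<Map<String, Integer>> demo() {
    Map<String, Integer> seg0 = new LinkedHashMap<>();
    seg0.put("a", 0);
    seg0.put("b", 1);
    Map<String, Integer> seg1 = new LinkedHashMap<>();
    seg1.put("b", 0);
    seg1.put("c", 1);

    // First open: both old segments are present.
    Map<String, Integer> first = computeGlobal(Arrays.asList(seg0, seg1));

    // A new segment is written using the first session's bindings.
    Map<String, Integer> seg2 = new LinkedHashMap<>(first);

    // Second open: seg0 is gone, so seg1's bindings are now seen first.
    Map<String, Integer> second = computeGlobal(Arrays.asList(seg1, seg2));

    List<Map<String, Integer>> result = new ArrayList<>();
    result.add(first);
    result.add(second);
    return result;
  }
}
```

In this model the second session binds "a" and "c" to different numbers
than the ones seg2 was written with, so anything it writes conflicts with
seg2.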


> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: LUCENE-2881.patch, lucene-2881.patch, lucene-2881.patch,
>                      lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global
> field-naming / ordering. IW carries FI instances over from previous segments,
> which also carries over field properties like isIndexed etc. While having
> consistent field ordering per IW session appears to be important due to bulk
> merging stored fields etc., carrying over other properties might become
> problematic with Lucene's Codec support.  Codecs that rely on consistent
> properties in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment
> and field (using the field id within the file name). Yet, if a particular
> segment has no DocValues indexed but a previous segment in the same IW
> session had DocValues, FieldInfo#docValues will be true since those values
> are reused from previous segments.
> We already work around this "limitation" in SegmentInfo with properties like
> hasVectors or hasProx, which are really something we should manage per Codec
> & Segment. Ideally FieldInfo would be managed per Segment and Codec such that
> its properties are valid per segment. It also seems to be necessary to bind
> FieldInfoS to SegmentInfo logically since it's really just per segment
> metadata.
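
The carry-over problem in the quoted description can be sketched in a few
lines.  This is a simplified illustration with invented names (Info,
SessionWide, perSegment), not Lucene's FieldInfo API; it models only a
single docValues flag:

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified contrast between session-wide and per-segment field metadata. */
class PerSegmentFlagsSketch {

  /** Minimal stand-in for a field's metadata; only one property is modeled. */
  static final class Info {
    final String name;
    boolean docValues;

    Info(String name) {
      this.name = name;
    }
  }

  /** Session-wide tracking: one Info per field name, reused by every segment. */
  static final class SessionWide {
    private final Map<String, Info> byName = new HashMap<>();

    Info forSegment(String field, boolean segmentHasDocValues) {
      Info info = byName.computeIfAbsent(field, Info::new);
      // Sticky: once any segment sets the flag, later segments inherit it.
      info.docValues |= segmentHasDocValues;
      return info;
    }
  }

  /** Per-segment tracking: a fresh Info whose flag describes this segment only. */
  static Info perSegment(String field, boolean segmentHasDocValues) {
    Info info = new Info(field);
    info.docValues = segmentHasDocValues;
    return info;
  }
}
```

With SessionWide, a segment that indexed no DocValues for a field still
reports docValues as true if an earlier segment in the session had them;
perSegment reports false for that segment.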

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
