lucene-dev mailing list archives

From "Michael Busch (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2881) Track FieldInfo per segment instead of per-IW-session
Date Mon, 28 Feb 2011 07:30:37 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13000153#comment-13000153 ]

Michael Busch commented on LUCENE-2881:
---------------------------------------

I mentioned on dev that assigning the same field number across segments is best effort now
and wanted to explain in greater detail here how it works:

There is now a global fieldName <-> fieldNumber bi-map in FieldInfos, which contains
all fieldName/number pairs seen in an IndexWriter session.  It is passed into each new
FieldInfos that is created in the same IndexWriter session.

Also, when a new IndexWriter is opened, the FieldInfos of all segments are read and the global
map is built from them - this is tested in a new unit test this issue adds.
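
To illustrate the rebuild step, here is a minimal sketch, not the code from the patch.
GlobalFieldNumbers and fromSegments are hypothetical names, each segment is reduced to a
plain name -> number map, and "the first number seen for a name wins" is an assumption
about how conflicting numbers would be resolved:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the global fieldName <-> fieldNumber map.
// Only the name -> number direction is modelled here.
class GlobalFieldNumbers {
    private final Map<String, Integer> numberByName = new LinkedHashMap<>();

    // Records a pair unless the name is already known.
    // Assumption: the first number seen for a name wins.
    void record(String name, int number) {
        numberByName.putIfAbsent(name, number);
    }

    // Returns the global number for a field name, or null if it was never seen.
    Integer numberOf(String name) {
        return numberByName.get(name);
    }

    // Rebuilds the global map from the per-segment name -> number pairs,
    // visiting the segments in the order given (oldest first).
    static GlobalFieldNumbers fromSegments(List<Map<String, Integer>> segments) {
        GlobalFieldNumbers global = new GlobalFieldNumbers();
        for (Map<String, Integer> segment : segments) {
            for (Map.Entry<String, Integer> e : segment.entrySet()) {
                global.record(e.getKey(), e.getValue());
            }
        }
        return global;
    }
}
```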

In addition to the reference to the global map, a FieldInfos also has a "private" map, which
holds all FieldInfo objects that belong to the corresponding segment (remember there's now
a 1-1 mapping SegmentInfo->FieldInfos).

Now the fieldNumber assignment strategy works as follows:  When a new FI is added to FieldInfos,
the global map is checked for the number of that field.  If the field name hasn't been seen
before, the smallest number available in the *local* map is picked (to keep the numbers dense).

Otherwise, if we have seen the field before, the global number is used.  The problem is that
the global number might already be taken in the local FieldInfos.  In that case the field gets
a different local number, so the global and local numbers for the same fieldName differ.  This
is not a problem in terms of correctness, but it could prevent that field from being
efficiently bulk-merged.

With DocumentsWriterPerThreads (DWPTs) in mind, I don't see how we could guarantee consistent
field numbering across DWPTs; that's why I implemented it in this "best effort" way.

Here's an example of how we can get into a situation where a field would get different numbers
in different segments:
segment_1 has fields A and B, and therefore the mappings A -> 1, B -> 2.
Now in segment_2 the first field we add is C, which has never been seen before, so we pick
number 1 for it locally.  Then we add the next document, which has field A, but since number
1 is already taken in segment_2, A gets a different number than in segment_1.  This means A
would not get bulk merged.
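
The best-effort strategy and the example above can be sketched roughly as follows.  This is
a simplified illustration, not the code from the patch: GlobalNumbers and SegmentFieldInfos
are hypothetical names, numbering starts at 1 to match the example, and falling back to the
smallest free local number when the global number is taken is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical global map shared by all segments of one IndexWriter session.
class GlobalNumbers {
    final Map<String, Integer> numberByName = new HashMap<>();
}

// Hypothetical per-segment structure holding the "private" map.
class SegmentFieldInfos {
    private final GlobalNumbers global;
    private final Map<Integer, String> nameByNumber = new HashMap<>();
    private final Map<String, Integer> numberByName = new HashMap<>();

    SegmentFieldInfos(GlobalNumbers global) {
        this.global = global;
    }

    // Returns the local number of the field, assigning one on first use.
    int add(String name) {
        Integer existing = numberByName.get(name);
        if (existing != null) {
            return existing; // already part of this segment
        }
        Integer globalNumber = global.numberByName.get(name);
        int number;
        if (globalNumber != null && !nameByNumber.containsKey(globalNumber)) {
            number = globalNumber; // seen before and the number is free locally
        } else {
            // Never seen before, or the global number is taken locally
            // (assumed fallback): pick the smallest free local number.
            number = smallestFreeLocalNumber();
        }
        nameByNumber.put(number, name);
        numberByName.put(name, number);
        global.numberByName.putIfAbsent(name, number); // first number wins
        return number;
    }

    private int smallestFreeLocalNumber() {
        int n = 1; // the example numbers fields from 1
        while (nameByNumber.containsKey(n)) {
            n++;
        }
        return n;
    }
}

class BestEffortDemo {
    public static void main(String[] args) {
        GlobalNumbers global = new GlobalNumbers();
        SegmentFieldInfos segment1 = new SegmentFieldInfos(global);
        System.out.println("segment_1 A -> " + segment1.add("A")); // 1
        System.out.println("segment_1 B -> " + segment1.add("B")); // 2

        SegmentFieldInfos segment2 = new SegmentFieldInfos(global);
        System.out.println("segment_2 C -> " + segment2.add("C")); // 1
        System.out.println("segment_2 A -> " + segment2.add("A")); // 2, differs from segment_1
    }
}
```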

Hmm, after writing this example down I'm realizing that it would be better to just always
pick the next available global field number for a new field.  Then, at least until we get
DWPTs, we should never get different numbers across segments, I think?  The disadvantage
would be that a FieldInfos could have "gaps" in its numbers.  I implemented the current
approach because I wanted to avoid those gaps, but having them would probably not be a big deal?
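
A rough sketch of that alternative, again with hypothetical names (GlobalFieldNumberAllocator,
GappyFieldInfos) and numbering from 1 as in the example.  It assumes a single shared
allocator per IndexWriter session and says nothing about how DWPTs would coordinate:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical session-wide allocator: a new field always gets the next
// unused global number, and a known field keeps its number everywhere.
class GlobalFieldNumberAllocator {
    private final Map<String, Integer> numberByName = new HashMap<>();
    private int nextNumber = 1; // the example numbers fields from 1

    synchronized int numberFor(String name) {
        Integer number = numberByName.get(name);
        if (number == null) {
            number = nextNumber++;
            numberByName.put(name, number);
        }
        return number;
    }
}

// Hypothetical per-segment FieldInfos that only uses global numbers,
// so its local numbering may contain gaps.
class GappyFieldInfos {
    private final GlobalFieldNumberAllocator global;
    private final TreeMap<Integer, String> nameByNumber = new TreeMap<>();

    GappyFieldInfos(GlobalFieldNumberAllocator global) {
        this.global = global;
    }

    int add(String name) {
        int number = global.numberFor(name);
        nameByNumber.put(number, name);
        return number;
    }

    // Field numbers used in this segment, in ascending order.
    List<Integer> numbers() {
        return new ArrayList<>(nameByNumber.keySet());
    }
}

class GlobalNumberingDemo {
    public static void main(String[] args) {
        GlobalFieldNumberAllocator global = new GlobalFieldNumberAllocator();
        GappyFieldInfos segment1 = new GappyFieldInfos(global);
        segment1.add("A"); // 1
        segment1.add("B"); // 2

        GappyFieldInfos segment2 = new GappyFieldInfos(global);
        segment2.add("C"); // 3, the next global number
        segment2.add("A"); // 1, same as in segment_1
        System.out.println(segment2.numbers()); // [1, 3]: number 2 is a gap
    }
}
```

With the fields from the example, A keeps number 1 in both segments, at the cost of
segment_2 skipping number 2.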

> Track FieldInfo per segment instead of per-IW-session
> -----------------------------------------------------
>
>                 Key: LUCENE-2881
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2881
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Realtime Branch, CSF branch, 4.0
>            Reporter: Simon Willnauer
>            Assignee: Michael Busch
>             Fix For: Realtime Branch, CSF branch, 4.0
>
>         Attachments: lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch, lucene-2881.patch
>
>
> Currently FieldInfo is tracked per IW session to guarantee consistent global field-naming
> / ordering. IW carries FI instances over from previous segments, which also carries over field
> properties like isIndexed etc. While having consistent field ordering per IW session appears
> to be important due to bulk merging stored fields etc., carrying over other properties might
> become problematic with Lucene's Codec support.  Codecs that rely on consistent properties
> in FI will fail if FI properties are carried over.
> The DocValuesCodec (DocValuesBranch) for instance writes files per segment and field
> (using the field id within the file name). Yet, if a segment has no DocValues indexed
> but a previous segment in the same IW session had DocValues, FieldInfo#docValues
> will be true since those values are reused from previous segments.
> We already work around this "limitation" in SegmentInfo with properties like hasVectors
> or hasProx, which is really something we should manage per Codec & Segment. Ideally FieldInfo
> would be managed per Segment and Codec such that its properties are valid per segment. It
> also seems to be necessary to bind FieldInfoS to SegmentInfo logically since it's really just
> per segment metadata.
