lucene-dev mailing list archives

From "Simon Willnauer (JIRA)" <j...@apache.org>
Subject [jira] Updated: (LUCENE-2742) Enable native per-field codec support
Date Tue, 09 Nov 2010 09:26:09 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-2742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simon Willnauer updated LUCENE-2742:
------------------------------------

    Attachment: LUCENE-2742.patch

Here is a first patch; all tests pass. I changed the CodecProvider interface slightly so that
it can hold per-field codecs as well as a default per-field codec. For simplicity users
can not register their codec directly through the Fieldable interface. Internally I added a
CodecInfo which handles all the ordering and registration per segment / field. For consistency
I bound CodecInfo to FieldInfos since we are now operating per field. A codec can only be
assigned once, the first time we see the codec during FieldInfos creation.

There is a nocommit on Fieldable since it doesn't have javadoc, but let's iterate first to see
if we want to go down that path - it seems close.


> Enable native per-field codec support 
> --------------------------------------
>
>                 Key: LUCENE-2742
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2742
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index, Store
>    Affects Versions: 4.0
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>             Fix For: 4.0
>
>         Attachments: LUCENE-2742.patch
>
>
> Currently the codec name is stored for every segment and PerFieldCodecWrapper is used
to enable codecs per field, which has recently brought up some issues (LUCENE-2740 and LUCENE-2741).
When a codec name is stored, Lucene does not record the actual codec used to encode a field's
postings but rather the "top-level" codec; in such a case the name of the top-level codec is
"PerField" instead of "Pulsing" or "Standard" etc. The way this composite pattern works makes
the indexing part of codecs simpler but also limits its capabilities. By recording the top-level
codec in the segments file we rely on the user to "configure" the PerFieldCodecWrapper correctly
to open a SegmentReader. If a field's codec has changed in the meantime we won't be able to
open the segment.
> The issues LUCENE-2741 and LUCENE-2740 are actually closely related to the way PFCW is
implemented right now. PFCW blindly creates codecs per field on request and at the same time
has no control over the file naming, nor over whether two codec instances are created for
two distinct fields even if the codec is the same. If so, FieldsConsumer will throw
an exception since the files it relies on have already been created.
> Having PerFieldCodecWrapper AND a CodecProvider overcomplicates things IMO. In order
to use per-field codecs a user has to register their custom codecs AND
build a PFCW, which needs to be maintained in "user-land" and must not change incompatibly
once a new IW or IR is created. What I would expect from Lucene is to enable me to register
a codec in a provider and then tell the Field which codec it should use for indexing. For
reading, Lucene should determine the codec automatically once a segment is opened. If the codec
is not available in the provider, that is a different story. Once we instantiate the composite
codec in SegmentReader we get only the codecs that are really used in this segment for
free, which in turn solves LUCENE-2740.
> Yet, instead of relying on the user to configure PFCW I suggest moving the composite codec
functionality inside the core and recording the distinct codecs per segment in the segments info.
We only really need the distinct codecs used in that segment since the codec instance should
be reused to prevent additional files from being created. Let's say we have the following codec
mapping:
> {noformat}
> field_a:Pulsing
> field_b:Standard
> field_c:Pulsing
> {noformat}
> then we create the following mapping:
> {noformat}
> SegmentInfo:
> [Pulsing, Standard]
> PerField:
> [field_a:0, field_b:1, field_c:0]
> {noformat}
> That way we can easily determine which codec is used for which field and build the composite
codec internally on opening SegmentReader. This ordering has another advantage: if, as
in this case, Pulsing and Standard use really the same type of files, we need a way to distinguish
the files used by each codec within a segment. We can in turn pass the codec's ord (implicit in
the SegmentInfo) to the FieldsConsumer on creation to create files named segmentname_ord.ext
(or something similar). This solves LUCENE-2741.
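
The ordinal scheme described above can be sketched in plain Java. This is only an illustration of the idea, not code from the attached patch: the class CodecOrdinals, its methods, and the segment name "seg" are hypothetical, and the file name pattern simply follows the segmentname_ord.ext suggestion from the description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or of LUCENE-2742.patch.
final class CodecOrdinals {
    // Distinct codec names in first-seen order; the list index is the codec's ord.
    final List<String> codecs = new ArrayList<>();
    // Field name -> ord into the codecs list.
    final Map<String, Integer> fieldToOrd = new LinkedHashMap<>();

    // Builds the per-segment codec list and the per-field ords from a
    // field -> codec name mapping, reusing the ord of a codec already seen.
    static CodecOrdinals from(Map<String, String> fieldToCodec) {
        CodecOrdinals result = new CodecOrdinals();
        for (Map.Entry<String, String> e : fieldToCodec.entrySet()) {
            int ord = result.codecs.indexOf(e.getValue());
            if (ord < 0) {
                ord = result.codecs.size();
                result.codecs.add(e.getValue());
            }
            result.fieldToOrd.put(e.getKey(), ord);
        }
        return result;
    }

    // Resolves the codec name for a field, as a reader would when opening a segment.
    String codecFor(String field) {
        Integer ord = fieldToOrd.get(field);
        if (ord == null) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return codecs.get(ord);
    }

    // Assumed naming scheme: segmentname_ord.ext, so codecs that use the same
    // file extensions do not collide within one segment.
    static String fileName(String segment, int ord, String ext) {
        return segment + "_" + ord + "." + ext;
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("field_a", "Pulsing");
        mapping.put("field_b", "Standard");
        mapping.put("field_c", "Pulsing");

        CodecOrdinals ords = from(mapping);
        System.out.println(ords.codecs);      // [Pulsing, Standard]
        System.out.println(ords.fieldToOrd);  // {field_a=0, field_b=1, field_c=0}
        System.out.println(fileName("seg", ords.fieldToOrd.get("field_c"), "ext")); // seg_0.ext
    }
}
```

Because field_a and field_c share ord 0, they would share one codec instance and one set of files, while field_b gets its own files under ord 1.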

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

