lucene-dev mailing list archives

From "Uwe Schindler (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-3490) Restructure codec hierarchy
Date Tue, 01 Nov 2011 02:00:32 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140810#comment-13140810 ]

Uwe Schindler commented on LUCENE-3490:
---------------------------------------

Just some comments:
The current implementation is identical to how nio.Charset and nio.spi.CharsetProvider work in the JDK's
NIO classes. This differs from TIKA, where no "*Provider" class is used and all Parsers are
listed in META-INF.

For nio.Charset we generally have lots of different charset names, and many of them can be
implemented by the same class with different parameters (e.g. there is a single class for all
ISO8859 charsets that just gets the table name as ctor param).
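
To illustrate the provider style, here is a minimal sketch in which many names resolve to one implementation class that differs only in a ctor param. All class names and table names below are made up for illustration; this is not the JDK's actual code, only the same pattern (including returning null for unknown names, as CharsetProvider.charsetForName() does).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical single implementation class shared by many charset names.
final class TableBasedCharset {
    final String name;
    final String tableName;

    TableBasedCharset(String name, String tableName) {
        this.name = name;
        this.tableName = tableName;
    }
}

// Hypothetical provider in the spirit of nio.spi.CharsetProvider: it owns the
// name -> parameter mapping and creates instances on request.
final class TableCharsetProvider {
    private final Map<String, String> tableByName = new LinkedHashMap<>();

    TableCharsetProvider() {
        // Table names are invented placeholders.
        tableByName.put("ISO-8859-1", "table-1");
        tableByName.put("ISO-8859-15", "table-15");
    }

    // Returns null for unknown names.
    TableBasedCharset charsetForName(String name) {
        String table = tableByName.get(name);
        return table == null ? null : new TableBasedCharset(name, table);
    }
}
```

The point is that only the provider needs to be registered; the mapping from name to class plus parameters lives in code.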

If we decide for Codecs that the part that can read an index is always implemented by only
one class (PulsingPostingsFormat is currently the only problematic one) with a default ctor,
we can remove the whole CodecProvider interface and its implementations and instead use ServiceLoader<Codec>
and ServiceLoader<PostingsFormat>, listing all codec classes in META-INF separately (and
not only the provider class). In that case the CodecLoader static helper class would initialize
the name->Codec map on init and use that map both to look up codecs and to list all codec names.
This would simplify the implementation, but would remove the possibility to e.g. encode parameters
into codec names (e.g., "pulsingLucene40" -> new PulsingPostingsFormat(new Lucene40PostingsFormat())
versus "pulsingPFOR" -> new PulsingPostingsFormat(new PFORDeltaPostingsFormat())).
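
A rough sketch of what the ServiceLoader variant could look like. NamedFormat and NamedLoader are hypothetical stand-ins (for Codec/PostingsFormat and for the static helper, respectively), not the classes from the patch; only java.util.ServiceLoader is real API.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.ServiceLoader;
import java.util.Set;

// Hypothetical stand-in for Codec / PostingsFormat: a named service that
// ServiceLoader can instantiate through a public no-arg ctor.
interface NamedFormat {
    String getName();
}

// Hypothetical helper: builds the name -> instance map once, then serves
// lookups and name listings from that map.
final class NamedLoader<S extends NamedFormat> {
    private final Map<String, S> byName;

    // Takes any Iterable, so it can be fed by a ServiceLoader or a plain list.
    NamedLoader(Iterable<S> services) {
        Map<String, S> map = new LinkedHashMap<>();
        for (S service : services) {
            // If two implementations report the same name, the first one wins.
            map.putIfAbsent(service.getName(), service);
        }
        this.byName = Collections.unmodifiableMap(map);
    }

    // Discovers implementations listed under META-INF/services on the classpath.
    static <S extends NamedFormat> NamedLoader<S> fromClasspath(Class<S> type) {
        return new NamedLoader<>(ServiceLoader.load(type));
    }

    S lookup(String name) {
        S service = byName.get(name);
        if (service == null) {
            throw new IllegalArgumentException(
                "Unknown name '" + name + "', available: " + byName.keySet());
        }
        return service;
    }

    Set<String> availableNames() {
        return byName.keySet();
    }
}
```

Each implementation would be listed, one class name per line, in a file named META-INF/services/<fully qualified name of the service type>. Because ServiceLoader can only call a no-arg ctor, a name like "pulsingPFOR" would need its own class that hard-wires the wrapped format, which is exactly the limitation described above.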

Any suggestions? If we want to simplify and remove CodecProvider, I can recode this with minimal
effort. In that case, the chicken-and-egg problem in Lucene40Codec using PostingsFormat.forName()
would also be solved [it's the same problem as the Java7-ICU-bug] :-)
                
> Restructure codec hierarchy
> ---------------------------
>
>                 Key: LUCENE-3490
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3490
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-3490_SPI.patch
>
>
> Spinoff of LUCENE-2621. (Hoping we can do some of the renaming etc here in a rote way to make progress).
> Currently Codec.java only represents a portion of the index, but there are other parts of the index
> (stored fields, term vectors, fieldinfos, ...) that we want under codec control. There is also some
> inconsistency about what a Codec is currently, for example Memory and Pulsing are really just
> PostingsFormats, you might just apply them to a specific field. On the other hand, PreFlex actually
> is a Codec: it represents the Lucene 3.x index format (just not all parts yet). I imagine we would
> like SimpleText to be the same way.
> So, I propose restructuring the classes so that we have something like:
> * CodecProvider <-- codec name to Class resolution only
> * Codec <-- represents the index format (PostingsFormat + FieldsFormat + ...)
> * PostingsFormat: this is what Codec controls today, and Codec will return one of these for a field.
> * FieldsFormat: Stored Fields + Term Vectors + FieldInfos?
> I think for PreFlex, it doesn't make sense to expose its PostingsFormat as a 'public' class, because preflex
> can never be per-field so there is no use in allowing you to configure PreFlex for a specific field.
> Similarly, I think we should do the same thing for SimpleText. Nobody needs SimpleText for production, it should
> just be a Codec where we try to make as much of the index as plain text and simple as possible for debugging/learning/etc.
> So we don't need to expose its PostingsFormat. On the other hand, I don't think we need Pulsing or Memory codecs,
> because it's pretty silly to make your entire index use one of their PostingsFormats. To parallel with analysis:
> PostingsFormat is like Tokenizer and Codec is like Analyzer, and we don't need Analyzers to "show off" every Tokenizer.
> Later, once we abstract FieldInfos reading/writing out of o.a.l.index into codec control, we can also then
> move the baked in PerFieldCodecWrapper out (it would basically be PerFieldPostingsFormat). Privately it would
> write the ids to the file like it does today. All 3.x hairy backwards code would move to PreflexCodec. SimpleTextCodec
> would get a plain text fieldinfos impl, etc.
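
The class layout proposed in the quoted description could be sketched roughly as below. The type names follow the proposal, but every method name and the PerFieldCodec class are invented for illustration and are not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical minimal shapes; real formats would read and write index files.
interface PostingsFormat {
    String getName();
}

interface FieldsFormat {
    String getName();
}

// A Codec represents the whole index format and hands out its parts.
abstract class Codec {
    // The postings format may differ per field.
    abstract PostingsFormat postingsFormatForField(String field);

    // Stored fields (and possibly term vectors, fieldinfos) are codec-wide.
    abstract FieldsFormat fieldsFormat();
}

// Hypothetical codec that applies a special PostingsFormat to selected
// fields only and falls back to a default one everywhere else.
final class PerFieldCodec extends Codec {
    private final PostingsFormat defaultFormat;
    private final FieldsFormat fieldsFormat;
    private final Map<String, PostingsFormat> overrides = new HashMap<>();

    PerFieldCodec(PostingsFormat defaultFormat, FieldsFormat fieldsFormat) {
        this.defaultFormat = defaultFormat;
        this.fieldsFormat = fieldsFormat;
    }

    // Registers an override for one field and returns this for chaining.
    PerFieldCodec use(String field, PostingsFormat format) {
        overrides.put(field, format);
        return this;
    }

    @Override
    PostingsFormat postingsFormatForField(String field) {
        return overrides.getOrDefault(field, defaultFormat);
    }

    @Override
    FieldsFormat fieldsFormat() {
        return fieldsFormat;
    }
}
```

In this picture Pulsing or Memory would only ever appear as an override for a specific field, never as a Codec of their own.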
