lucene-dev mailing list archives

From "Robert Muir (Commented) (JIRA)" <>
Subject [jira] [Commented] (LUCENE-3490) Restructure codec hierarchy
Date Sun, 30 Oct 2011 16:48:32 GMT


Robert Muir commented on LUCENE-3490:

Thanks Uwe, I think once we get tests passing in the branch we should look at this.
Maybe we merge it and just recreate the branch to fix it, or maybe we fix it before
merging... doesn't matter to me.

I think this would simplify a lot of stuff in lucene too:
* tests-framework currently contains additional codecs that are used only in testing.
  This would work much nicer, because the test codecs would get loaded when running tests.
* Luke would be able to read your index, as long as you had the right stuff on the
  classpath. E.g. if I am debugging a test failure and want to inspect the index with
  Luke, it should all just work, as the test codecs would be available.
* We could remove the CodecProvider arguments from all these reading APIs (e.g.
  IndexReader). This is dumb right now: it's almost like fixing default charset problems,
  trying to track down how many places are using CodecProvider.getDefault(), when you
  have some custom codecs and you only want the String->Codec mappings to be visible so
  that your index can be read!
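
To make the last point concrete, here is a minimal sketch of what a classpath-driven String->Codec lookup could look like. It assumes java.util.ServiceLoader as the discovery mechanism, which this message does not specify; NamedCodec and CodecLookup are hypothetical names for illustration, not Lucene classes.

```java
import java.util.Map;
import java.util.ServiceLoader;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: resolves codec names to instances without any
// CodecProvider argument being passed around. Not Lucene's actual API.
final class CodecLookup {
    private static final Map<String, NamedCodec> BY_NAME = new ConcurrentHashMap<>();

    static {
        // Assumption: implementations are listed in META-INF/services files on
        // the classpath (real providers must be public with a no-arg constructor).
        // With no such files present, this loop simply finds nothing.
        for (NamedCodec codec : ServiceLoader.load(NamedCodec.class)) {
            BY_NAME.put(codec.name(), codec);
        }
    }

    private CodecLookup() {}

    /** Manual registration, used here so the sketch runs without service files. */
    static void register(NamedCodec codec) {
        BY_NAME.put(codec.name(), codec);
    }

    /** Looks up a codec by the name recorded in the index. */
    static NamedCodec forName(String name) {
        NamedCodec codec = BY_NAME.get(name);
        if (codec == null) {
            throw new IllegalArgumentException(
                "No codec named '" + name + "' on the classpath; known: " + BY_NAME.keySet());
        }
        return codec;
    }

    public static void main(String[] args) {
        register(new NamedCodec("SimpleText") {});
        System.out.println(forName("SimpleText").name());
    }
}

/** Hypothetical stand-in for a codec that knows its own name. */
abstract class NamedCodec {
    private final String name;

    protected NamedCodec(String name) {
        this.name = name;
    }

    final String name() {
        return name;
    }
}
```

With something along these lines, a reader would only need the codec name stored in the index, and tools such as Luke would pick up test codecs simply by having the test jar on the classpath.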
> Restructure codec hierarchy
> ---------------------------
>                 Key: LUCENE-3490
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Robert Muir
>             Fix For: 4.0
> Spinoff of LUCENE-2621. (Hoping we can do some of the renaming etc. here in a rote way
> to make progress.)
> Currently only represents a portion of the index, but there are other parts of the index
> (stored fields, term vectors, fieldinfos, ...) that we want under codec control. There is
> also some inconsistency about what a Codec is currently. For example, Memory and Pulsing
> are really PostingsFormats; you might just apply them to a specific field. On the other
> hand, PreFlex is a Codec: it represents the Lucene 3.x index format (just not all parts
> yet). I imagine we would like SimpleText to be the same way.
> So, I propose restructuring the classes so that we have something like:
> * CodecProvider <-- codec name to Class resolution only
> * Codec <-- represents the index format (PostingsFormat + FieldsFormat + ...)
> * PostingsFormat: this is what Codec controls today, and Codec will return one of these
>   for a field.
> * FieldsFormat: Stored Fields + Term Vectors + FieldInfos?
> I think for PreFlex, it doesn't make sense to expose its PostingsFormat as a 'public'
> class, because PreFlex can never be per-field, so there is no use in allowing you to
> configure PreFlex for a specific field.
> Similarly, I think we should do the same thing for SimpleText. Nobody needs SimpleText
> for production; it should just be a Codec where we try to make as much of the index as
> plain text and simple as possible for debugging/learning/etc.
> So we don't need to expose its PostingsFormat. On the other hand, I don't think we need
> Pulsing or Memory codecs, because it's pretty silly to make your entire index use one of
> their PostingsFormats. To draw a parallel with analysis: PostingsFormat is like Tokenizer
> and Codec is like Analyzer, and we don't need Analyzers to "show off" every Tokenizer.
> Later, once we abstract FieldInfos reading/writing out of o.a.l.index into codec control,
> we can also then move the baked-in PerFieldCodecWrapper out (it would basically be
> PerFieldPostingsFormat). Privately it would write the ids to the file like it does today.
> All the hairy 3.x backwards code would move to PreflexCodec. SimpleTextCodec would get a
> plain text fieldinfos impl, etc.
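
The proposed split in the issue description can be sketched roughly as follows. This is only an illustration of the shape being discussed, under the assumption that a Codec hands out a PostingsFormat per field plus a single FieldsFormat. Every type and signature below is hypothetical and does not reflect Lucene's actual classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed hierarchy; names and signatures are
// assumptions for illustration, not Lucene's real API.
final class CodecHierarchySketch {
    public static void main(String[] args) {
        Map<String, PostingsFormat> perField = new HashMap<>();
        perField.put("id", new NamedPostingsFormat("Pulsing"));
        Codec codec = new PerFieldCodec("MyCodec",
                new NamedPostingsFormat("Standard"), perField,
                new NamedFieldsFormat("Standard"));
        System.out.println(codec.postingsFormat("id").name());   // prints "Pulsing"
        System.out.println(codec.postingsFormat("body").name()); // prints "Standard"
    }
}

/** Per-field piece: roughly what Codec controls today. */
interface PostingsFormat { String name(); }

/** Stored fields + term vectors (+ possibly FieldInfos). */
interface FieldsFormat { String name(); }

/** The whole index format: PostingsFormat per field + FieldsFormat + ... */
abstract class Codec {
    private final String name;
    Codec(String name) { this.name = name; }
    final String name() { return name; }
    abstract PostingsFormat postingsFormat(String field);
    abstract FieldsFormat fieldsFormat();
}

final class NamedPostingsFormat implements PostingsFormat {
    private final String name;
    NamedPostingsFormat(String name) { this.name = name; }
    @Override public String name() { return name; }
}

final class NamedFieldsFormat implements FieldsFormat {
    private final String name;
    NamedFieldsFormat(String name) { this.name = name; }
    @Override public String name() { return name; }
}

/** Mixes formats such as Pulsing or Memory per field, with a default for the rest. */
final class PerFieldCodec extends Codec {
    private final PostingsFormat defaultPostings;
    private final Map<String, PostingsFormat> perField;
    private final FieldsFormat fields;

    PerFieldCodec(String name, PostingsFormat defaultPostings,
                  Map<String, PostingsFormat> perField, FieldsFormat fields) {
        super(name);
        this.defaultPostings = defaultPostings;
        this.perField = new HashMap<>(perField);
        this.fields = fields;
    }

    @Override PostingsFormat postingsFormat(String field) {
        return perField.getOrDefault(field, defaultPostings);
    }

    @Override FieldsFormat fieldsFormat() { return fields; }
}

/** PreFlex-style codec: its postings format stays private and is never per-field. */
final class PreFlexLikeCodec extends Codec {
    private final PostingsFormat postings = new NamedPostingsFormat("PreFlex");
    private final FieldsFormat fields = new NamedFieldsFormat("PreFlex");

    PreFlexLikeCodec() { super("PreFlex"); }

    @Override PostingsFormat postingsFormat(String field) { return postings; }
    @Override FieldsFormat fieldsFormat() { return fields; }
}
```

The Tokenizer/Analyzer analogy shows up in the two Codec subclasses: PerFieldCodec composes publicly available PostingsFormats, while PreFlexLikeCodec keeps its format as an internal detail, so there is nothing to configure per field.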
