lucene-dev mailing list archives

From Nicolas Lalevée (JIRA) <j...@apache.org>
Subject [jira] Updated: (LUCENE-662) Extendable writer and reader of field data
Date Sat, 03 Mar 2007 20:47:51 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Lalevée updated LUCENE-662:
-----------------------------------

    Attachment: indexFormat.patch

Patch update: synchronized with the trunk, plus new features.

* The index format now has an ID which is serialized in a new file in the directory. This new file is managed by the SegmentInfos class. It has been put in a separate file to keep me from breaking things, but it could be merged into the segments file. This new feature will help avoid opening an index with the wrong code. Like the index version, if the index format is not compatible, opening the index fails. It also fails when trying to use IndexWriter#addIndexes(). These compatibility issues are managed by the implementations of the index format: an implementation has to implement the function canRead(String indexFmtID). But I think something is still missing in this design. Saying that a format is compatible with another one is OK, but I still have to figure out whether it is really possible to write a reader which handles two different formats.
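As a sketch of what such a canRead() check could look like (the IndexFormat name and canRead(String indexFmtID) signature come from the patch; the implementing class and the ID strings are my own illustration):

```java
// Hypothetical illustration of the compatibility check described above.
// An index format exposes its own ID and decides which serialized
// format IDs it is able to read.
interface IndexFormat {
    String getFormatId();
    boolean canRead(String indexFmtID);
}

// A format that can read its own ID plus one known older ID.
class MyIndexFormat implements IndexFormat {
    public String getFormatId() { return "my-format-2"; }
    public boolean canRead(String indexFmtID) {
        return "my-format-2".equals(indexFmtID)
            || "my-format-1".equals(indexFmtID);
    }
}
```

Opening an index (or calling addIndexes()) would then fail fast when canRead() returns false for the ID stored in the directory.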

* When synchronizing with the trunk, I had trouble with the new FieldSelectorResult SIZE. This new feature expects the FieldsReader to know the size of the content of the field. With the generic FieldsReader, the data is only a sequence of bytes, so it cannot compute the size of the decoded data. I did a dumb implementation: it returns the size of the data in bytes. I know this is wrong, and the associated tests fail (I left them failing in the patch). It has to be fixed, and this may require some changes in the API I have designed.
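To illustrate the mismatch: with an opaque byte sequence the reader can only report the encoded length, which is not the size of the decoded value, e.g. for a UTF-8 string (this is my illustration, not the patch's code):

```java
import java.nio.charset.StandardCharsets;

// Illustration of the SIZE problem: for generic field data the reader
// only sees bytes, so it can report the encoded length, which differs
// from the size of the decoded value (here, a UTF-8 string).
class SizeMismatch {
    static int encodedSize(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }
    static int decodedSize(String value) {
        return value.length();
    }
}
```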

* There was a discussion on java-dev about changing the order of the postings. Today in the .frq file, the postings are ordered by document number. The proposal was to order them by frequency. So I worked a little on the mechanism I had built to generify the field storing, and applied it to the posting storing. This part of the patch is not well documented (nearly not at all) and is a draft. But it works (at least with the current implementation); the mechanism allows implementing a custom PostingReader and PostingWriter:

public interface PostingWriter {
  public void close() throws IOException;
  public long[] getPointers();
  public int getNbPointer();
  public long writeSkip(RAMOutputStream skipBuffer) throws IOException;
  public void write(int doc, int lastDoc, int nbPos, int[] positions) throws IOException;
}

public interface PostingReader {
  public void close() throws IOException;
  public TermDocs termDocs(BitVector deletedDocs, TermInfosReader tis, FieldInfos fieldInfos) throws IOException;
  public TermPositions termPositions(BitVector deletedDocs, TermInfosReader tis, FieldInfos fieldInfos) throws IOException;
}

Furthermore, this "generification" also allows an implementation that has been discussed many times: http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
Note that it does not break the current format. The .tis file is still managed internally by Lucene, and it holds pointers to some external files (managed by the IndexFormat). The implementation of the PostingReader/PostingWriter specifies how many pointers there are. The default one is 2: .frq and .prx. The FlexibleIndexing one would be 1.
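For illustration, a simplified view of that pointer bookkeeping (the class below is a stand-in of my own; only the PostingWriter interface above comes from the patch):

```java
// Simplified stand-in for a PostingWriter implementation: the default
// format writes two files (.frq and .prx), so it reports two pointers
// per term; a FlexibleIndexing-style single-file format would report one.
class TwoFilePostingPointers {
    private long frqPointer;
    private long prxPointer;

    int getNbPointer() { return 2; }

    long[] getPointers() { return new long[] { frqPointer, prxPointer }; }

    // Advance the pointers as if posting data had been written to each file.
    void recordWrite(int frqBytes, int prxBytes) {
        frqPointer += frqBytes;
        prxPointer += prxBytes;
    }
}
```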

* To show that the default implementation of the index format can be changed, I have created a new package org.apache.lucene.index.impl which holds the current index format:
- DefaultFieldData : the data part of Field
- DefaultFieldsReader : the non-generified part of the FieldsReader
- DefaultFieldsWriter : the non-generified part of the FieldsWriter
- DefaultIndexFormat : the factory of readers and writers
- DefaultPostingReader : just instantiates SegmentTermDocs and SegmentTermPositions
- DefaultPostingWriter : the posting-writing part of DocumentWriter
- SegmentTermDocs : just moved
- SegmentTermPositions : just moved
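The factory role of DefaultIndexFormat can be sketched like this (the Stub* classes are stand-ins I made up, since the real FieldsReader/FieldsWriter need Lucene internals):

```java
// Sketch of the factory idea: the index code asks the index format for
// its readers and writers instead of doing "new FieldsReader/Writer()".
interface FieldsIO {
    String role();
}

class StubFieldsReader implements FieldsIO {
    public String role() { return "reader"; }
}

class StubFieldsWriter implements FieldsIO {
    public String role() { return "writer"; }
}

// Stand-in for DefaultIndexFormat: the factory of readers and writers.
class StubIndexFormat {
    FieldsIO newFieldsReader() { return new StubFieldsReader(); }
    FieldsIO newFieldsWriter() { return new StubFieldsWriter(); }
}
```

Swapping the format then means swapping the factory, with IndexReader, IndexWriter and SegmentMerger left untouched.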

* Where I want to continue: I am mainly interested in the generic field storage, so I will continue to maintain it; I will try to fix the SIZE issue and will work on allowing readers to be compatible with each other. I am also interested in some generic index storage for faceted search. But I figured out that the indexed data would have to be stored at the document level, and this cannot be done with postings. So I don't think I will go further in playing with postings; I would rather look at LUCENE-584.


> Extendable writer and reader of field data
> ------------------------------------------
>
>                 Key: LUCENE-662
>                 URL: https://issues.apache.org/jira/browse/LUCENE-662
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Store
>            Reporter: Nicolas Lalevée
>            Priority: Minor
>         Attachments: entrytable.patch, generic-fieldIO-2.patch, generic-fieldIO-3.patch, generic-fieldIO-4.patch, generic-fieldIO-5.patch, generic-fieldIO.patch, indexFormat.patch
>
>
> As discussed on the dev mailing list, I have modified Lucene to allow defining how the data of a field is written to and read from the index.
> Basically, I have introduced the notion of IndexFormat. It is in fact a factory of FieldsWriter and FieldsReader. So the IndexReader, the IndexWriter and the SegmentMerger use this factory instead of doing a "new FieldsReader/Writer()".
> I have also introduced the notion of FieldData. It handles all the data of a field, and also the writing to and reading from a stream. I did it this way because in the current design of Lucene, Fieldable is an interface, so methods with a protected or package visibility cannot be defined.
> A FieldsWriter just writes data into a stream via the FieldData of the field.
> A FieldsReader instantiates a FieldData depending on the field name. Then it uses the field data to read the stream. And finally it instantiates a Field with the field data.
> About compatibility, I think it is kept, as I have written a DefaultIndexFormat that provides some DefaultFieldsWriter and DefaultFieldsReader. These implementations do exactly the job that is done today.
> To achieve this modification, some classes and methods had to be moved from private and/or final to public or protected.
> About the lazy fields, I have implemented them in a more general way in the implementation of the abstract class FieldData, so it will be totally transparent for the Lucene user who extends FieldData. The stream is kept in the FieldData and used as soon as stringValue() (or something else) is called. Implementing it this way allowed me to handle the recently introduced LOAD_FOR_MERGE; it is just a lazy field data, and when read() is called on this lazy field data, the saved input stream is directly copied to the output stream.
> I have one last issue with this patch. The current design allows reading an index in an old format and just doing a writer.addIndexes() into a new format. With the new design, you cannot, because the writer will use the FieldData.write provided by the reader.
> enjoy!

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

