lucene-dev mailing list archives

From Han Jiang <jiangha...@gmail.com>
Subject [GSoC] About how flexible indexing works in lucene 4.0
Date Mon, 26 Mar 2012 22:59:45 GMT
Hi all,

I was trying to figure out the control flow of IndexWriter and
IndexSearcher, in order to get a better understanding of the idea behind
Codec implementation.

However, I have some questions that are tied to specific code, which I
find inconvenient to discuss here.

Maybe it is better to explain how much I understand, and ask for your
comments?
Here is what I understand:

*Index time:*
--First of all, IndexWriter gets a Codec configuration from an
IndexWriterConfig.
--When IndexWriter.addDocument is called, an instance of
DocumentsWriterPerThread is created.
--It then passes the codec information through the indexing chain, and an
instance of FreqProxTermsWriterPerField is made to call flush().
--Then, based on the codec information, we create an instance of
TermsConsumer. After this, we iterate over each termID, get the
corresponding PostingsConsumer, and save the information for each document.
--Here, by subclassing TermsConsumer and PostingsConsumer, we make
IndexWriter create an index with new posting formats.
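
To make the index-time flow above concrete, here is a toy sketch of how I
picture the push-style consumer API. Every Toy* name is hypothetical and the
method shapes are my own simplification, not Lucene's real classes or
signatures; the point is only that the flush step drives the consumers, and
the consumer implementation decides what gets written.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-ins for the consumer API; NOT Lucene's real classes.
interface ToyPostingsConsumer {
    void startDoc(int docID, int freq);
    void finishDoc();
}

interface ToyTermsConsumer {
    ToyPostingsConsumer startTerm(String term);
    void finishTerm(String term, int docFreq);
}

// One possible "posting format": each term becomes one line of text.
final class ToyTextFormat implements ToyTermsConsumer, ToyPostingsConsumer {
    final List<String> lines = new ArrayList<>();
    private StringBuilder current;

    public ToyPostingsConsumer startTerm(String term) {
        current = new StringBuilder(term).append(" ->");
        return this;
    }

    public void startDoc(int docID, int freq) {
        current.append(' ').append(docID).append(':').append(freq);
    }

    public void finishDoc() {
        // Nothing to do per document in this toy format.
    }

    public void finishTerm(String term, int docFreq) {
        lines.add(current.append(" (df=").append(docFreq).append(')').toString());
    }
}

final class ToyFlush {
    // Plays the role of the flush step: walk the buffered terms in sorted
    // order and push every posting into whatever consumer was supplied.
    static void flush(TreeMap<String, TreeMap<Integer, Integer>> buffered,
                      ToyTermsConsumer out) {
        for (Map.Entry<String, TreeMap<Integer, Integer>> term : buffered.entrySet()) {
            ToyPostingsConsumer postings = out.startTerm(term.getKey());
            for (Map.Entry<Integer, Integer> doc : term.getValue().entrySet()) {
                postings.startDoc(doc.getKey(), doc.getValue()); // docID, term freq
                postings.finishDoc();
            }
            out.finishTerm(term.getKey(), term.getValue().size());
        }
    }
}
```

Swapping ToyTextFormat for another ToyTermsConsumer changes the on-disk
layout without touching ToyFlush, which is how I understand the role of a
codec at index time.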

*Query time:*
--Now, let's take phrase search as an example.
--When IndexSearcher.search(phraseQuery, topN) is called, an instance of
PhraseWeight is created to wrap the query terms.
--Then, IndexSearcher creates tasks that call PhraseWeight.scorer(), inside
which two instances, a Terms and a TermsEnum, are fetched from the
corresponding AtomicReader.
--With the help of the TermsEnum, for every word in the phrase, the related
docs and positions are fetched through a DocsAndPositionsEnum, and the
result is generated from them.
--Here, by subclassing TermsEnum and the related "*Enum" classes, we make
IndexSearcher (or IndexReader) understand our posting formats.
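
The query-time steps above can be sketched in the same toy style. This is
only my mental model of exact phrase matching over docs and positions:
ToyPhraseMatcher and its map-based postings are hypothetical, and a real
scorer would advance the enums in lockstep instead of using maps and binary
search.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;

// Hypothetical model of exact phrase matching; NOT Lucene's PhraseScorer.
final class ToyPhraseMatcher {

    // postings.get(i) maps docID -> sorted positions of the i-th phrase word,
    // i.e. what a docs-and-positions enumeration would yield for that word.
    static List<Integer> matchingDocs(List<SortedMap<Integer, int[]>> postings) {
        List<Integer> hits = new ArrayList<>();
        if (postings.isEmpty()) {
            return hits;
        }
        for (Map.Entry<Integer, int[]> first : postings.get(0).entrySet()) {
            int doc = first.getKey();
            for (int start : first.getValue()) {
                if (phraseStartsAt(postings, doc, start)) {
                    hits.add(doc);
                    break; // one occurrence is enough to report the doc
                }
            }
        }
        return hits;
    }

    // True if word i occurs at position start + i in doc, for every i >= 1.
    private static boolean phraseStartsAt(List<SortedMap<Integer, int[]>> postings,
                                          int doc, int start) {
        for (int i = 1; i < postings.size(); i++) {
            int[] positions = postings.get(i).get(doc);
            if (positions == null || Arrays.binarySearch(positions, start + i) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

A custom posting format only has to supply the docID and position data in
some enumerable form; the matching logic itself does not care how they were
encoded.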

And, here I have some questions:

1. Will multiple AtomicReaders be created if I run a search on an index
with several segments? If not, when will there be multiple AtomicReaders?
And to take the question further, what is the idea behind introducing
AtomicReader and CompositeReader in Lucene 4?

2. I must have missed something at query time, since the subtypes of
PostingsReaderBase are absent from what I explained. Is one created when
an instance of AtomicReader is fetched from the context? Where can I find
the related code?

3. The wiki page here <http://wiki.apache.org/lucene-java/FlexibleIndexing>
says we should provide an arbitrary skipDocs bit set during enumeration.
Then, does the posting list itself remain unchanged, even if I call
deleteDocuments()? Will deleted documents still remain in the postings
file, even after segments get merged?
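
To make question 3 concrete, this is the behaviour I am assuming, written as
a toy model. ToySegment is hypothetical and is not Lucene code. It uses the
inverted sense of the wiki's skipDocs (a set bit means "not deleted"), and
the merge step reflects my assumption that a merge rewrites the postings
with only the surviving documents. Please correct me if this is wrong.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical single-term segment; NOT a Lucene class.
final class ToySegment {
    final int[] postings;   // docIDs for one term, as written at flush time
    final BitSet liveDocs;  // bit set = document has not been deleted
    final int maxDoc;

    ToySegment(int[] postings, int maxDoc) {
        this.postings = postings;
        this.maxDoc = maxDoc;
        this.liveDocs = new BitSet(maxDoc);
        this.liveDocs.set(0, maxDoc); // every document starts out live
    }

    // Deleting only flips a bit; the postings array is left untouched.
    void delete(int docID) {
        liveDocs.clear(docID);
    }

    // Enumeration filters the unchanged postings through the bit set.
    List<Integer> enumerate() {
        List<Integer> out = new ArrayList<>();
        for (int doc : postings) {
            if (liveDocs.get(doc)) {
                out.add(doc);
            }
        }
        return out;
    }

    // Assumed merge behaviour: rewrite the postings, keeping only live
    // documents and renumbering them densely from zero.
    ToySegment mergeAlone() {
        int[] remap = new int[maxDoc];
        int next = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            remap[doc] = liveDocs.get(doc) ? next++ : -1;
        }
        List<Integer> live = enumerate();
        int[] rewritten = new int[live.size()];
        for (int i = 0; i < rewritten.length; i++) {
            rewritten[i] = remap[live.get(i)];
        }
        return new ToySegment(rewritten, next);
    }
}
```

In this model a deleted document stays in the postings until the segment is
rewritten, and only the bit set changes at delete time.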


Thank you.


-- 
Han Jiang

EECS, Peking University, China
Every Effort Creates Smile

Senior Student
