lucene-dev mailing list archives

From "Jason Rutherglen (JIRA)" <>
Subject [jira] Commented: (LUCENE-1292) Tag Index
Date Sat, 24 May 2008 15:53:55 GMT


Jason Rutherglen commented on LUCENE-1292:

Terms that have many docs will store the docs + skiplist in blocks.  This avoids having
to rewrite a large (multi-kilobyte) docs + skiplist for an update that alters only some of
the docs.  Only the blocks that change will be updated: they will be appended to the
transaction log and the in-memory file pointers updated.  When the transaction log reaches
a certain percentage of the size of the existing tag.tii file, the whole tag.tii file will
be rewritten.
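
The block update scheme above can be sketched roughly as follows.  This is a minimal
in-memory illustration, not code from the patch: the class name BlockedPostings, the lists
standing in for the tag.tii and tag.tlg files, measuring size in doc ids rather than bytes,
and the rewrite ratio passed to the constructor are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of block-level updates; none of these names come from LUCENE-1292. */
class BlockedPostings {
    /** In-memory pointer: which file holds the current version of a block, and where. */
    static final class Pointer {
        final boolean inLog; // true: transaction log (tag.tlg), false: main file (tag.tii)
        final int index;     // position in that file (stand-in for a byte offset)
        Pointer(boolean inLog, int index) { this.inLog = inLog; this.index = index; }
    }

    private List<int[]> mainFile = new ArrayList<>();         // stands in for tag.tii
    private final List<int[]> logFile = new ArrayList<>();    // stands in for tag.tlg
    private final List<Pointer> pointers = new ArrayList<>(); // one pointer per block
    private final double rewriteRatio;                        // assumed threshold
    int rewrites = 0;

    BlockedPostings(List<int[]> blocks, double rewriteRatio) {
        this.rewriteRatio = rewriteRatio;
        for (int[] block : blocks) {
            pointers.add(new Pointer(false, mainFile.size()));
            mainFile.add(block);
        }
    }

    /** Appends the new version of one block to the log and repoints the block to it. */
    void updateBlock(int blockId, int[] docs) {
        logFile.add(docs);
        pointers.set(blockId, new Pointer(true, logFile.size() - 1));
        if (size(logFile) >= rewriteRatio * size(mainFile)) {
            rewrite();
        }
    }

    /** Reads the current version of a block from whichever file its pointer names. */
    int[] readBlock(int blockId) {
        Pointer p = pointers.get(blockId);
        return (p.inLog ? logFile : mainFile).get(p.index);
    }

    boolean isInLog(int blockId) { return pointers.get(blockId).inLog; }

    private static long size(List<int[]> file) {
        long n = 0;
        for (int[] block : file) n += block.length;
        return n;
    }

    /** Rewrites the whole main file from the current block versions and empties the log. */
    private void rewrite() {
        List<int[]> merged = new ArrayList<>();
        for (int i = 0; i < pointers.size(); i++) merged.add(readBlock(i));
        mainFile = merged;
        logFile.clear();
        for (int i = 0; i < pointers.size(); i++) pointers.set(i, new Pointer(false, i));
        rewrites++;
    }
}
```

Untouched blocks are never copied on an update; only the full rewrite pays that cost, which
is why the threshold matters.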

When an iteration of TermEnum is being performed, the in-memory alterations are consulted.
If a term, for example, no longer has any docs, the term is skipped.  The TermDocs iteration
does the same by checking whether it should be reading from the tag.tii or the tag.tlg file
for the current block.  The block skipto and iteration code functions the same as MultiTermDocs.
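
As a rough illustration of the term enumeration described above, the sketch below overlays
in-memory doc counts on the on-disk ones and skips terms left with no docs.  The class and
method names are hypothetical, and a real TermEnum would presumably merge the two sources
lazily instead of building a merged map up front.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; not the TermEnum implementation from the patch. */
class OverlayTermEnum {
    /**
     * Returns the terms in sorted order, skipping any term whose current doc count is zero.
     * onDisk maps term to doc count as stored in the main file; inMemory holds the
     * alterations and takes precedence over onDisk.
     */
    static List<String> liveTerms(Map<String, Integer> onDisk, Map<String, Integer> inMemory) {
        TreeMap<String, Integer> merged = new TreeMap<>(onDisk);
        merged.putAll(inMemory); // in-memory alterations override the on-disk counts
        List<String> live = new ArrayList<>();
        for (Map.Entry<String, Integer> e : merged.entrySet()) {
            if (e.getValue() > 0) {
                live.add(e.getKey());
            }
        }
        return live;
    }
}
```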

The concern is the optimal number of blocks per term and the effect on skipto performance.
Because only 2 files are involved, the switching between files that may be an issue for
MultiTermDocs skipto over many segments should not be an issue here.  Seeks within the same
file are faster than seeks across multiple files.
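
To make the blocks-per-term trade-off concrete, here is a hedged sketch of a skipto over
blocks that carry StartDoc/EndDoc bounds like the BlockInfo entries in the format listing.
The names are hypothetical, the blocks are assumed to be already decoded and sorted, and the
method returns the doc id instead of a boolean as TermDocs.skipTo does.

```java
/** Hypothetical sketch of skipping across blocks; not code from the patch. */
class BlockSkipper {
    /** Mirrors the StartDoc/EndDoc bounds of a BlockInfo; docs holds the decoded doc ids. */
    static final class Block {
        final int startDoc;
        final int endDoc;
        final int[] docs;
        Block(int[] docs) {
            this.docs = docs;
            this.startDoc = docs[0];
            this.endDoc = docs[docs.length - 1];
        }
    }

    /** Returns the first doc id >= target, or -1 if no such doc exists. */
    static int skipTo(Block[] blocks, int target) {
        // Binary search for the first block whose EndDoc is >= target.
        int lo = 0, hi = blocks.length - 1, found = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (blocks[mid].endDoc >= target) {
                found = mid;
                hi = mid - 1;
            } else {
                lo = mid + 1;
            }
        }
        if (found < 0) {
            return -1;
        }
        // Linear scan inside the one block that can contain the answer.
        for (int doc : blocks[found].docs) {
            if (doc >= target) {
                return doc;
            }
        }
        return -1; // not reached: endDoc >= target guarantees a match above
    }
}
```

More blocks per term make each update cheaper but lengthen the search over block bounds;
fewer blocks do the opposite.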

TermInfos --> <TermInfo>^TermCount
TagTermInfo --> <Term, DocFreq, NumBlocks>
Term --> <PrefixLength, Suffix, FieldNum, TermNumber>
BlockInfo --> <DocsBytesLength, SkipBytesLength,StartDoc,EndDoc>
Block --> <DocDeltas,SkipData>


Term --> <TermString>
BlockInfo --> <DocsBytesLength, SkipBytesLength,StartDoc,EndDoc>
Block --> <DocDeltas,SkipData>
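
The comment does not say how DocDeltas are encoded.  Assuming the variable-length VInt
delta encoding used elsewhere in Lucene's file formats (7 bits per byte, high bit set when
more bytes follow), one block's doc ids could be written and read roughly like this.  The
class name is hypothetical, and skip data and any per-doc frequency bits are left out.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of VInt delta coding for one block's doc ids. */
class DocDeltaCodec {
    /** Encodes ascending doc ids as VInt deltas from the previous doc id (starting at 0). */
    static byte[] encode(int[] sortedDocs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int doc : sortedDocs) {
            int delta = doc - prev;
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80); // low 7 bits, continuation bit set
                delta >>>= 7;
            }
            out.write(delta); // final byte, continuation bit clear
            prev = doc;
        }
        return out.toByteArray();
    }

    /** Decodes count doc ids from bytes produced by encode. */
    static int[] decode(byte[] bytes, int count) {
        int[] docs = new int[count];
        int pos = 0;
        int prev = 0;
        for (int i = 0; i < count; i++) {
            int b = bytes[pos++];
            int delta = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
            }
            prev += delta;
            docs[i] = prev;
        }
        return docs;
    }
}
```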

> Tag Index
> ---------
>                 Key: LUCENE-1292
>                 URL:
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.3.1
>            Reporter: Jason Rutherglen
> The tag index addresses two problems: slow field cache loading and range queries, and
> the need to reindex an entire document to update fields that are not tokenized.
> The tag index holds untokenized terms with a docfreq of 1 in a term-dictionary-like index
> file.  The file also stores the docs per term, similar to LUCENE-1278.  The index also has
> a transaction log and an in-memory index for realtime updates to the tags.  The transaction
> log is periodically merged into the existing tag term dictionary index file.
> The TagIndexReader extends IndexReader and is unified with a regular index by ParallelReader.
> There is a doc-id-to-terms skip pointer file for the IndexReader.document method.  This file
> contains a pointer for looking up the terms for a document.
> There is a higher-level class that encapsulates writing a document with tag fields to
> IndexWriter and TagIndexWriter.  This requires a hook into IndexWriter to coordinate doc ids
> and flushing segments to disk.
> The writer class could be as simple as:
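> {code}
> import org.apache.lucene.index.Term;
> import org.apache.lucene.search.DocIdSetIterator;
>
> public class TagIndexWriter {
>   // Adds the tag term to each doc id the iterator yields (body left open in the proposal).
>   public void add(Term term, DocIdSetIterator iterator) {
>   }
>   // Removes the tag term from each doc id the iterator yields (body left open in the proposal).
>   public void delete(Term term, DocIdSetIterator iterator) {
>   }
> }
> {code}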

