lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Eliminate postings hash (was Re: improve how IndexWriter uses RAM...)
Date Thu, 05 Apr 2007 14:31:20 GMT

On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:

> The one thing that still baffles me is: I can't get a persistent
> Posting hash to be any faster.

Don't use a hash, then.  :)

KS doesn't.

   * Give Token a "position" member.
   * After you've accumulated all the Tokens, calculate the
     position for each Token from its position increment.
   * Arrange the postings in an array sorted by position.
   * Count the number of postings in a row with identical
     text to get freq.

Relevant code from KinoSearch::Analysis::TokenBatch below.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

------------------------------------------------------------

void
TokenBatch_invert(TokenBatch *self)
{
     Token **tokens = (Token**)self->elems;
     Token **limit  = tokens + self->size;
     i32_t   token_pos = 0;

     /* thwart future attempts to append */
     if (self->inverted)
         CONFESS("TokenBatch has already been inverted");
     self->inverted = true;

     /* assign token positions */
     for ( ;tokens < limit; tokens++) {
         (*tokens)->pos = token_pos;
         token_pos += (*tokens)->pos_inc;
     }

     /* sort the tokens lexically, and hand off to cluster counting routine */
     qsort(self->elems, self->size, sizeof(Token*), Token_compare);
     count_clusters(self);
}

static void
count_clusters(TokenBatch *self)
{
     Token **tokens      = (Token**)self->elems;
     u32_t  *counts      = CALLOCATE(self->size + 1, u32_t);
     u32_t   i;

     /* save the cluster counts */
     self->cluster_counts_size = self->size;
     self->cluster_counts = counts;

     for (i = 0; i < self->size; ) {
         Token *const base_token = tokens[i];
         char  *const base_text  = base_token->text;
         const size_t base_len   = base_token->len;
         u32_t j = i + 1;

         /* iterate through tokens until text doesn't match */
         while (j < self->size) {
             Token *const candidate = tokens[j];

             if (   (candidate->len == base_len)
                 && (memcmp(candidate->text, base_text, base_len) == 0)
             ) {
                 j++;
             }
             else {
                 break;
             }
         }

         /* record a count at the position of the first token in the cluster */
         counts[i] = j - i;

         /* start the next loop at the next token we haven't seen */
         i = j;
     }
}

Token**
TokenBatch_next_cluster(TokenBatch *self, u32_t *count)
{
     Token **cluster = (Token**)(self->elems + self->cur);

     if (self->cur == self->size) {
         *count = 0;
         return NULL;
     }

     /* don't read past the end of the cluster counts array */
     if (!self->inverted)
         CONFESS("TokenBatch not yet inverted");
     if (self->cur > self->cluster_counts_size)
         CONFESS("Tokens were added after inversion");

     /* place cluster count in passed-in var, advance bookmark */
     *count = self->cluster_counts[ self->cur ];
     self->cur += *count;

     return cluster;
}
