lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From bugzi...@apache.org
Subject DO NOT REPLY [Bug 31882] - [PATCH] FuzzyTermEnum optimization and refactor
Date Tue, 26 Oct 2004 22:58:21 GMT
DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=31882>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=31882

[PATCH] FuzzyTermEnum optimization and refactor





------- Additional Comments From jhager@gmail.com  2004-10-26 22:58 -------
The "synchronized" was added to the similarity() because it uses the d[][]
across calls to it.  If similarity() was called on the SAME FuzzyTermEnum object
by two different threads, the results could not be guaranteed.  How lucene uses
FuzzyTermEnum, this will never happen.  Currently, Lucene creates it and let's
it be garbage collected within a single method.  This is more defensive
programming than anything else.

The TYPICAL_LONGEST_WORD_IN_INDEX is used in two different context.  The first
context is to initializing the d[][] to something smarter than [x][1].  It
should save very little time as d[][] will quickly grow to the appropriate size.
 The second context is to create a lookup table for the max distance, so that it
does not have to recalculate what the max distance is for every word.  The max
distance is a constant for every word of the same length.  I also only saw
marginal gain from this change.  

I went back to estimate the acutal amount of time saved from this change in
isolation.  On a modern machine, it is less than 200ms for 10 million words (see
test below).  

public class Scratch
{
  private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19;
  private static final float[] TYPICAL_WORD_LENGTH_DISTRIBUTION = {
   0, 0.01f, 0.015f, 0.025f, 0.05f, 0.10f, 0.125f, 0.15f, 0.155f, 0.12f, 0.10f,
0.07f, 0.04f, 0.015f, 0.010f, 0.005f, 0.005f, 0.005f
  };

  private final int[] maxDistances = new int[TYPICAL_LONGEST_WORD_IN_INDEX];

  public Scratch()
  {
    for (int i = 0; i < maxDistances.length; i++)
    {
      maxDistances[i] = calculateMaxDistance(i);
    }
  }

  private final int getMaxDistance(int m){
    return (m < maxDistances.length) ? maxDistances[m] : calculateMaxDistance(m);
  }

  private int calculateMaxDistance(int m){
    return (int) ((1-0.5) * (Math.min(7, m) + 0));
  }

  public static void main(String[] args)
  {    
    long start = System.currentTimeMillis();
    Scratch s = new Scratch();
    final int totalNumberOfRuns = 10000000;
    for (int i = 0; i < TYPICAL_WORD_LENGTH_DISTRIBUTION.length; i++)
    {
      float v = TYPICAL_WORD_LENGTH_DISTRIBUTION[i];
      final int runsForASingleLength = (int) (v * totalNumberOfRuns);
      //System.out.println(runsForASingleLength);
      for (int j=0; j < runsForASingleLength; j++) {
        s.getMaxDistance(i);
      }
    }

    long total = System.currentTimeMillis() - start;

    System.out.println("Time to lookup the records (in ms): " + total);

    long start2 = System.currentTimeMillis();

    for (int i = 0; i < TYPICAL_WORD_LENGTH_DISTRIBUTION.length; i++)
    {
      float v = TYPICAL_WORD_LENGTH_DISTRIBUTION[i];
      final int runsForASingleLength = (int) (v * totalNumberOfRuns);
      for (int j=0; j < runsForASingleLength; j++) {
        s.calculateMaxDistance(i);
      }
    }

     long total2 = System.currentTimeMillis() - start2;

    System.out.println("Time to recalculate the records (in ms): " + total2);
  }

}

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Mime
View raw message