lucene-java-user mailing list archives

From "Kevin A. Burton" <bur...@newsmonster.org>
Subject Documents with 1 word are given unfair lengthNorm()
Date Thu, 28 Oct 2004 01:14:25 GMT
WRT my blog post:

It seems the problem is that the distribution for lengthNorm() starts at 
1 and moves down from there, so a 1-word document gets the highest norm.  
Returning a constant 1.0f would work, but then HUGE documents would not 
be normalized and so would distort the results.

What would you think of using this implementation for lengthNorm:

>     public float lengthNorm( String fieldName, int numTokens ) {
>
>         // Field length (in tokens) after which the norm starts to decay.
>         final int THRESHOLD = 50;
>
>         if ( numTokens <= THRESHOLD ) {
>             // Short fields: rises from ~0.29 (1 token) to ~0.86 (50 tokens).
>             return 1 - (float)(1.0 / Math.sqrt(numTokens + 1));
>         }
>
>         // Longer fields: 1.0 at THRESHOLD + 1 tokens, then decays as 1/sqrt.
>         return (float)(1.0 / Math.sqrt(numTokens - THRESHOLD));
>     }
>
This starts the distribution low, rises to about 0.86 when 50 terms are 
in the document, jumps to 1.0 at 51 terms, and then moves asymptotically 
toward zero from there on out based on sqrt.

For example, values from 1 to 149 yield (I'd graph this out but I'm 
too lazy):

> 1 - 0.29289323
> 2 - 0.42264974
> 3 - 0.5
> 4 - 0.5527864
> 5 - 0.5917517
> 6 - 0.6220355
> 7 - 0.6464466
> 8 - 0.6666666
> 9 - 0.6837722
> 10 - 0.69848865
> 11 - 0.7113249
> 12 - 0.72264993
> 13 - 0.73273873
> 14 - 0.74180114
> 15 - 0.75
> 16 - 0.7574644
> 17 - 0.7642977
> 18 - 0.7705843
> 19 - 0.7763932
> 20 - 0.7817821
> 21 - 0.7867993
> 22 - 0.7914856
> 23 - 0.79587585
> 24 - 0.8
> 25 - 0.80388385
> 26 - 0.8075499
> 27 - 0.81101775
> 28 - 0.81430465
> 29 - 0.81742585
> 30 - 0.8203947
> 31 - 0.8232233
> 32 - 0.82592237
> 33 - 0.8285014
> 34 - 0.83096915
> 35 - 0.8333333
> 36 - 0.83560103
> 37 - 0.83777857
> 38 - 0.8398719
> 39 - 0.8418861
> 40 - 0.84382623
> 41 - 0.8456966
> 42 - 0.8475014
> 43 - 0.84924436
> 44 - 0.8509288
> 45 - 0.852558
> 46 - 0.85413504
> 47 - 0.85566247
> 48 - 0.85714287
> 49 - 0.8585786
> 50 - 0.859972
> 51 - 1.0
> 52 - 0.70710677
> 53 - 0.57735026
> 54 - 0.5
> 55 - 0.4472136
> 56 - 0.4082483
> 57 - 0.37796447
> 58 - 0.35355338
> 59 - 0.33333334
> 60 - 0.31622776
> 61 - 0.30151135
> 62 - 0.28867513
> 63 - 0.2773501
> 64 - 0.26726124
> 65 - 0.2581989
> 66 - 0.25
> 67 - 0.24253562
> 68 - 0.23570226
> 69 - 0.22941573
> 70 - 0.2236068
> 71 - 0.2182179
> 72 - 0.21320072
> 73 - 0.2085144
> 74 - 0.20412415
> 75 - 0.2
> 76 - 0.19611613
> 77 - 0.19245009
> 78 - 0.18898223
> 79 - 0.18569534
> 80 - 0.18257418
> 81 - 0.1796053
> 82 - 0.17677669
> 83 - 0.17407766
> 84 - 0.17149858
> 85 - 0.16903085
> 86 - 0.16666667
> 87 - 0.16439898
> 88 - 0.16222142
> 89 - 0.16012815
> 90 - 0.15811388
> 91 - 0.15617377
> 92 - 0.15430336
> 93 - 0.15249857
> 94 - 0.15075567
> 95 - 0.1490712
> 96 - 0.14744195
> 97 - 0.145865
> 98 - 0.14433756
> 99 - 0.14285715
> 100 - 0.14142136
> 101 - 0.14002801
> 102 - 0.13867505
> 103 - 0.13736056
> 104 - 0.13608277
> 105 - 0.13483997
> 106 - 0.13363062
> 107 - 0.13245323
> 108 - 0.13130644
> 109 - 0.13018891
> 110 - 0.12909944
> 111 - 0.12803689
> 112 - 0.12700012
> 113 - 0.12598816
> 114 - 0.125
> 115 - 0.12403473
> 116 - 0.12309149
> 117 - 0.12216944
> 118 - 0.12126781
> 119 - 0.120385855
> 120 - 0.11952286
> 121 - 0.11867817
> 122 - 0.11785113
> 123 - 0.11704115
> 124 - 0.11624764
> 125 - 0.11547005
> 126 - 0.114707865
> 127 - 0.11396058
> 128 - 0.1132277
> 129 - 0.11250879
> 130 - 0.1118034
> 131 - 0.11111111
> 132 - 0.11043153
> 133 - 0.10976426
> 134 - 0.10910895
> 135 - 0.10846523
> 136 - 0.107832775
> 137 - 0.107211255
> 138 - 0.10660036
> 139 - 0.10599979
> 140 - 0.10540926
> 141 - 0.104828484
> 142 - 0.1042572
> 143 - 0.10369517
> 144 - 0.10314213
> 145 - 0.10259783
> 146 - 0.10206208
> 147 - 0.10153462
> 148 - 0.101015255
> 149 - 0.10050378
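
The curve above can be reproduced outside Lucene with a small standalone class. This is only a sketch: the class name LengthNormSketch and the method names are made up for illustration, the unused fieldName parameter is dropped, and the 1/sqrt(numTokens) baseline is assumed to be the stock lengthNorm() behaviour being compared against.

```java
public class LengthNormSketch {

    // Hypothetical constant name; the value 50 comes from the proposal above.
    static final int THRESHOLD = 50;

    // Assumed baseline: 1/sqrt(numTokens), which starts at 1.0 for a
    // one-token field and only moves down from there.
    public static float defaultNorm(int numTokens) {
        return (float) (1.0 / Math.sqrt(numTokens));
    }

    // The proposed curve: low for very short fields, rising up to
    // THRESHOLD tokens, then decaying as 1/sqrt(numTokens - THRESHOLD).
    public static float proposedNorm(int numTokens) {
        if (numTokens <= THRESHOLD) {
            return 1 - (float) (1.0 / Math.sqrt(numTokens + 1));
        }
        return (float) (1.0 / Math.sqrt(numTokens - THRESHOLD));
    }

    public static void main(String[] args) {
        // Print both norms side by side for a few field lengths,
        // including the points on either side of the threshold.
        for (int n : new int[] {1, 3, 50, 51, 52, 149}) {
            System.out.println(n + " - default " + defaultNorm(n)
                    + ", proposed " + proposedNorm(n));
        }
    }
}
```

Comparing the rows for 50 and 51 tokens shows the step at the threshold, which may or may not be desirable for ranking.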


-- 

Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask me for an 
invite!  Also see irc.freenode.net #rojo if you want to chat.

Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html

If you're interested in RSS, Weblogs, Social Networking, etc... then you 
should work for Rojo!  If you recommend someone and we hire them you'll 
get a free iPod!
    
Kevin A. Burton, Location - San Francisco, CA
       AIM/YIM - sfburtonator,  Web - http://peerfear.org/
GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412



