lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <dave-lucene-u...@tropo.com>
Subject stop words and index size
Date Fri, 14 Jan 2005 00:03:57 GMT
Does anyone know how much stop words are supposed to affect the index size?

I did an experiment of building an index once with, and once without, 
stop words.

The corpus is the English Wikipedia, and I indexed the title and body of 
the articles. I used a list of 525 stop words.

With stopwords removed the index is 227MB.
With stopwords kept the index is 331MB.

Thus, the index grows by 45% in this case, which I found suprising, as I 
expected it to not grow as much. I haven't dug into the details of the 
Lucene file formats but thought compression (field/term vector/sparse 
lists/ vints) would negate the affect of stopwords to a large extent.

Some more details + a link to my stopword list are here:
http://www.searchmorph.com/weblog/index.php?id=36

-- Dave

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message