lucene-java-user mailing list archives

From Bob Carpenter
Subject Re: word frequency list? (all our n-grams are belong to you)
Date Tue, 21 Nov 2006 23:25:22 GMT
Jason Pump wrote:
> Is there a large list of words and their frequency in the English
> language? Obviously it would differ by corpus but I would like to see 
> what's already available.

Frequencies will vary by corpus, but the best resource
available right now for English on the web is Google's
n-gram corpus.  Here's
their blog entry describing it:

It was collected from 1,024,908,267,229 words of text,
and contains 13,588,391 distinct words that appeared
at least 200 times.  It also includes n-grams
up to order 5.  (It ships on 6 compressed DVDs!)
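
For anyone who wants to build a similar (much smaller) list from
their own text, here is a minimal sketch of counting n-grams up to
a given order and dropping the rare ones.  The class name
NGramCounter, the method count, and its parameters are made up for
illustration, and the single cutoff applied to every order is a
simplification.  This is not how the Google data was produced.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or of the Google release.
class NGramCounter {

    // Counts every n-gram of order 1..maxOrder in tokens, then removes
    // those seen fewer than minCount times.  N-grams are keyed by their
    // tokens joined with single spaces.
    static Map<String, Integer> count(List<String> tokens, int maxOrder, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder gram = new StringBuilder();
            // Extend the n-gram one token at a time from this start position.
            for (int n = 1; n <= maxOrder && start + n <= tokens.size(); n++) {
                if (n > 1) {
                    gram.append(' ');
                }
                gram.append(tokens.get(start + n - 1));
                counts.merge(gram.toString(), 1, Integer::sum);
            }
        }
        // Apply the frequency cutoff (assumed to be the same for all orders).
        counts.values().removeIf(c -> c < minCount);
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("to be or not to be".split(" "));
        // Unigrams and bigrams that occur at least twice.
        System.out.println(count(tokens, 2, 2));
    }
}
```

This keeps all counts in memory, so it only suits small corpora;
at web scale you would need to sort and merge counts on disk.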

Hats off to Alex Franz and Thorsten Brants
for releasing this (they're two computational
linguistics researchers at Google).

Here's where to get it:

- Bob Carpenter
