lucene-java-user mailing list archives

From Bob Carpenter <c...@alias-i.com>
Subject Re: word frequency list? (all our n-grams are belong to you)
Date Tue, 21 Nov 2006 23:25:22 GMT
Jason Pump wrote:
> Is there a large list of words and their frequency in the English
> language? Obviously it would differ by corpus but I would like to see
> what's already available.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

Frequencies will vary by corpus, but the best resource
available now for English as used on the web is Google's
n-gram corpus.  Here's their blog entry describing it:

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

It was collected from 1,024,908,267,229 words of
running text, with 13,588,391 distinct words that each
appeared at least 200 times.  It also includes counts for
n-grams up to order 5.  (It's on 6 compressed DVDs!)
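
For anyone who wants to use such a list from Java, here is a
minimal sketch of loading word counts and turning them into
relative frequencies.  It assumes one entry per line in the
form word<TAB>count; check the corpus documentation for the
actual file layout.  The class name, method names, and sample
counts are made up for illustration.  Counts are held as long
because a total near a trillion overflows int.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

public class WordFrequencies {

    // Parses lines of the ASSUMED form "word<TAB>count" into a map.
    // Lines without a tab are skipped; repeated words have their counts summed.
    public static Map<String, Long> readCounts(Reader in) {
        Map<String, Long> counts = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.lastIndexOf('\t');
                if (tab < 0) {
                    continue;
                }
                String word = line.substring(0, tab);
                long count = Long.parseLong(line.substring(tab + 1).trim());
                counts.merge(word, count, Long::sum);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    // Relative frequency = count / total; 0.0 for a word not in the list.
    public static double relativeFrequency(Map<String, Long> counts,
                                           String word, long total) {
        return counts.getOrDefault(word, 0L) / (double) total;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the real corpus.
        String sample = "the\t500\nindex\t7\nlucene\t3\n";
        Map<String, Long> counts = readCounts(new StringReader(sample));

        long total = 0;
        for (long c : counts.values()) {
            total += c;
        }
        System.out.println("the: " + relativeFrequency(counts, "the", total));
    }
}
```

For the real data you would pass a reader over the decompressed
unigram file instead of the StringReader, and with millions of
distinct words the map will need a correspondingly large heap.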

Hats off to Alex Franz and Thorsten Brants
for releasing this (they're two computational
linguistics researchers at Google).

Here's where to get it:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

- Bob Carpenter
   Alias-i
