Jason Pump wrote:
> Is there a large list of words and their frequency in the english
> language? Obviously it would differ by corpus but I would like to see
> what's already available.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
Frequencies will vary by corpus, but the best
resource available now for English web text is Google's
n-gram corpus. Here's their blog entry describing it:
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
It was collected from 1,024,908,267,229 words of text,
with 13,588,391 distinct words that appeared
at least 200 times. It also includes n-grams
up to order 5. (It's on 6 compressed DVDs!)
Hats off to Alex Franz and Thorsten Brants
for releasing this (they're two computational
linguistics researchers at Google).
Here's where to get it:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
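Once you have a word/count list from any corpus, turning it
into relative frequencies is just a division by the total.
Here's a minimal Python sketch. It assumes one
"word<TAB>count" entry per line; check the corpus documentation
for the real file layout. The function names and the tiny
sample data are made up for illustration.

```python
import io


def load_counts(lines):
    """Parse 'word<TAB>count' lines into a dict of word -> count.

    The tab-separated layout is an assumption, not something
    confirmed for any particular corpus release.
    """
    counts = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the LAST tab so entries containing spaces stay intact.
        word, _, count = line.rpartition("\t")
        counts[word] = counts.get(word, 0) + int(count)
    return counts


def relative_frequencies(counts):
    """Divide each count by the total of all listed counts."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Made-up sample data standing in for a real count file.
sample = io.StringIO("the\t500\nof\t300\nlucene\t200\n")
counts = load_counts(sample)
freqs = relative_frequencies(counts)
```

Note that dividing by the sum of the listed counts ignores
any words that fell below the corpus cutoff, so if the corpus
publishes its total token count, divide by that instead.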
- Bob Carpenter
Alias-i