lucene-java-user mailing list archives

From "Boris Aleksandrovsky" <balek...@gmail.com>
Subject Re: word frequency list?
Date Thu, 31 Aug 2006 21:27:56 GMT
Jason,

You can look here:

http://www.cs.ualberta.ca/~lindek/downloads.htm

for word frequency counts from a 1.5B-word corpus (TREC disks 1-5 and the
Reuters corpus <http://about.reuters.com/researchandstandards/corpus/>). The
words are normalized as follows: ALL CAP words are prefixed with a_ and
Capitalized words are prefixed with c_ after downcasing. All digits are
replaced with 0.
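
To look up your own terms in a list normalized this way, you would need to apply the same transformation first. Here is a minimal sketch in Python, based only on the description above. The function name `normalize` is made up for illustration, and the edge cases are my assumptions, not something the download page specifies: a single capital letter is treated as Capitalized rather than ALL CAP, mixed-case words such as "McDonald" are treated as Capitalized, and both prefixed forms are downcased.

```python
import re


def normalize(token: str) -> str:
    """Normalize one token following the scheme described above.

    Assumptions (not confirmed by the source): single capital letters
    and mixed-case words count as Capitalized, and both the a_ and c_
    forms are downcased.
    """
    # Replace every ASCII digit with 0.
    token = re.sub(r"[0-9]", "0", token)

    letters = [ch for ch in token if ch.isalpha()]

    # ALL CAP word: at least two letters, all of them uppercase.
    if len(letters) > 1 and all(ch.isupper() for ch in letters):
        return "a_" + token.lower()

    # Capitalized word: the first character is uppercase.
    if token[:1].isupper():
        return "c_" + token.lower()

    # Already lowercase (or not a letter-initial token): leave as-is.
    return token


if __name__ == "__main__":
    for word in ["IBM", "Reuters", "word", "1999"]:
        print(word, "->", normalize(word))
```

Check a few entries in the downloaded file against this before relying on it, since the actual list may treat the edge cases differently.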

Cheers,
Boris

On 8/30/06, Jason Pump <jpump@mindspring.com> wrote:
>
> Is there a large list of words and their frequency in the English
> language? Obviously it would differ by corpus, but I would like to see
> what's already available.


-- 
Thanks,

Boris
