lucene-solr-user mailing list archives

From "Burton-West, Tom" <tburt...@umich.edu>
Subject RE: TermsCompoment + Dist. Search + Large Index + HEAP SPACE
Date Tue, 26 Apr 2011 15:27:20 GMT
Don't know your use case, but if you just want a list of the 400 most common words, you can
use the Lucene contrib HighFreqTerms.java with the -t flag.  You have to point it at your
Lucene index.  You also probably don't want Solr to be running, and you'll want to give the JVM
running HighFreqTerms a lot of memory.

http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/misc/src/java/org/apache/lucene/misc/HighFreqTerms.java?view=log
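[Editor's note: a rough sketch of the invocation described above. The jar names, heap size, index path, and field name are assumptions, not from the original message; check the usage string printed by the class itself.]

```shell
# Sketch only: jar versions, -Xmx value, index path, and field name are placeholders.
# -t reports terms ordered by total term frequency rather than document frequency.
java -Xmx8g -cp lucene-core-3.1.0.jar:lucene-misc-3.1.0.jar \
  org.apache.lucene.misc.HighFreqTerms /path/to/lucene/index -t 400 ocr_text
```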

Tom
http://www.hathitrust.org/blogs/large-scale-search

-----Original Message-----
From: mdz-munich [mailto:sebastian.lutze@bsb-muenchen.de] 
Sent: Tuesday, April 26, 2011 9:29 AM
To: solr-user@lucene.apache.org
Subject: TermsCompoment + Dist. Search + Large Index + HEAP SPACE

Hi!

We've got one index split into 4 shards of roughly 70,000 records each, containing large
full-text data from (very dirty) OCR. Thus we've got a lot of "unique" terms.
Now we're trying to obtain the 400 most common words for the CommonGramsFilter
via the TermsComponent, but the request always runs out of memory. The VM is
equipped with 32 GB of RAM, with 16-26 GB allocated to the JVM.

Any ideas how to get the most common terms without increasing the VM's memory?
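[Editor's note: a distributed TermsComponent request of the kind described would look roughly like the following. The host names and the field name "ocr_text" are placeholders, not taken from the original message.]

```shell
# Sketch only: hosts and field name are assumptions. terms.sort=count returns the
# highest-frequency terms first; shards.qt routes the sub-requests to the terms handler.
curl "http://shard1:8983/solr/terms?terms=true&terms.fl=ocr_text\
&terms.sort=count&terms.limit=400&shards.qt=/terms\
&shards=shard1:8983/solr,shard2:8983/solr,shard3:8983/solr,shard4:8983/solr"
```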
 
Thanks & best regards,

Sebastian 

--
View this message in context: http://lucene.472066.n3.nabble.com/TermsCompoment-Dist-Search-Large-Index-HEAP-SPACE-tp2865609p2865609.html
Sent from the Solr - User mailing list archive at Nabble.com.