lucene-solr-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: How might one search for dupe IDs other than faceting on the ID field?
Date Tue, 30 Jul 2013 20:14:04 GMT
The Solr SignatureUpdateProcessorFactory is designed to facilitate
deduplication. Is there any particular reason you did not use it?

See:
http://wiki.apache.org/solr/Deduplication

and

https://cwiki.apache.org/confluence/display/solr/De-Duplication

And I give a bunch of examples in my book.
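
For readers unfamiliar with signature-based deduplication, the general idea can be sketched in a few lines of Python. This is a conceptual illustration only, not Solr's implementation: in Solr the signature is computed at index time by the update processor chain configured in solrconfig.xml. The helper names `signature` and `group_duplicates`, the sample field names, and the choice of MD5 are assumptions made for this sketch.

```python
import hashlib
from collections import defaultdict


def signature(doc, fields):
    """Return a hex digest computed over the selected fields of one document.

    Hypothetical helper: it hashes the chosen field values in a fixed order,
    which is the basic idea behind signature-based dedupe.
    """
    h = hashlib.md5()
    for name in fields:
        h.update(str(doc.get(name, "")).encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def group_duplicates(docs, fields):
    """Group documents by signature and keep only groups with more than one member."""
    groups = defaultdict(list)
    for doc in docs:
        groups[signature(doc, fields)].append(doc)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    sample = [
        {"id": "1", "name": "alpha"},
        {"id": "2", "name": "beta"},
        {"id": "1", "name": "alpha"},
    ]
    for group in group_duplicates(sample, ["id", "name"]):
        print(len(group), "documents share a signature:", group[0])
```

Computing the signature when a document is added means duplicates are caught as they arrive, rather than by scanning the whole index afterwards.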

-- Jack Krupansky

-----Original Message----- 
From: Dotan Cohen
Sent: Tuesday, July 30, 2013 2:16 PM
To: solr-user@lucene.apache.org
Subject: How might one search for dupe IDs other than faceting on the ID field?

To search for duplicate IDs, I am running the following query:
select?q=*:*&facet=true&facet.field=id&rows=0

However, since upgrading from Solr 4.1 to Solr 4.3 I am receiving an
OutOfMemoryError instead of the desired facet counts:

<response><lst name="error"><str
name="msg">java.lang.OutOfMemoryError: Java heap space</str><str
name="trace">java.lang.RuntimeException: java.lang.OutOfMemoryError:
Java heap space
    at org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilter.java:670)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:380)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at ...

Might there be a less resource-intensive way to get this information?
This is Solr 4.3 running on Ubuntu Server 12.04 in Jetty. The index
has over 100,000,000 small records, for a total of about 95 GiB of
disk space, with Solr running on its own disk. Actually, the 'disk'
is an Amazon Web Services EBS volume.
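
As a rough sketch of how the facet query above could be narrowed, the following Python builds the same request with Solr's `facet.mincount` and `facet.limit` parameters, so that only values occurring at least twice are returned, and then reads the duplicates out of a JSON response. The function names and the default limit of 100 are assumptions for illustration. This only shrinks the response; it may not reduce the heap Solr needs to facet on a field with this many unique values.

```python
from urllib.parse import urlencode


def build_dupe_facet_query(field="id", limit=100):
    """Build a facet query that returns only values seen at least twice.

    Hypothetical helper; facet.mincount and facet.limit are standard Solr
    faceting parameters, the rest mirrors the query in the message.
    """
    params = [
        ("q", "*:*"),
        ("rows", 0),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.mincount", 2),   # skip values that occur only once
        ("facet.limit", limit),  # cap how many facet values come back
        ("wt", "json"),
    ]
    return "select?" + urlencode(params)


def duplicate_ids(response, field="id"):
    """Extract {value: count} for duplicated values from a parsed JSON response.

    Assumes Solr's default flat JSON facet layout: [value1, count1, value2, ...].
    """
    flat = response["facet_counts"]["facet_fields"][field]
    pairs = zip(flat[0::2], flat[1::2])
    return {value: count for value, count in pairs if count > 1}


if __name__ == "__main__":
    print(build_dupe_facet_query())
```

The resulting path is appended to the core's base URL in the same way as the original query.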

-- 
Dotan Cohen

http://gibberish.co.il
http://what-is-what.com 

