From: David MARTIN
Date: Sun, 17 Jan 2010 00:43:04 +0100
Subject: Re: Encountering a roadblock with my Solr schema design...use dedupe?
To: solr-user@lucene.apache.org
Reply-To: solr-user@lucene.apache.org
Message-ID: <54eb108f1001161543q7689d04amda1dd5b3f748e669@mail.gmail.com>
In-Reply-To: <27155115.post@talk.nabble.com>
References: <27118977.post@talk.nabble.com> <27131969.post@talk.nabble.com>
 <87c998321001121839saf51aa2vff1f3a36e11b0b4d@mail.gmail.com>
 <27155115.post@talk.nabble.com>

I'm really interested in the answer to this thread, as my problem is much
the same. The main difference is probably the very large number of SKUs I
may have per product.

David

On Thu, Jan 14, 2010 at 2:35 AM, Kelly Taylor wrote:

> Hoss,
>
> Would you suggest using dedup for my use case, and if so, do you know of a
> working example I can reference?
>
> I don't have an issue using the patched version of Solr, but I'd much
> rather use the GA version.
>
> -Kelly
>
>
> hossman wrote:
> >
> > : Dedupe is completely the wrong word. Deduping is something else
> > : entirely - it is about trying not to index the same document twice.
> >
> > Dedup can also certainly be used with field collapsing -- that was one of
> > the initial use cases identified for the SignatureUpdateProcessorFactory
> > ... you can compute an 'expensive' signature when adding a document,
> > index it, and then FieldCollapse on that signature field.
> >
> > This gives you "query time deduplication" based on a value computed when
> > indexing (the canonical example is multiple URLs referencing the "same"
> > content but with slightly different boilerplate markup). You can use a
> > Signature class that recognizes the boilerplate and computes an identical
> > signature value for each URL whose content is "the same" but still index
> > all of the URLs and their content as distinct documents ... so use cases
> > where people only want "distinct" URLs work using field collapse, but by
> > default all matching documents can still be returned and searches on text
> > in the boilerplate markup also still work.
> >
> > -Hoss
> >
>
> --
> View this message in context:
> http://old.nabble.com/Encountering-a-roadblock-with-my-Solr-schema-design...use-dedupe--tp27118977p27155115.html
> Sent from the Solr - User mailing list archive at Nabble.com.
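
As a rough illustration of the approach Hoss describes, here is a minimal
Python sketch. It is not Solr code and does not use any Solr API. The
boilerplate rule (anything inside <nav>...</nav>), the signature() helper
standing in for a Signature class, and the search(..., collapse=True) flag
standing in for field collapsing are all assumptions made up for this
example, as are the example.com URLs.

```python
import hashlib
import re

# Hypothetical boilerplate rule: treat anything inside <nav>...</nav> as
# boilerplate. A real Signature class would apply its own rules.
BOILERPLATE = re.compile(r"<nav>.*?</nav>", re.DOTALL)


def signature(content: str) -> str:
    """Compute a signature that ignores boilerplate, case, and whitespace."""
    core = BOILERPLATE.sub("", content)
    core = " ".join(core.split()).lower()
    return hashlib.md5(core.encode("utf-8")).hexdigest()


def index(docs):
    """Index every document unchanged, adding a computed 'signature' field."""
    return [dict(doc, signature=signature(doc["content"])) for doc in docs]


def search(indexed, term, collapse=False):
    """Return matching docs; optionally keep only the first doc per signature."""
    hits = [d for d in indexed if term.lower() in d["content"].lower()]
    if not collapse:
        return hits
    seen = set()
    collapsed = []
    for d in hits:
        if d["signature"] not in seen:
            seen.add(d["signature"])
            collapsed.append(d)
    return collapsed


docs = [
    {"url": "http://example.com/a",
     "content": "<nav>Home | About</nav> Solr schema design tips"},
    {"url": "http://example.com/b",
     "content": "<nav>Start | Contact</nav> Solr schema  design tips"},
    {"url": "http://example.com/c",
     "content": "<nav>Home | About</nav> Something else about Solr"},
]
indexed = index(docs)

all_hits = search(indexed, "solr")                 # every matching document
distinct = search(indexed, "solr", collapse=True)  # one document per signature
nav_hits = search(indexed, "contact")              # boilerplate text still matches
```

The signature is computed once, at indexing time, and stored with each
document. Collapsing is a query-time choice, so the same index can serve
both "distinct content only" and "every URL" searches.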