From: Tommaso Teofili
Date: Tue, 23 Jul 2013 16:25:28 +0200
Subject: Re: Document Similarity Algorithm at Solr/Lucene
To: solr-user@lucene.apache.org

If you need a specialized algorithm for detecting blog post plagiarism /
quotations (which are different tasks IMHO), I think you have 2 options:

1. implement a dedicated one based on your features / metrics / domain
2. try to fine-tune an existing algorithm that is flexible enough

If I were to do it with Solr I'd probably do something like:

1. index "original" blog posts in Solr (possibly using Jack's suggestion
   about ngrams / shingles)
2. do MLT queries with the text of the "candidate blog post copies"
3. get the first, say, 2-3 hits
4. mark it as quote / plagiarism
5. eventually train a classifier to help you mark other texts as quote /
   plagiarism

HTH,
Tommaso

2013/7/23 Furkan KAMACI

> Actually I need a specialized algorithm. I want to use that algorithm to
> detect duplicate blog posts.
>
> 2013/7/23 Tommaso Teofili
>
> > Hi,
> >
> > you may leverage and / or improve the MLT component [1].
> >
> > HTH,
> > Tommaso
> >
> > [1] : http://wiki.apache.org/solr/MoreLikeThis
> >
> >
> > 2013/7/23 Furkan KAMACI
> >
> > > Hi;
> > >
> > > Sometimes a huge part of a document may exist in another document, as
> > > in student plagiarism or the quotation of one blog post in another.
> > > Do Solr/Lucene or their libraries (UIMA, OpenNLP, etc.) have any
> > > class to detect it?
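
The workflow suggested in the reply (index the originals as shingles, query with the candidate text, keep the top 2-3 hits, then mark) can be sketched outside Solr to see how the pieces fit. This is only a rough stand-in, not Solr's MLT scoring: it uses plain word shingles and a containment ratio (the share of the candidate's shingles that also occur in an original). The function names, the shingle size of 3, and the 0.5 threshold are all assumptions made for illustration.

```python
import re


def shingles(text, size=3):
    """Return the set of word n-grams (shingles) of `size` words in `text`."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        # Texts shorter than one shingle become a single short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def containment(candidate_shingles, original_shingles):
    """Fraction of the candidate's shingles that also appear in the original."""
    if not candidate_shingles:
        return 0.0
    shared = candidate_shingles & original_shingles
    return len(shared) / len(candidate_shingles)


def top_hits(candidate, originals, k=3, size=3):
    """Rank originals (a dict of id -> text) against the candidate text.

    Plays the role of "run the query, take the first 2-3 hits";
    originals with no shared shingle are dropped.
    """
    cand = shingles(candidate, size)
    scored = [
        (containment(cand, shingles(text, size)), doc_id)
        for doc_id, text in originals.items()
    ]
    # Highest score first; ties broken by id so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:k] if score > 0]


def looks_copied(candidate, originals, threshold=0.5):
    """Mark the candidate when its best hit reaches the (assumed) threshold."""
    hits = top_hits(candidate, originals)
    return bool(hits) and hits[0][1] >= threshold
```

In a real setup the ranking would come from Solr itself, and the threshold (or the classifier mentioned in step 5) would have to be tuned on labelled examples of quotes versus plagiarism.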