Date: Wed, 2 Apr 2014 22:49:23 -0700
Subject: Re: [DISCUSS] Proposal for a Black Duck POC
From: Roman Shaposhnik
To: "general@incubator.apache.org"

On Mon, Mar 31, 2014 at 2:13 AM, Rob Vesse wrote:
> Roman
>
> Black Duck software certainly have a useful platform though it would be
> useful to know what they are considering using for the POC.

I think it would be fair to say that they are looking at us to set
requirements for this POC. This doesn't seem to be a drive-by software
donation on their part, but rather a genuine interest in seeing how
their software *and* services can be leveraged by ASF.

> I would certainly recommend trying a POC but I'm not sure it is
> necessarily something you'd want to impose on all incoming projects in the
> long term.

Indeed. The proof is very much in the pudding. Personally I'm curious
and willing to collaborate with BlackDuck. And it looks like at least
you and Jim fall into the same category. It'll be fun!

> My main concerns are that Protex while very useful is somewhat dumb
> primarily due to the quality of its knowledge base. For those who aren't
> aware essentially the tool scans the code looking for files that have
> "signatures" that match other open source/proprietary code in the
> knowledge base. The open source code is scraped from all sorts of public
> sites like SourceForge, GitHub, BitBucket etc.
> For each match that occurs
> someone has to review the match and then they can indicate whether to
> exclude that match I.e. it was a false positive or to accept that match
> and attribute it appropriately.
>
> This is great in principle because it easily spots obvious plagiarism when
> it occurs. The problem from my point of view is that the false positive
> rate is very high and then you have to go through all the matches and
> manually state whether they are valid/invalid. This ends up being very
> time consuming because for each match on your code you have to review all
> the possible matches to see if there actually is a genuine match and if
> not then go through a process of telling the tool
>
> This is where the knowledge base starts to hurt you, there are lots of
> projects out there which check in everything including things like
> auto-generated IDE project files, build tool reports, VCS ignore files etc
> which tend to have very high similarity and get flagged up as false
> positives constantly. Ideally Apache projects won't themselves be
> checking these things in so the chances of these getting flagged should be
> low.
>
> As a more practical example I had a recent case where I was working
> through an analysis on some Hadoop related code my company is considering
> open sourcing which is primarily a collection of implementations of
> InputFormat and OutputFormat. A good number of our code files were
> flagged as potential matches and when reviewed the only similarity was
> that we had the same set of imports as many other Hadoop ecosystem
> projects. This is of course exacerbated by the fact that many developers
> use IDEs which organise their imports! So I had to spend several hours
> checking each file and ticking boxes in Protex to say that this was
> original code and not plagiarised.
>
> I would definitely recommend carrying out a POC and seeing what people
> make of it but be aware that it can be a painful and time consuming
> process.

Good points! Definitely worth keeping in mind.

Thanks,
Roman.
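
As a rough illustration of the false-positive pattern Rob describes above (files flagged only because they share the same import block), here is a toy sketch. It is not how Protex works internally, which this thread does not describe; it assumes a naive scheme that hashes each normalized source line and compares files by Jaccard similarity. The function names and the two class names (TripleInputFormat, LogLineInputFormat) are hypothetical.

```python
import hashlib


def line_signatures(source, skip_imports=False):
    """Hash each non-blank, whitespace-normalized line (toy 'signature')."""
    sigs = set()
    for line in source.splitlines():
        norm = " ".join(line.split())
        if not norm:
            continue
        # Optionally ignore boilerplate lines that many projects share.
        if skip_imports and norm.startswith(("import ", "package ")):
            continue
        sigs.add(hashlib.sha1(norm.encode("utf-8")).hexdigest())
    return sigs


def similarity(a, b, skip_imports=False):
    """Jaccard similarity between the line-signature sets of two files."""
    sa = line_signatures(a, skip_imports)
    sb = line_signatures(b, skip_imports)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Import block of the kind an IDE would organise identically in many
# Hadoop ecosystem projects.
IMPORTS = """import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
"""

# Two unrelated (hypothetical) classes that share only the imports.
OURS = IMPORTS + """
public class TripleInputFormat extends FileInputFormat<LongWritable, Text> {
    // original implementation elided
}
"""

THEIRS = IMPORTS + """
public class LogLineInputFormat extends FileInputFormat<LongWritable, Text> {
    // unrelated implementation elided
}
"""

print("with imports:   ", similarity(OURS, THEIRS))
print("without imports:", similarity(OURS, THEIRS, skip_imports=True))
```

In this toy setup the first score is dominated by the shared import lines, and it drops sharply once they are excluded. The residual overlap comes from trivial lines such as a lone closing brace, which is the same kind of noise as the IDE project files and VCS ignore files mentioned above.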