From: Igor Wiese
Date: Wed, 09 Dec 2015 23:25:54 +0000
Subject: Feedback of my Phd work in Cassandra Project
To: dev@cassandra.apache.org

Hi, Cassandra Community.

My name is Igor Wiese, and I am a PhD student from Brazil. I am investigating two important questions: What makes two files change together? Can we predict when they are going to co-change again?

I have investigated these questions on the Cassandra project. I collected data from issue reports, discussions, and commits, and used machine learning techniques to build a prediction model.

I collected a total of 1197 commits in which a pair of files changed together, and I could correctly predict 48% of them. The most useful information for predicting co-changes of files was:

- number of lines of code added,
- number of lines of code removed,
- sum of number of lines of code added, modified and removed,
- number of words used to describe and discuss the issues, and
- median value of closeness, a social network measure obtained from issue comments.

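To make the term "co-change" concrete, here is a minimal sketch in Python of how co-changing file pairs could be counted from a commit history. It is not the tooling used in the study: the function name `count_cochanges` and the sample commits are hypothetical, and a commit is assumed to be simply a list of the file paths it touched.

```python
from collections import Counter
from itertools import combinations


def count_cochanges(commits):
    """Count, for every unordered pair of files, the commits that touched both."""
    pair_counts = Counter()
    for files in commits:
        # set() drops duplicate paths; sorted() gives each pair one canonical order.
        for pair in combinations(sorted(set(files)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical history: each entry lists the files changed in one commit.
commits = [
    ["tools/NodeCmd.java", "tools/NodeProbe.java"],
    ["tools/NodeCmd.java"],
    ["tools/NodeCmd.java", "tools/NodeProbe.java", "README.txt"],
]

counts = count_cochanges(commits)
for pair, n in counts.most_common():
    print(pair, n)
```

A commit that touches only one file contributes no pairs, which matches the distinction drawn in this message between commits where both files changed and commits where only the first file changed.
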
To illustrate, consider the following example from our analysis. For release 1.0, the files "cassandra/tools/NodeCmd.java" and "cassandra/tools/NodeProbe.java" changed together in 16 commits. In another 6 commits, only the first file changed, but not the second. By collecting contextual information for each commit made to the first file in the previous release, we were able to predict all 13 commits in which both files changed together in release 1.0, and we issued only 2 false positives. For this pair of files, the most important contextual information was the number of lines of code added, removed and modified in each commit, the number of words used to describe and discuss the issues, and the number of comments in the issues.

- Do these results surprise you? Can you think of any explanation for the results?
- Do you think that our rate of prediction is good enough to be used for building tool support for the software community?
- Do you have any suggestions on what can be done to improve the change recommendation?

You can visit our webpage to inspect the results in detail:
http://flosscoach.com/index.php/17-cochanges/66-cassandra

All the best,
Igor Wiese
PhD Candidate