Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 608A6200AF7 for ; Tue, 31 May 2016 00:28:53 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 5F2D9160A3C; Mon, 30 May 2016 22:28:53 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id A759B160A19 for ; Tue, 31 May 2016 00:28:52 +0200 (CEST) Received: (qmail 49785 invoked by uid 500); 30 May 2016 22:28:51 -0000 Mailing-List: contact dev-help@commons.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: "Commons Developers List" Delivered-To: mailing list dev@commons.apache.org Received: (qmail 49773 invoked by uid 99); 30 May 2016 22:28:51 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 30 May 2016 22:28:51 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 0FFB5C0E0E for ; Mon, 30 May 2016 22:28:51 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.979 X-Spam-Level: * X-Spam-Status: No, score=1.979 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H3=-0.01, RCVD_IN_MSPIKE_WL=-0.01, SPF_PASS=-0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=bargr-net.20150623.gappssmtp.com Received: from mx2-lw-us.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id 9vrQQIBQUudS for ; Mon, 30 May 2016 22:28:50 +0000 (UTC) Received: from mail-it0-f49.google.com (mail-it0-f49.google.com [209.85.214.49]) by mx2-lw-us.apache.org (ASF Mail Server at mx2-lw-us.apache.org) with ESMTPS id A6DB25FACF for ; Mon, 30 May 2016 22:28:49 +0000 (UTC) Received: by mail-it0-f49.google.com with SMTP id z123so40143269itg.0 for ; Mon, 30 May 2016 15:28:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bargr-net.20150623.gappssmtp.com; s=20150623; h=mime-version:date:message-id:subject:from:to; bh=/vFlKgc85Iw6XxXEYJQQmqGyY27bIF0B5bXH2tjb+Mg=; b=HqQxwuYTUjobFbZPt+LQRE5F62DCeUwr/3DMX9dVlGZtiGOZYm4/L5yM/gZukMLuU1 43Oy53ZCRBXM4aqF+f8w4U3vzzNroEdJkk0h+fhEOimZa5B/FHCBwe8PRnGok4x1gjhw EQ6QWSB8ALD8HgzVoGE38c4FbXqaCp3gO/3oIBSeXFvYa9GZe76k1SpiL5X2qOLtKG0P u+r2XClHZYB3c0tQtdtGL8yCwCQfQhBR6UjgfSovltUgLb+zmJ4Q2TpN6506l72Ujzcy bZwGtaKLhUpdudKcR2sQEa7fCjt/5MzLHXLEahj2sd0DxxDNaxFw3Do6ebRM52SZK10+ Yt4g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:date:message-id:subject:from:to; bh=/vFlKgc85Iw6XxXEYJQQmqGyY27bIF0B5bXH2tjb+Mg=; b=EF3XQjZ9xTs27nRI8KQFe6jSyk///2N/ydYZ9onBuylnoKFGBjVjUfdA22afb2dxFr KZvVNlTCQVQ2j9+v5soGoLzxhcgYbJ1X+ufrMCdheSoyy31hXwW5rbCoUwIyfnyX/Hpn QRzF2Hztig/MvyyHTWQkY3tJ/OD9njV655iqaBOtoZ/PllcMVl5S4qhMTXpazpgMQAtA Htly6Q+bZ2dI8YQJzMXJVGSYJ5vR3b2m7/v/zVMK3M7m1o8J4YEefZjfHYMzB3JTeJyR F+AV0h0fgeKgUxYGKRT6r2sVcjGA/AJ33LYoLcdTIW/3g6zklnX+addEduUmsTjjx9wL Tqvg== X-Gm-Message-State: ALyK8tKHBeVlUav0LjbzoyP7JreEq2abjRqFS2D8z4GNzMTjE6Bkty12sD28nr9PGcO7/BruEtjako0JWwA3xg== MIME-Version: 1.0 X-Received: by 10.36.149.215 with SMTP id m206mr10415258itd.20.1464647328815; Mon, 30 May 2016 15:28:48 -0700 (PDT) Received: by 10.36.50.87 with HTTP; Mon, 30 May 2016 15:28:48 -0700 (PDT) Date: Tue, 31 May 2016 01:28:48 +0300 Message-ID: Subject: [Math] MATH-1371: Provide accelerated kmeans++ implementation From: Artem Barger To: Commons Developers List Content-Type: multipart/alternative; boundary=94eb2c05e21ac1814f053416c721 archived-at: Mon, 30 May 2016 22:28:53 -0000 --94eb2c05e21ac1814f053416c721 Content-Type: text/plain; charset=UTF-8 Hi, I've used out of the box current KMeansPlusPlusClusterer implementation provided by CM, however saw that it doesn't scales well on large data volumes. One of the proposals to improve current implementation was submitted in JIRA-1330 is to provide support for sparse big data, i.e. make clustering algorithm to work w/ SparseVectors. While working on 1330, I've bumped into: Elkan, Charles. "Using the triangle inequality to accelerate k-means." ICML. Vol. 3. 2003. paper which described additional possibility of scaling up performance of kmeans algorithm. Method based on usage of triangle inequality to avoid unnecessary distance computations cause by small movement of the cluster center which doesn't affect assignment of given point to the cluster. Simply tests using PerfTestUtils shows that using Elkan's algorithm over the standard provided currently in CM could achieve performance boost in order of magnitude for significantly large inputs. I've opened MATH-1371 "Provide accelerated kmeans++ implementation" https://issues.apache.org/jira/browse/MATH-1371 and attached my implementation there. Will be glad to receive review and comments about it. I believe that switching to this algorithm instead of regular one could improve quality of CM provided solution for kmeans. Best regards, Artem Barger. --94eb2c05e21ac1814f053416c721--