Mailing-List: contact issues-help@commons.apache.org; run by ezmlm
Precedence: bulk
Reply-To: issues@commons.apache.org
Date: Fri, 26 Feb 2016 10:53:18 +0000 (UTC)
From: "Artem Barger (JIRA)" <jira@apache.org>
To: issues@commons.apache.org
Message-ID: <JIRA.12944854.1456483845000.154268.1456483998048@Atlassian.JIRA>
In-Reply-To: <JIRA.12944854.1456483845000@Atlassian.JIRA>
References: <JIRA.12944854.1456483845000@Atlassian.JIRA>
 <JIRA.12944854.1456483845557@arcas>
Subject: [jira] [Updated] (MATH-1330) KMeans clustering algorithm, doesn't
 support clustering of sparse input data.
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/MATH-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Artem Barger updated MATH-1330:
-------------------------------
    Description: 
Currently the *KMeansPlusPlusClusterer* class requires its generic parameter *T* to extend the *Clusterable* interface, which is:
{quote}
public interface Clusterable {

    /**
     * Gets the n-dimensional point.
     *
     * @return the point array
     */
    double[] getPoint();
}
{quote}
i.e. it returns a dense representation of the clusterable data, which makes it impossible to run k-means clustering efficiently on high-dimensional but very sparse data. I think it would be much better if the *Clusterable* interface returned a *Vector*, allowing the use of *SparseVector*s while clustering the data. Of course, the *KMeansPlusPlusClusterer* implementation, and I assume the other clustering implementations, would have to be refactored accordingly to support this.
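
To illustrate why a sparse representation matters here, below is a minimal sketch, not a proposed patch. The names *SparseClusterable*, *getNonZeroEntries*, *getDimension* and *SparseClusteringSketch* are hypothetical and do not exist in Commons Math. It assumes centroids stay dense while points are sparse, and uses the identity ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2 so that the distance computation visits only the non-zero entries of the point:

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sparse counterpart of Clusterable; NOT part of Commons Math. */
interface SparseClusterable {
    /** Non-zero coordinates of the point, keyed by dimension index. */
    Map<Integer, Double> getNonZeroEntries();

    /** Total number of dimensions, including the implicit zeros. */
    int getDimension();
}

/** Illustration only: distance between a sparse point and a dense centroid. */
final class SparseClusteringSketch {

    private SparseClusteringSketch() { }

    /** Squared Euclidean norm of a dense vector; can be cached once per centroid. */
    static double squaredNorm(double[] v) {
        double sum = 0.0;
        for (double x : v) {
            sum += x * x;
        }
        return sum;
    }

    /**
     * Squared Euclidean distance via ||x||^2 - 2 x.c + ||c||^2.
     * The loop runs over the non-zero entries of the point only,
     * not over all dimensions of the centroid.
     */
    static double squaredDistance(SparseClusterable point,
                                  double[] centroid,
                                  double centroidSquaredNorm) {
        double pointSquaredNorm = 0.0;
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : point.getNonZeroEntries().entrySet()) {
            double x = e.getValue();
            pointSquaredNorm += x * x;
            dot += x * centroid[e.getKey()];
        }
        // Clamp at zero to guard against small negative values from rounding.
        return Math.max(0.0, pointSquaredNorm - 2.0 * dot + centroidSquaredNorm);
    }

    /** Convenience factory wrapping a map of non-zero entries. */
    static SparseClusterable of(final int dimension, final Map<Integer, Double> entries) {
        return new SparseClusterable() {
            @Override
            public Map<Integer, Double> getNonZeroEntries() {
                return entries;
            }

            @Override
            public int getDimension() {
                return dimension;
            }
        };
    }

    public static void main(String[] args) {
        // Point (0, 2, 0, 0, 3) stored as two non-zero entries.
        Map<Integer, Double> entries = new TreeMap<>();
        entries.put(1, 2.0);
        entries.put(4, 3.0);
        SparseClusterable point = of(5, entries);

        double[] centroid = {1.0, 0.0, 0.0, 0.0, 1.0};
        System.out.println(squaredDistance(point, centroid, squaredNorm(centroid)));
    }
}
```

With the current *double[] getPoint()* contract every distance evaluation costs time proportional to the full dimension; in the sketch it costs time proportional to the number of non-zero entries, provided the centroid norm is computed once per iteration and reused.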

  was:
Currently the `KMeansPlusPlusClusterer` class requires its generic parameter `T` to extend the `Clusterable` interface, which is:
```
public interface Clusterable {

    /**
     * Gets the n-dimensional point.
     *
     * @return the point array
     */
    double[] getPoint();
}
```
i.e. it returns a dense representation of the clusterable data, which makes it impossible to run k-means clustering efficiently on high-dimensional but very sparse data. I think it would be much better if the `Clusterable` interface returned a `Vector`, allowing the use of `SparseVector`s while clustering the data. Of course, the `KMeansPlusPlusClusterer` implementation, and I assume the other clustering implementations, would have to be refactored accordingly to support this.


> KMeans clustering algorithm, doesn't support clustering of sparse input data.
> -----------------------------------------------------------------------------
>
>                 Key: MATH-1330
>                 URL: https://issues.apache.org/jira/browse/MATH-1330
>             Project: Commons Math
>          Issue Type: Improvement
>            Reporter: Artem Barger
>
> Currently the *KMeansPlusPlusClusterer* class requires its generic parameter *T* to extend the *Clusterable* interface, which is:
> {quote}
> public interface Clusterable {
>     /**
>      * Gets the n-dimensional point.
>      *
>      * @return the point array
>      */
>     double[] getPoint();
> }
> {quote}
> i.e. it returns a dense representation of the clusterable data, which makes it impossible to run k-means clustering efficiently on high-dimensional but very sparse data. I think it would be much better if the *Clusterable* interface returned a *Vector*, allowing the use of *SparseVector*s while clustering the data. Of course, the *KMeansPlusPlusClusterer* implementation, and I assume the other clustering implementations, would have to be refactored accordingly to support this.

