commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles <gil...@harfang.homelinux.org>
Subject Re: [math]: [MATH-1330] - KMeans clustering algorithm, doesn't support clustering of sparse input data.
Date Sat, 30 Apr 2016 21:25:32 GMT
Hi.

[Please avoid top-posting.]

On Sat, 30 Apr 2016 23:56:13 +0300, Artem Barger wrote:
> Well, I do not have too much examples in my mind. Actually only one
> practical use case, which made me to think that using "double[]" is 
> not
> very practical. I was trying to cluster Wikipedia dataset, which is 
> very
> sparse (a lot of zero entries) and it looked like huge waste of RAM 
> to
> store zeros.

We'll all agree on this, I expect. ;-)

> Therefore I started to wonder why not to use RealVector
> instead, since it has sparse implementation so I will be able to 
> leverage
> it.

The principle is fine; but I'm wary to use "RealVector" in new code
since it must be refactored...

> Right now using kmeans++ clustering algorithm provided by 
> common.maths
> it's not doable to cluster entire wikipedia dataset or any other huge
> datasets.

Could you expand on this application?  What is the data?

>
> Another possible alternative is to implement SparseClusterable 
> (inherits
> from Clusterable) and the sparse measure which will inherit from
> DistanceMeasure and will provide metric computation for such sparse
> representation.

IIUC it would require changing "Clusterable" anyways since its method
returns a "double[]".
But it may be OK if the final goal is worth it.

[One of the things to bear in mind while testing new implementations
is to not loose performance on the other classes of problems (or you'll
take heat from some of this list's observers...).]


Best regards,
Gilles

>
>
> Best regards,
>                       Artem Barger.
>
> On Sat, Apr 30, 2016 at 11:41 PM, Gilles 
> <gilles@harfang.homelinux.org>
> wrote:
>
>> On Mon, 25 Apr 2016 15:52:03 +0300, Artem Barger wrote:
>>
>>> Hi All,
>>>
>>> I'd like to provide a solution for [MATH-1330] issue. Before 
>>> starting I
>>> have a concerns regarding the possible design and the actual
>>> implementation.
>>>
>>> Currently all implementations of Clusterer interface expect to 
>>> receive
>>> instance of DistanceMeasure class, which used to compute distance 
>>> or
>>> metric
>>> between two points. Switching clustering algorithms to work with 
>>> Vectors
>>> will make this unnecessary, therefore there will be no need to 
>>> provide
>>> DistanceMeasure, since Vector class already provides methods to 
>>> compute
>>> vector norms.
>>>
>>
>> I think that reasons for using "double[]" in the 
>> "o.a.c.m.ml.clustering"
>> package were:
>>  * simple and straightforward (fixed dimension Cartesian 
>> coordinates)
>>  * not couple it with the "o.a.c.m.linear" package whose 
>> "RealVector" is
>>    for variable size sequences of elements (and is also, 
>> inconsistently,
>>    used as a Cartesian vector, and also as column- and 
>> row-matrix[1])
>>
>> It is arguable adapted for a family of problems which the developer
>> probably had in mind when taking those design decisions.
>>
>> It would be interesting to know for which class of problems, the 
>> design
>> is inappropriate, in order to clarify ideas.
>>
>> The main drawback of this approach is that we will loose the ability 
>> to
>>> control which metric to use during clustering, however the only 
>>> classes
>>> which make an implicit use of this parameters are: Clusterer and
>>> KmeansPlusPlusClusterer all others assumes EucledianDistance by 
>>> default.
>>>
>>
>> There is a default indeed, but all "Clusterer" implementations use
>> whatever "DistanceMeasure" has been passed to the constructor.
>>
>> Assuming that "RealVector" knows how to compute the distance means 
>> that
>> users will have to implement their own subclass of "RealVector" and
>> override "getDistance(RealVector)" if they want another distance.
>> Alternatively, CM would have to define all these classes.
>>
>> At first sight, it does not seem the right way to go...
>>
>> One of the possible approaches is to extend DistanceMeasure 
>> interface to be
>>> able to compute distance between two vectors? After all it's only 
>>> sub
>>> first
>>> vector from the second and compute desired norm on the result.
>>>
>>
>> Seems good (at first sight) but (IMHO) only if we implement a new
>> "CartesianVector" class unencumbered with all the cruft of 
>> "RealVector".
>>
>> Another possible solution is to make vector to return it's 
>> coordinates,
>>> hence it avail us to use DistanceMeasure as is. Personally I do not 
>>> think
>>> this is good approach, since it will make no sense with sparse 
>>> vectors.
>>>
>>
>> Ruled out indeed if it conflicts with your intended usage.
>>
>> Last alternative this comes to my mind is to create a set of enums 
>> to
>>> indicate which vector norm to use to compute distances, also do no 
>>> think
>>> this is very good solution, since sounds too intrusive and might 
>>> break
>>> backward compatibility.
>>>
>>
>> And forward compatibility (clustering code will have to be adapted 
>> if
>> another distance is added later).
>>
>> What do you think? Am I missing something? Is there a better 
>> possible way
>>> to achieve the goal?
>>>
>>
>> As indicated above, a practical example might help visualize the 
>> options.
>>
>>
>> Regards,
>> Gilles
>>
>> [1] Cf. https://issues.apache.org/jira/browse/MATH-765
>>
>>
>>> Best regards,
>>>                       Artem Barger.
>>>
>>
>>


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message