commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Gilles <gil...@harfang.homelinux.org>
Subject Re: [math]: [MATH-1330] - KMeans clustering algorithm, doesn't support clustering of sparse input data.
Date Wed, 04 May 2016 15:17:40 GMT
Hello.

On Sun, 1 May 2016 02:57:59 +0300, Artem Barger wrote:
> ​Hi,​
>
>
> On Sun, May 1, 2016 at 12:25 AM, Gilles 
> <gilles@harfang.homelinux.org>
> wrote:
>
> Therefore I started to wonder why not to use RealVector
>>
>>> instead, since it has sparse implementation so I will be able to 
>>> leverage
>>> it.
>>>
>>
>> The principle is fine; but I'm wary to use "RealVector" in new code
>> since it must be refactored...
>>
>>
> ​By refactoring, do you mean MATH-765​?

Yes.

>
>
>
>> Right now using kmeans++ clustering algorithm provided by 
>> common.maths
>>> it's not doable to cluster entire wikipedia dataset or any other 
>>> huge
>>> datasets.
>>>
>>
>> Could you expand on this application?  What is the data?
>>
>>
> ​The data is tf-idf matrix of wikipedia data set produced by gensim. 
> I'm
> trying to run coreset algorithm with Spark streaming
> and I need kmeans++ for local computation.​
>
>
>>
>> [One of the things to bear in mind while testing new implementations
>> is to not loose performance on the other classes of problems (or 
>> you'll
>> take heat from some of this list's observers...).]
>>
>>
> ​Is there any performance tests or regression, so it could be used to
> validate whenever new implementation doesn't introduced
> significant regression?

Not really.

There is a class "PertTestUtils" (in the "test" part of the source code
repository) that could compare execution time between alternative
implementations of some functionality.
I think that it's reliable enough to detect significant performance 
issues.

Otherwise, there is the (now "standard"?) JMH benchmarking tool.

Regards,
Gilles


>>>
>>> On Sat, Apr 30, 2016 at 11:41 PM, Gilles 
>>> <gilles@harfang.homelinux.org>
>>> wrote:
>>>
>>> On Mon, 25 Apr 2016 15:52:03 +0300, Artem Barger wrote:
>>>>
>>>> Hi All,
>>>>>
>>>>> I'd like to provide a solution for [MATH-1330] issue. Before 
>>>>> starting I
>>>>> have a concerns regarding the possible design and the actual
>>>>> implementation.
>>>>>
>>>>> Currently all implementations of Clusterer interface expect to 
>>>>> receive
>>>>> instance of DistanceMeasure class, which used to compute distance 
>>>>> or
>>>>> metric
>>>>> between two points. Switching clustering algorithms to work with 
>>>>> Vectors
>>>>> will make this unnecessary, therefore there will be no need to 
>>>>> provide
>>>>> DistanceMeasure, since Vector class already provides methods to 
>>>>> compute
>>>>> vector norms.
>>>>>
>>>>>
>>>> I think that reasons for using "double[]" in the 
>>>> "o.a.c.m.ml.clustering"
>>>> package were:
>>>>  * simple and straightforward (fixed dimension Cartesian 
>>>> coordinates)
>>>>  * not couple it with the "o.a.c.m.linear" package whose 
>>>> "RealVector" is
>>>>    for variable size sequences of elements (and is also, 
>>>> inconsistently,
>>>>    used as a Cartesian vector, and also as column- and 
>>>> row-matrix[1])
>>>>
>>>> It is arguable adapted for a family of problems which the 
>>>> developer
>>>> probably had in mind when taking those design decisions.
>>>>
>>>> It would be interesting to know for which class of problems, the 
>>>> design
>>>> is inappropriate, in order to clarify ideas.
>>>>
>>>> The main drawback of this approach is that we will loose the 
>>>> ability to
>>>>
>>>>> control which metric to use during clustering, however the only 
>>>>> classes
>>>>> which make an implicit use of this parameters are: Clusterer and
>>>>> KmeansPlusPlusClusterer all others assumes EucledianDistance by 
>>>>> default.
>>>>>
>>>>>
>>>> There is a default indeed, but all "Clusterer" implementations use
>>>> whatever "DistanceMeasure" has been passed to the constructor.
>>>>
>>>> Assuming that "RealVector" knows how to compute the distance means 
>>>> that
>>>> users will have to implement their own subclass of "RealVector" 
>>>> and
>>>> override "getDistance(RealVector)" if they want another distance.
>>>> Alternatively, CM would have to define all these classes.
>>>>
>>>> At first sight, it does not seem the right way to go...
>>>>
>>>> One of the possible approaches is to extend DistanceMeasure 
>>>> interface to
>>>> be
>>>>
>>>>> able to compute distance between two vectors? After all it's only 
>>>>> sub
>>>>> first
>>>>> vector from the second and compute desired norm on the result.
>>>>>
>>>>>
>>>> Seems good (at first sight) but (IMHO) only if we implement a new
>>>> "CartesianVector" class unencumbered with all the cruft of 
>>>> "RealVector".
>>>>
>>>> Another possible solution is to make vector to return it's 
>>>> coordinates,
>>>>
>>>>> hence it avail us to use DistanceMeasure as is. Personally I do 
>>>>> not
>>>>> think
>>>>> this is good approach, since it will make no sense with sparse 
>>>>> vectors.
>>>>>
>>>>>
>>>> Ruled out indeed if it conflicts with your intended usage.
>>>>
>>>> Last alternative this comes to my mind is to create a set of enums 
>>>> to
>>>>
>>>>> indicate which vector norm to use to compute distances, also do 
>>>>> no think
>>>>> this is very good solution, since sounds too intrusive and might 
>>>>> break
>>>>> backward compatibility.
>>>>>
>>>>>
>>>> And forward compatibility (clustering code will have to be adapted 
>>>> if
>>>> another distance is added later).
>>>>
>>>> What do you think? Am I missing something? Is there a better 
>>>> possible way
>>>>
>>>>> to achieve the goal?
>>>>>
>>>>>
>>>> As indicated above, a practical example might help visualize the 
>>>> options.
>>>>
>>>>
>>>> Regards,
>>>> Gilles
>>>>
>>>> [1] Cf. https://issues.apache.org/jira/browse/MATH-765
>>>>
>>>>
>>>> Best regards,
>>>>>                       Artem Barger.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@commons.apache.org
For additional commands, e-mail: dev-help@commons.apache.org


Mime
View raw message