commons-issues mailing list archives

From "Thomas Neidhart (JIRA)" <>
Subject [jira] [Commented] (MATH-1171) clustering implementations have unnecessary overhead
Date Mon, 28 Dec 2015 20:09:49 GMT


Thomas Neidhart commented on MATH-1171:

In commit f0943a724, I added an example to the user guide that shows how to cluster images
with the current API.

I ran some initial experiments on improving the API with a Dataset interface that provides
access to all elements to be clustered, without the need to create explicit Clusterable instances.
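
To illustrate the idea, here is a minimal sketch of what such an index-based interface could
look like. The names Dataset, PackedImageDataset, size(), dimension() and get(index, component)
are hypothetical and chosen only for this example; they are not the interface from the
experiments and not part of Commons Math. The sketch assumes the image is stored as one
double[] with the samples of each pixel interleaved.

```java
// Hypothetical sketch: none of these types exist in Commons Math.
interface Dataset {
    /** Number of elements (e.g. pixels) available for clustering. */
    int size();

    /** Number of components per element (e.g. 3 for RGB). */
    int dimension();

    /** Returns one component of one element without allocating anything. */
    double get(int index, int component);
}

/** Views an interleaved sample array (r,g,b,r,g,b,...) as a Dataset. */
final class PackedImageDataset implements Dataset {
    private final double[] samples;
    private final int channels;

    PackedImageDataset(double[] samples, int channels) {
        if (channels <= 0 || samples.length % channels != 0) {
            throw new IllegalArgumentException("sample count must be a multiple of channels");
        }
        this.samples = samples;   // shared with the caller, not copied
        this.channels = channels;
    }

    @Override
    public int size() {
        return samples.length / channels;
    }

    @Override
    public int dimension() {
        return channels;
    }

    @Override
    public double get(int index, int component) {
        return samples[index * channels + component];
    }
}
```

With this shape, a two-pixel RGB image is a single six-element array, and no per-pixel
object is created when reading it.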

Making image clustering efficient would require some more refactoring to avoid unneeded
allocations of double arrays (an image is usually one large array of its pixels / samples).
The distance API currently only works on whole arrays, without offset / length arguments,
so a separate array must be created for each pixel, which is more or less the same as
creating a Clusterable.

Changing the API to support distance calculations on arrays with offset / length parameters
would make it possible to create a Dataset that operates directly on the image data without
creating intermediate objects. This might be beneficial for other use cases as well.
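
As a rough illustration of such a distance calculation, the sketch below computes a Euclidean
distance between two sub-ranges of possibly the same array. The class OffsetEuclideanDistance
and the signature of compute are assumptions made for this example only; the existing
DistanceMeasure interface takes two whole arrays, and no offset-based variant is implied to
exist in Commons Math.

```java
// Hypothetical sketch: this class and its signature are not part of Commons Math.
final class OffsetEuclideanDistance {
    private OffsetEuclideanDistance() {
        // static helper, not meant to be instantiated
    }

    /**
     * Euclidean distance between a[offsetA .. offsetA+length) and
     * b[offsetB .. offsetB+length). Both ranges are read in place,
     * so no temporary array is allocated per point.
     */
    static double compute(double[] a, int offsetA, double[] b, int offsetB, int length) {
        if (length < 0 || offsetA < 0 || offsetB < 0
                || offsetA + length > a.length || offsetB + length > b.length) {
            throw new IllegalArgumentException("range out of bounds");
        }
        double sum = 0.0;
        for (int i = 0; i < length; i++) {
            double diff = a[offsetA + i] - b[offsetB + i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

For an interleaved image array, the distance between pixel i and pixel j with c channels
would then be compute(image, i * c, image, j * c, c), with both arguments pointing at the
same backing array.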

> clustering implementations have unnecessary overhead
> ----------------------------------------------------
>                 Key: MATH-1171
>                 URL:
>             Project: Commons Math
>          Issue Type: Bug
>            Reporter: Mark
> I want to apply clustering algorithms like KMeansPlusPlusClusterer to pictures. And creating
> a point instance for each pixel is not a good idea.
> Therefore the interface should not be based on Collections, but on some interface that
> provides sort of "get(index)" accessors to data that is potentially stored in a pixel array

