mahout-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAHOUT-1771) Cluster dumper omits indices and 0 elements for dense vector or sparse containing 0s
Date Tue, 08 Sep 2015 14:59:46 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-1771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14734942#comment-14734942 ]

ASF GitHub Bot commented on MAHOUT-1771:
----------------------------------------

GitHub user srowen opened a pull request:

    https://github.com/apache/mahout/pull/158

    MAHOUT-1771 Cluster dumper omits indices and 0 elements for dense vector or sparse containing 0s

    Output indices in cluster representation whenever *any* vector has *some* zero elements that won't be output.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/srowen/mahout MAHOUT-1771

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mahout/pull/158.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #158
    
----
commit a167e13bd9d420c291fbcd8c28cffafe04dc9a4c
Author: Sean Owen <sowen@cloudera.com>
Date:   2015-09-08T14:58:26Z

    Output indices in cluster representation whenever *any* vector has *some* zero elements that won't be output.

----


> Cluster dumper omits indices and 0 elements for dense vector or sparse containing 0s
> ------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1771
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1771
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering, mrlegacy
>    Affects Versions: 0.9
>            Reporter: Sean Owen
>            Priority: Minor
>         Attachments: MAHOUT-1771.patch
>
>
> (EDIT: fixed incorrect analysis)
> Blast from the past -- are patches still being accepted for "mrlegacy" code? Something turned up incidentally when working with a customer that looks like a minor bug in the cluster dumper code.
> In {{AbstractCluster.java}}:
> {code}
> public static List<Object> formatVectorAsJson(Vector v, String[] bindings) throws IOException {
>     boolean hasBindings = bindings != null;
>     boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != v.size();
>     // we assume sequential access in the output
>     Vector provider = v.isSequentialAccess() ? v : new SequentialAccessSparseVector(v);
>     List<Object> terms = new LinkedList<>();
>     String term = "";
>     for (Element elem : provider.nonZeroes()) {
>       if (hasBindings && bindings.length >= elem.index() + 1 && bindings[elem.index()] != null) {
>         term = bindings[elem.index()];
>       } else if (hasBindings || isSparse) {
>         term = String.valueOf(elem.index());
>       }
>       Map<String, Object> term_entry = new HashMap<>();
>       double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
>       if (hasBindings || isSparse) {
>         term_entry.put(term, roundedWeight);
>         terms.add(term_entry);
>       } else {
>         terms.add(roundedWeight);
>       }
>     }
>     return terms;
>   }
> {code}
> The problem is that this never outputs zero-valued elements of a vector, and in some cases it also omits the indices. The output then has fewer entries than the vector has dimensions, and it's not possible to know where the omitted 0s are.
> It will not output indices if the vector is a dense vector, or if the number of non-default elements is the same as the size (which includes sparse vectors that contain 0 values, if those were set explicitly). However, the iteration is over non-zero elements only.
> The fix seems to be to print indices if the number of _non-zero_ elements is less than size, for _any_ vector:
> {code}
>     boolean isSparse = v.getNumNonZeroElements() != v.size();
> {code}
> Pretty straightforward, and minor, but wanted to check with everyone before making a change.
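
To make the issue described above concrete, here is a minimal standalone sketch. It does not use Mahout: the class `ClusterDumpSketch` and the helpers `oldIsSparse`, `newIsSparse`, and `format` are hypothetical names, and a plain `double[]` stands in for a dense `Vector`. It only models the two rules for deciding whether indices are printed, under the assumption that zero elements are always skipped as in the quoted loop.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterDumpSketch {

    // Old rule, modeled on the quoted snippet: print indices only for a
    // non-dense vector whose stored (non-default) entry count differs from its size.
    public static boolean oldIsSparse(boolean dense, int numStored, int size) {
        return !dense && numStored != size;
    }

    // Proposed rule: print indices whenever at least one element is zero,
    // regardless of how the vector is stored.
    public static boolean newIsSparse(double[] values) {
        int nonZero = 0;
        for (double v : values) {
            if (v != 0.0) {
                nonZero++;
            }
        }
        return nonZero != values.length;
    }

    // Mirrors the quoted loop: zeros are skipped, weights are rounded to
    // three decimals, and an index is attached only when withIndices is true.
    public static List<Object> format(double[] values, boolean withIndices) {
        List<Object> terms = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (values[i] == 0.0) {
                continue; // zero elements are never emitted
            }
            double rounded = (double) Math.round(values[i] * 1000) / 1000;
            if (withIndices) {
                Map<String, Object> entry = new LinkedHashMap<>();
                entry.put(String.valueOf(i), rounded);
                terms.add(entry);
            } else {
                terms.add(rounded);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // A "dense" vector with a zero in the middle.
        double[] dense = {1.5, 0.0, 2.25};

        // Old rule: dense, so no indices; the position of the 0 is lost.
        System.out.println(format(dense, oldIsSparse(true, dense.length, dense.length)));

        // Proposed rule: a zero exists, so indices are printed.
        System.out.println(format(dense, newIsSparse(dense)));
    }
}
```

With the old rule the three-dimensional vector is dumped as two bare weights, so a reader cannot tell which dimension was dropped. With the proposed rule each weight carries its index, and the missing index identifies the zero.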



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
