mahout-dev mailing list archives

From "Sean Owen (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (MAHOUT-402) NamedVectors are not readily identifiable in vectordumper output
Date Wed, 19 Jan 2011 00:51:46 GMT

     [ https://issues.apache.org/jira/browse/MAHOUT-402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-402.
------------------------------

       Resolution: Fixed
    Fix Version/s: 0.5
         Assignee: Drew Farris

> NamedVectors are not readily identifiable in vectordumper output
> ----------------------------------------------------------------
>
>                 Key: MAHOUT-402
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-402
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.4
>            Reporter: Drew Farris
>            Assignee: Drew Farris
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: MAHOUT-402.patch
>
>
> When dumping a sequence file of Writable,NamedVector using vectordumper in either JSON or standard format, it is not apparent in the output that the vectors are indeed named vectors.
> For example, after applying MAHOUT-401 to produce NamedVectors from seq2sparse, I run:
> {code}
> ./bin/mahout vectordump -j -p -s ~/mahout/reuters-out-seqdir-sparse/tf-vectors/part-00000
> {code}
> And get: 
> {code}
> Input Path: /home/drew/mahout/reuters-out-seqdir-sparse/tf-vectors/part-00000
> /reut2-000.sgm-0.txt    {"class":"org.apache.mahout.math.RandomAccessSparseVector","vector"
[...]
> {code}
> or when removing the -j argument:
> {code}
> /reut2-000.sgm-0.txt    elts: {1026:3.0, 16150:1.0, 3338:3.0, 16147:1.0, 3339:1.0, 12240:1.0,
[...]
> {code}
> The first case, when dumping JSON, is due to the fact that NamedVector simply calls its delegate's asFormatString method. Granted, the naive approach of implementing asFormatString in NamedVector also produces some nasty output:
> {code}
> /reut2-001.sgm-468.txt	{"class":"org.apache.mahout.math.NamedVector","vector":"{\"delegate\":{\"class\":\"org.apache.mahout.math.RandomAccessSparseVector\"
[...]
> {code}
> So a little more thought needs to be given to that approach.
> For the non-JSON format, VectorHelper.vectorToString(..) is the culprit. Would it be OK to do an instanceof NamedVector check here and emit the name?
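
The instanceof idea raised for the non-JSON format could look roughly like the sketch below. It is only an illustration: Vector, SparseVector and NamedVector here are small hypothetical stand-ins defined inside the example, not the real Mahout classes, and the "name: ..." prefix is an assumed output layout rather than what the attached patch actually emits.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Sketch only: every type below is a hypothetical stand-in, not the Mahout API.
final class NamedVectorDumpSketch {

    // Minimal stand-in for a vector that exposes its non-zero elements.
    interface Vector {
        Map<Integer, Double> nonZeroElements();
    }

    // Stand-in for a sparse vector; insertion order is kept for stable output.
    static final class SparseVector implements Vector {
        private final Map<Integer, Double> elements = new LinkedHashMap<>();

        SparseVector set(int index, double value) {
            elements.put(index, value);
            return this;
        }

        @Override
        public Map<Integer, Double> nonZeroElements() {
            return elements;
        }
    }

    // Stand-in for a named vector that wraps a delegate and carries a name.
    static final class NamedVector implements Vector {
        private final Vector delegate;
        private final String name;

        NamedVector(Vector delegate, String name) {
            this.delegate = delegate;
            this.name = name;
        }

        String getName() {
            return name;
        }

        @Override
        public Map<Integer, Double> nonZeroElements() {
            return delegate.nonZeroElements();
        }
    }

    // Formats the elements as "elts: {index:value, ...}" and, when the vector
    // is a NamedVector, prefixes the name (assumed layout: "name: <name> ").
    static String vectorToString(Vector vector) {
        StringJoiner elts = new StringJoiner(", ", "elts: {", "}");
        for (Map.Entry<Integer, Double> e : vector.nonZeroElements().entrySet()) {
            elts.add(e.getKey() + ":" + e.getValue());
        }
        if (vector instanceof NamedVector) {
            return "name: " + ((NamedVector) vector).getName() + " " + elts;
        }
        return elts.toString();
    }

    public static void main(String[] args) {
        Vector plain = new SparseVector().set(1026, 3.0).set(16150, 1.0);
        Vector named = new NamedVector(plain, "/reut2-000.sgm-0.txt");
        System.out.println(vectorToString(plain));
        System.out.println(vectorToString(named));
    }
}
```

Keeping the check in the formatting helper leaves the delegate's own output untouched, so unnamed vectors would dump exactly as before.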

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

