mahout-dev mailing list archives

From "Shige Takeda (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAHOUT-594) FileWriter may garble non-ASCII output if the environment variable LANG/LC_ALL is not appropriate.
Date Thu, 27 Jan 2011 00:31:45 GMT

    [ https://issues.apache.org/jira/browse/MAHOUT-594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987312#action_12987312 ]

Shige Takeda commented on MAHOUT-594:
-------------------------------------

No problem as long as this issue is addressed. Thanks!

> FileWriter may garble non-ASCII output if the environment variable LANG/LC_ALL is not
appropriate.
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-594
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-594
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.4
>         Environment: RHL Linux 2.6.18
>            Reporter: Shige Takeda
>            Assignee: Sean Owen
>            Priority: Minor
>             Fix For: 0.5
>
>         Attachments: 0001-set-file-reader-and-writer-character-encoding-to-utf.patch
>
>
> For non-ASCII output data, java.io.FileWriter should be replaced with java.io.OutputStreamWriter
in UTF-8.
> For example, if you dump centroids of clusters using ClusterDumper, you may get the following
output:
> {noformat}
> ...
> C-0{n=2 c=[brown:2.099, c?t:1.957, dogs:1.916, fox:0.652, jumped:2.099, l?zy:1.884, over:2.099,
quick:2.099, red:1.916, ?:0.871, ?:0.871, ?:0.871, ?:0.871] r=[c?t:0.652, fox:0.652, l?zy:1.131,
?:0.871, ?:0.871, ?:0.871, ?:0.871]}
>     Top Terms:
>         quick                                   =>  2.0986123085021973
>         over                                    =>  2.0986123085021973
>         jumped                                  =>  2.0986123085021973
>         brown                                   =>  2.0986123085021973
>         c?t                                     =>   1.957078456878662
>         red                                     =>  1.9162907600402832
>         dogs                                    =>  1.9162907600402832
>         l?zy                                    =>  1.8843144178390503
>         ?                                       =>  0.8706584572792053
>         ?                                       =>  0.8706584572792053
>     Weight:  Point:
>     1.0: P(0) = [brown:2.099, dogs:1.916, fox:2.609, jumped:2.099, over:2.099, quick:2.099,
red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
>     1.0: P(1) = [brown:2.099, dogs:1.916, fox:2.609, jumped:2.099, over:2.099, quick:2.099,
red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
>     1.0: P(2) = [brown:2.099, c?t:2.609, dogs:1.916, jumped:2.099, over:2.099, quick:2.099,
red:1.916, ?:2.322, ?:2.322, ?:2.322, ?:2.322]
> ...
> {noformat}
> where "?" characters were garbled by FileWriter. NOTE: this test case is a tweaked version
of TestClusterDumper, e.g., lazy => läzy
> The cause of this is the line in ClusterDumper.java:
> {code}
> Writer writer = this.outputFile == null ? new OutputStreamWriter(System.out) : new FileWriter(this.outputFile);
> {code}
> This can be worked around by setting the environment variables LC_ALL/LANG to en_US.UTF-8, but
many environments have LC_ALL/LANG=C by default, and in some cases you may have no choice
other than C for various reasons.
> To address this issue, I propose hard-coding the output encoding to UTF-8
as follows:
> {code}
> Writer writer = this.outputFile == null ? new OutputStreamWriter(System.out) : new OutputStreamWriter(new FileOutputStream(this.outputFile), UTF8);
> {code}
> This way, the output file encoding will not be affected by environments.
> If this proposal is accepted, a similar fix should be applied to the following files:
> - ./core/src/main/java/org/apache/mahout/classifier/sgd/ModelSerializer.java
> - ./core/src/test/java/org/apache/mahout/fpm/pfpgrowth/PFPGrowthTest.java
> - ./examples/src/main/java/org/apache/mahout/classifier/sgd/TrainLogistic.java
> - ./examples/src/main/java/org/apache/mahout/clustering/display/DisplaySpectralKMeans.java
> - ./utils/src/main/java/org/apache/mahout/clustering/lda/LDAPrintTopics.java
> - ./utils/src/main/java/org/apache/mahout/utils/SequenceFileDumper.java
> - ./utils/src/main/java/org/apache/mahout/utils/clustering/ClusterDumper.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/VectorDumper.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/arff/Driver.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/lucene/ClusterLabels.java
> - ./utils/src/main/java/org/apache/mahout/utils/vectors/lucene/Driver.java
> Hope not many folks prefer ISO-8859-1 or other 'legacy' character sets.
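
For illustration only, here is a minimal, self-contained sketch of the behavior described above. It is not the attached patch: the class name Utf8WriterSketch and both helper methods are hypothetical, and it assumes Java 7 or later for java.nio.charset.StandardCharsets and java.nio.file.Files. The first helper imitates a writer bound to an ASCII-only default charset (roughly what LANG=C tends to produce), and the second writes through an OutputStreamWriter with the charset named explicitly, so the result does not depend on LANG/LC_ALL.

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; not part of Mahout.
public class Utf8WriterSketch {

  // Encodes and decodes the text as US-ASCII. String.getBytes(Charset)
  // replaces every unmappable character with the charset's replacement
  // byte, which is '?' for US-ASCII.
  static String roundTripAscii(String text) {
    byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
    return new String(bytes, StandardCharsets.US_ASCII);
  }

  // Writes the text to the file as UTF-8, independent of the platform
  // default charset.
  static void writeUtf8(Path file, String text) throws IOException {
    try (Writer writer =
        new OutputStreamWriter(Files.newOutputStream(file), StandardCharsets.UTF_8)) {
      writer.write(text);
    }
  }

  public static void main(String[] args) throws IOException {
    String term = "l\u00e4zy"; // "lazy" with a-umlaut, as in the tweaked test case

    // Non-ASCII characters are lost when the charset cannot represent them.
    System.out.println(roundTripAscii(term));

    // With an explicit UTF-8 writer the text survives a write/read cycle.
    Path file = Files.createTempFile("cluster-dump", ".txt");
    writeUtf8(file, term);
    String readBack = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    System.out.println(readBack.equals(term));
    Files.delete(file);
  }
}
```

Note that the System.out branch in the snippet quoted above would still use the platform default encoding; only the file branch is pinned to UTF-8.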

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

