spark-issues mailing list archives

From "yunzhi.lyz (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-6316) add a parameter for SparkContext(conf).textFile() method , support for multi-language hdfs file , e.g. "gbk"
Date Mon, 16 Mar 2015 12:55:38 GMT

    [ https://issues.apache.org/jira/browse/SPARK-6316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363161#comment-14363161 ]

yunzhi.lyz commented on SPARK-6316:
-----------------------------------

I gave this a try.

Reading a file with a non-UTF-8 encoding (e.g. GBK), code example:

    sc.hadoopFile("/inputdir", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 5)
      .map(pair => new String(pair._2.getBytes(), 0, pair._2.getLength(), "gbk"))

Writing a file with a non-UTF-8 encoding, code example:

    file.map(x => (NullWritable.get(), new Text(String.valueOf(x).getBytes("gbk"))))
      .saveAsHadoopFile[TextOutputFormat[NullWritable, Text]]("/output")

RESOLVED: this works around the problem that sc.textFile and rdd.saveAsTextFile do not support non-UTF-8 encodings.
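The core of the workaround above is re-decoding the Text record's raw bytes with an explicit charset instead of Text.toString's UTF-8 default (note that Text.getBytes() may return a buffer longer than the valid data, hence the 0..getLength() range). A minimal standalone sketch of that decoding step, no Spark required (the object and method names here are illustrative, not part of any Spark API):

```scala
object GbkDecodeDemo {
  // Mirrors the map() step above: decode `length` raw bytes with the given
  // charset rather than the UTF-8 default used by Text.toString.
  def decode(raw: Array[Byte], length: Int, encoding: String): String =
    new String(raw, 0, length, encoding)

  def main(args: Array[String]): Unit = {
    val original = "\u4e2d\u6587"            // two CJK characters
    val gbkBytes = original.getBytes("gbk")  // 2 bytes per character in GBK
    // Decoding with the matching charset round-trips cleanly...
    println(decode(gbkBytes, gbkBytes.length, "gbk") == original)  // true
    // ...while decoding GBK bytes as UTF-8 yields replacement characters.
    println(new String(gbkBytes, "utf-8") == original)             // false
  }
}
```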
 


> add a parameter for SparkContext(conf).textFile() method, support for multi-language hdfs file, e.g. "gbk"
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6316
>                 URL: https://issues.apache.org/jira/browse/SPARK-6316
>             Project: Spark
>          Issue Type: New Feature
>         Environment: linux   
> LANG=en_US.UTF-8
>            Reporter: yunzhi.lyz
>
>         Add a parameter to the SparkContext(conf).textFile() method to support HDFS files in other encodings.
>   
>        e.g.     val file = new SparkContext(conf).textFile(args(0), 10, "gbk")
> Modify the code:
>        
>       org.apache.spark.SparkContext
>      +  def defaultEncoding: String = "utf-8"
>     
>      --  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
>     hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
>       minPartitions).map(pair => pair._2.toString).setName(path)
>   }
>      ++  def textFile(path: String, minPartitions: Int = defaultMinPartitions, encoding: String = defaultEncoding): RDD[String] = {
>     hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
>       minPartitions).map(pair => new String(pair._2.getBytes(), 0, pair._2.getLength(), encoding)).setName(path)
>   }
>    
>        
>         



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

