spark-user mailing list archives

From Francis Lau <francis....@smartsheet.com>
Subject Re: How to specify column type when saving DataFrame as parquet file?
Date Fri, 14 Aug 2015 16:03:41 GMT
Jyun-Fan,

Here is how I have been doing it. I found that I needed to define the
schema up front and pass it in when loading the JSON file, rather than
letting Spark infer the types.

Francis

from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, TimestampType, DoubleType)

# Define schema
upSchema = StructType([
  StructField("field 1", StringType(), True),
  StructField("field 2", LongType(), True),
  StructField("field 3", TimestampType(), True),
  StructField("field 4", DoubleType(), True)
  ])

# Load JSON file with schema
filePath = "YourData.json"
DF = sqlContext.read.schema(upSchema).json(filePath)

# Save to Parquet
savePath = "ConvertedData.parquet"

# Adjust the repartition number below based on the size of the data. I try
# to keep parquet files under 500 MB and also avoid many small files,
# i.e. hundreds of 10 MB files.
DF.repartition(1).write.parquet(savePath)

On Fri, Aug 14, 2015 at 7:29 AM, Raghavendra Pandey <
raghavendra.pandey@gmail.com> wrote:

> I think you can try dataFrame create api that takes RDD[Row] and Struct
> type...
> On Aug 11, 2015 4:28 PM, "Jyun-Fan Tsai" <jftsai@appier.com> wrote:
>
>> Hi all,
>> I'm using Spark 1.4.1.  I create a DataFrame from json file.  There is
>> a column C that all values are null in the json file.  I found that
>> the datatype of column C in the created DataFrame is string.  However,
>> I would like to specify the column as Long when saving it as parquet
>> file.  What should I do to specify the column type when saving parquet
>> file?
>>
>> Thank you,
>> Jyun-Fan Tsai


-- 
*Francis Lau* | *Smartsheet*
Senior Director of Product Intelligence
*c* 425-830-3889 (call/text)
francis.lau@smartsheet.com <jason.teravest@smartsheet.com>
