spark-user mailing list archives

From Yana Kadiyska <yana.kadiy...@gmail.com>
Subject Re: SQLcontext changing String field to Long
Date Sun, 11 Oct 2015 19:29:55 GMT
In our case we do not actually need partition inference, so the workaround
was easy: instead of laying out the paths as rootpath/batch_id=333/..., we
changed them to rootpath/333/.... This works for us because we compute
the set of HDFS paths manually and then register a DataFrame with a
SQLContext.
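
As a rough illustration of why dropping the batch_id= prefix avoids the
problem, here is a toy Java model of how name=value path segments can turn
into typed partition columns. It is a sketch only and not Spark's actual
implementation: the class PartitionPathModel, its methods, and the type-name
strings are hypothetical, and the real inference rules may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of partition discovery; NOT Spark's actual implementation.
final class PartitionPathModel {

    // Guess a type name for a raw partition value, preferring numeric types.
    static String inferType(String raw) {
        try {
            Integer.parseInt(raw);
            return "IntegerType";
        } catch (NumberFormatException ignored) {
            // Not an int; try a wider numeric type next.
        }
        try {
            Long.parseLong(raw);
            return "LongType";
        } catch (NumberFormatException ignored) {
            // Not a long either; fall back to string.
        }
        return "StringType";
    }

    // Collect every "name=value" path segment as a partition column
    // mapped to its inferred type name.
    static Map<String, String> partitionColumns(String path) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                columns.put(segment.substring(0, eq),
                            inferType(segment.substring(eq + 1)));
            }
        }
        return columns;
    }
}
```

In this model a path such as rootpath/batch_id=333/part.parquet (the file
name is made up) yields a numeric batch_id column, while rootpath/333/part.parquet
yields no partition column at all, so nothing can clash with the string
batch_id stored in the file itself.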

But it seems like there is a nicer solution:
http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery

Notice that the data types of the partitioning columns are
automatically inferred. Currently, numeric data types and string type
are supported. Sometimes users may not want to automatically infer the
data types of the partitioning columns. For these use cases, the
automatic type inference can be configured by
spark.sql.sources.partitionColumnTypeInference.enabled, which is
default to true. When type inference is disabled, string type will be
used for the partitioning columns.
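
A minimal sketch of turning the inference off from Java, assuming the Spark
1.5 API (SQLContext.setConf and sqlContext.read().parquet(...)) and an
existing SQLContext like the one in the original question. The class name,
method name, and rootPath parameter are hypothetical, and I have not verified
this against the exact setup discussed in this thread.

```java
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

// Hypothetical helper; requires Spark 1.5 on the classpath.
final class DisablePartitionTypeInference {

    // Disable partition column type inference, then read the Parquet data,
    // so that partition columns such as batch_id should come back as strings.
    static DataFrame readWithStringPartitions(SQLContext sqlContext, String rootPath) {
        sqlContext.setConf(
            "spark.sql.sources.partitionColumnTypeInference.enabled", "false");
        return sqlContext.read().parquet(rootPath);
    }
}
```

The setting has to be applied before the data is read, since the partition
columns are typed when the paths are discovered.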


On Sat, Oct 10, 2015 at 9:52 PM, shobhit gupta <smartshobhu@gmail.com>
wrote:

> here is what the df.schema.toString() prints.
>
> DF Schema is ::StructType(StructField(batch_id,StringType,true))
>
> I think you nailed the problem: this field is part of our HDFS file
> path. We have more or less partitioned our data into one folder per batch_id.
>
> How did you get around it?
>
> Thanks for the help. :)
>
> On Sat, Oct 10, 2015 at 7:55 AM, Yana Kadiyska <yana.kadiyska@gmail.com>
> wrote:
>
>> Can you show the output of df.printSchema? Just a guess, but I think I ran
>> into something similar with a column that was part of a path in Parquet.
>> E.g. we had an account_id in the Parquet file data itself, which was of type
>> string, but we also named the files in the following manner:
>> /somepath/account_id=.../file.parquet. Since Spark uses the paths for
>> partition discovery, it was actually inferring that account_id is a numeric
>> type, and upon reading the data we ran into the exception you're describing
>> (this is in Spark 1.4).
>>
>> On Fri, Oct 9, 2015 at 7:55 PM, Abhisheks <smartshobhu@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I have saved my records in Parquet format and am using Spark 1.5. But
>>> when I try to fetch the columns it throws the exception
>>> *java.lang.ClassCastException: java.lang.Long cannot be cast to
>>> org.apache.spark.unsafe.types.UTF8String*.
>>>
>>> This field is saved as a String when writing the Parquet file. Here is the
>>> sample code and its output:
>>>
>>> logger.info("troubling thing is ::" +
>>> sqlContext.sql(fileSelectQuery).schema().toString());
>>> DataFrame df= sqlContext.sql(fileSelectQuery);
>>> JavaRDD<Row> rdd2 = df.toJavaRDD();
>>>
>>> The first line of the code (the logger call) prints this:
>>> troubling thing is ::StructType(StructField(batch_id,StringType,true))
>>>
>>> But immediately after that, the exception is thrown.
>>>
>>> Any idea why it is treating the field as a Long? (Yes, one unique thing
>>> about the column is that it is always a number, e.g. a timestamp.)
>>>
>>> Any help is appreciated.
>>>
>>
>
>
> --
>
>
>
>
> *Regards , Shobhit Gupta.*
> *"If you salute your job, you have to salute nobody. But if you pollute
> your job, you have to salute everybody..!!"*
>
