spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ian O'Connell" <ianoconn...@gmail.com>
Subject sparkSQL thread safe?
Date Thu, 10 Jul 2014 22:15:44 GMT
Had a few quick questions...

Just wondering if right now spark sql is expected to be thread safe on
master?

doing a simple hadoop file -> RDD -> schema RDD -> write parquet

will fail in reflection code if i run these in a thread pool.

The SparkSqlSerializer, seems to create a new Kryo instance each time it
wants to serialize anything. I got a huge speedup when I had any
non-primitive type in my SchemaRDD using the ResourcePool's from Chill for
providing the KryoSerializer to it. (I can open an RB if there is some
reason not to re-use them?)

====

With the Distinct Count operator there is no map-side operations, and a
test to check for this. Is there any reason not to do a map side combine
into a set and then merge the sets later? (similar to the approximate
distinct count operator)

===

Another thing while i'm mailing.. the 1.0.1 docs have a section like:
"
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To
work around this limit, // you can use custom classes that implement the
Product interface.
"

Which sounds great, we have lots of data in thrift.. so via scrooge (
https://github.com/twitter/scrooge), we end up with ultimately instances of
traits which implement product. Though the reflection code appears to look
for the constructor of the class and base the types based on those
parameters?


Ian.

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message