spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Hamel Kothari <hamelkoth...@gmail.com>
Subject Skipping Type Conversion and using InternalRows for UDF
Date Fri, 15 Apr 2016 15:44:53 GMT
Hi all,

So we have these UDFs which take <1ms to operate and we're seeing pretty
poor performance around them in practice, the overhead being >10ms for the
projections (this data is deeply nested with ArrayTypes and MapTypes so
that could be the cause). Looking at the logs and code for ScalaUDF, I
noticed that there are a series of projections which take place before and
after in order to make the Rows safe and then unsafe again. Is there any
way to opt out of this and input/return InternalRows to skip the
performance hit of the type conversion? It doesn't immediately appear to be
possible but I'd like to make sure that I'm not missing anything.

I suspect we could make this possible by checking if typetags in the
register function are all internal types, if they are, passing a false
value for "needs[Input|Output]Conversion" to ScalaUDF and then in ScalaUDF
checking for that flag to figure out if the conversion process needs to
take place. We're still left with the issue of missing a schema in the case
of outputting InternalRows, but we could expose the DataType parameter
rather than inferring it in the register function. Is there anything else
in the code that would prevent this from working?

Regards,
Hamel

Mime
View raw message