spark-user mailing list archives

From Nirav Patel <>
Subject API to study key cardinality and distribution and other important statistics about data at certain stage
Date Fri, 13 May 2016 19:04:24 GMT

The problem is that every time a job fails or performs poorly at a certain stage, you
need to study your data distribution just before THAT stage. An overall look
at the input data set doesn't help much when there are so many transformations
going on in the DAG. I always end up writing complicated typed analysis code
alongside the actual job to identify this. Shouldn't there be a Spark API to
examine this in a better way? After all, Spark does go through all the records
(in most cases) to perform a transformation or action, so as a side job it
could gather statistics as well when instructed.
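To make the request concrete: the kind of per-stage check I mean can today be hacked together by counting keys just before the problem stage (e.g. with `rdd.countByKey()` or `df.groupBy(key).count()` in Spark) and then summarizing the counts. A minimal sketch of the summary step in plain Python, over hypothetical sample data (the function name and fields are illustrative, not any existing Spark API):

```python
from collections import Counter

def key_stats(pairs):
    """Summarize key cardinality and skew for (key, value) records,
    mimicking what you'd compute from the dict returned by
    rdd.countByKey() in Spark. Illustrative helper, not a Spark API."""
    counts = Counter(k for k, _ in pairs)
    total = sum(counts.values())
    top = counts.most_common(5)
    return {
        "distinct_keys": len(counts),
        "total_records": total,
        # Fraction of all records held by the hottest key; values near 1.0
        # signal the skew that makes one task straggle or OOM.
        "max_key_share": top[0][1] / total,
        "top_keys": top,
    }

# Hypothetical skewed data set: key "a" dominates.
sample = [("a", 1)] * 90 + [("b", 1)] * 7 + [("c", 1)] * 3
stats = key_stats(sample)
print(stats["distinct_keys"], stats["max_key_share"], stats["top_keys"][0])
```

The point is that Spark already touches every record at that stage, so gathering exactly these numbers as an instrumented side effect, rather than via a separate hand-written job, is what an API like the one I'm asking for would do.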
