Question ...

  Is there a way to optimize the first three steps of a Kylin build?

  Total build time of a development cube is 626 minutes; the breakdown by step:
  1.  87  min - Create Intermediate Flat Hive Table
  2.  207 min - Redistribute Flat Hive Table
  3.  248 min - Extract Fact Table Distinct Columns
  4.  0   min
  5.  0   min
  6.  62  min - Build Cube with Spark
  7.  19  min - Convert Cuboid Data to HFile
  8.  0   min
  9.  0   min
  10. 0   min
  11. 0   min
   The data set consists of summary files (~35M records) and detail files (~4B records, 40GB compressed).

   The final data requires a join, which is handled in a view within Hive, so I do expect a performance cost there.
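
   To be concrete, by "handled in a view" I mean something along these lines, which could in principle be materialized ahead of time (table and column names below are placeholders, not the actual schema):

   ```sql
   -- Hypothetical pre-materialization of the join the Hive view performs,
   -- writing the result to an ORC-backed table before the Kylin build starts.
   CREATE TABLE staged_fact
   STORED AS ORC
   AS
   SELECT d.*, s.summary_metric
   FROM detail d
   JOIN summary s
     ON d.summary_id = s.summary_id;
   ```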

   However, staging the data in other ways (loading into SequenceFile/ORC tables vs. an external table over bz2 files) yields no net gain.

   In other words, pre-processing the data externally can make Kylin itself run a little faster, but the overall time from absolute start to finish is still ~600 minutes.
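
   For reference, these are the kinds of standard Hive/Hadoop session settings that (as I understand it) Kylin can pick up from conf/kylin_hive_conf.xml and apply to the flat-table steps; whether any of them move the needle here is part of the question:

   ```sql
   -- Standard Hive/MapReduce session settings; gains will vary by cluster.
   SET hive.exec.compress.output=true;       -- compress the flat-table output
   SET mapreduce.map.output.compress=true;   -- compress shuffle/intermediate data
   SET hive.merge.mapfiles=true;             -- merge small map-only output files
   SET hive.merge.mapredfiles=true;          -- merge small map-reduce output files
   ```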

   Steps 1/2 seem redundant given how my data is structured; the HQL/SQL commands Kylin sends to Hive could be run before the build process.

   Is it possible to optimize steps 1/2/3? Is it possible to skip steps 1/2 and jump to step 3 if the data was staged as-needed/correctly beforehand?
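
   For what it's worth, Kylin does appear to expose a switch for step 2; I'm not sure whether it's safe to flip here, since the redistribute step exists to balance file sizes (and to shard on a configured column if one is set):

   ```properties
   # kylin.properties -- disable the "Redistribute Flat Hive Table" step (step 2).
   # Caveat: this skips the row-count/rebalance pass, so downstream mappers
   # may see skewed input splits.
   kylin.source.hive.redistribute-flat-table=false
   ```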

   My guess is the answers are mostly 'no' (which is fine), but I thought I'd ask.

   (The test lab is being doubled in size today, so I'm not ultimately worried, but I'm looking for improvements beyond just adding hardware and networking.)

Thanks! J