kylin-dev mailing list archives

From "Congling Xia (Jira)" <>
Subject [jira] [Created] (KYLIN-4320) number of replicas of Cuboid files cannot be configured for Spark engine
Date Mon, 30 Dec 2019 08:08:00 GMT
Congling Xia created KYLIN-4320:

             Summary: number of replicas of Cuboid files cannot be configured for Spark engine
                 Key: KYLIN-4320
             Project: Kylin
          Issue Type: Bug
          Components: Job Engine
            Reporter: Congling Xia
         Attachments: cuboid_replications.png

Hi team. I tried to change `dfs.replication` to 3 by adding the config override `kylin.engine.spark-conf.spark.hadoop.dfs.replication=3`. Then I got a strange result: the number of replicas of the cuboid files varies even though they are at the same level (see the attached cuboid_replications.png).


I guess it is due to the conflicting settings in SparkUtil:

    public static void modifySparkHadoopConfiguration(SparkContext sc) throws Exception {
        sc.hadoopConfiguration().set("dfs.replication", "2"); // cuboid intermediate files, replication=2
        sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress", "true");
        sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");
        sc.hadoopConfiguration().set("mapreduce.output.fileoutputformat.compress.codec", "");
    }
It may also be a bug in Spark property precedence. After checking the Spark documentation, it seems that some programmatically set properties may not take effect, and setting them programmatically is not the recommended way to configure a Spark job.
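If the hard-coded value in SparkUtil is meant only as a default, one possible fix is to apply it only when the operator has not configured a replication factor. Below is a minimal sketch of that "default only when unset" behavior; it uses `java.util.Properties` as a stand-in for the Hadoop `Configuration`, and the class and method names are hypothetical, not Kylin code:

```java
import java.util.Properties;

// Hypothetical sketch, not actual Kylin code: apply the replication default
// only when no value has been configured, instead of overwriting it.
class ReplicationDefaults {
    // Returns the replication factor that would end up in the job config.
    static String effectiveReplication(Properties hadoopConf, String defaultValue) {
        // putIfAbsent keeps any value already supplied by the operator, e.g. via
        // kylin.engine.spark-conf.spark.hadoop.dfs.replication
        hadoopConf.putIfAbsent("dfs.replication", defaultValue);
        return hadoopConf.getProperty("dfs.replication");
    }
}
```

With this behavior a user-supplied `dfs.replication=3` would survive, while clusters with no override would still get the current default of 2. Hadoop's `Configuration` class has an analogous `setIfUnset` method that could serve the same purpose in SparkUtil.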


Anyway, cuboid files may survive for weeks until they expire or are merged, so the configuration rewrite in `org.apache.kylin.engine.spark.SparkUtil#modifySparkHadoopConfiguration` makes those files less reliable.

Is there any way to force cuboid files to keep 3 replicas? Or shall we remove the hard-coded setting in SparkUtil so that `kylin.engine.spark-conf.spark.hadoop.dfs.replication` works properly?

This message was sent by Atlassian Jira
