To: dev@kylin.incubator.apache.org, dev@calcite.incubator.apache.org
Subject: Kylin 0.7.1 - Failed to build a cube
Date: Tue, 07 Jul 2015 12:42:23 +0200
From: gaspare.maria@gfmintegration.it

Hi,

I am trying to create a cube from a star schema built on Hive external
tables (an example is below) stored as TEXTFILE (CSV):

CREATE EXTERNAL TABLE IF NOT EXISTS USERS_TABLE (
  uid  INT,
  name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\073'
  LINES TERMINATED BY '\012'
STORED AS TEXTFILE
LOCATION '/data/users';

The CSV files are produced from Spark RDDs, so they are saved as
part-xxxx files.
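For reference, the files are written along these lines (a simplified
sketch, not the actual job; the sample data and the app name are made
up):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("users-export"))
  // Each RDD partition becomes one part-xxxxx file under /data/users,
  // plus an empty _SUCCESS marker.
  sc.parallelize(Seq((1, "alice"), (2, "bob")))
    .map { case (uid, name) => s"$uid;$name" } // ';' matches FIELDS TERMINATED BY '\073'
    .saveAsTextFile("/data/users")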
Below is the HDFS listing:

hdfs dfs -ls /data/users
Found 12 items
-rw-r--r--   3 hdfs hdfs       0 2015-07-07 12:05 /data/users/_SUCCESS
-rw-r--r--   3 hdfs hdfs 3699360 2015-07-07 12:05 /data/users/part-00000
-rw-r--r--   3 hdfs hdfs 3694740 2015-07-07 12:05 /data/users/part-00001
-rw-r--r--   3 hdfs hdfs 3685374 2015-07-07 12:05 /data/users/part-00002
-rw-r--r--   3 hdfs hdfs 3719646 2015-07-07 12:05 /data/users/part-00003
-rw-r--r--   3 hdfs hdfs 3682476 2015-07-07 12:05 /data/users/part-00004
-rw-r--r--   3 hdfs hdfs 3679956 2015-07-07 12:05 /data/users/part-00005
-rw-r--r--   3 hdfs hdfs 3700242 2015-07-07 12:05 /data/users/part-00006
-rw-r--r--   3 hdfs hdfs 3672186 2015-07-07 12:05 /data/users/part-00007
-rw-r--r--   3 hdfs hdfs 3682350 2015-07-07 12:05 /data/users/part-00008
-rw-r--r--   3 hdfs hdfs 3680292 2015-07-07 12:05 /data/users/part-00009
-rw-r--r--   3 hdfs hdfs 3697722 2015-07-07 12:05 /data/users/part-00010

The cube build job fails when it tries to build the dimension
dictionary, with the exception below (it seems that the Hive table
data directory MUST contain one and only one file):

java.lang.IllegalStateException: Expect 1 and only 1 non-zero file under hdfs://gas.gfmintegration.it:8020/data/cdr/bb/dimensions/users, but find 11
	at org.apache.kylin.dict.lookup.HiveTable.findOnlyFile(HiveTable.java:123)
	at org.apache.kylin.dict.lookup.HiveTable.computeHDFSLocation(HiveTable.java:107)
	at org.apache.kylin.dict.lookup.HiveTable.getHDFSLocation(HiveTable.java:83)
	at org.apache.kylin.dict.lookup.HiveTable.getFileTable(HiveTable.java:76)
	at org.apache.kylin.dict.lookup.HiveTable.getSignature(HiveTable.java:71)
	at org.apache.kylin.dict.DictionaryManager.buildDictionary(DictionaryManager.java:164)
	at org.apache.kylin.cube.CubeManager.buildDictionary(CubeManager.java:154)
	at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:53)
	at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:42)
	at org.apache.kylin.job.hadoop.dict.CreateDictionaryJob.run(CreateDictionaryJob.java:53)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
	at org.apache.kylin.job.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:63)
	at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
	at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:50)
	at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
	at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:132)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)

result code:2

Do you have any indications on how to create a proper Hive star schema
for Kylin? I would like to use external tables (stored as CSV, Parquet
files, or HBase) because I need to process the same data from Spark as
well.

Thanks in advance.

BR,
--
gas
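P.S. One workaround I am considering (an untested sketch that reuses
the write job above; "/data/users_single" is a made-up path) is to
collapse the RDD to a single partition before saving, so that the
table directory contains exactly one non-empty part file:

  // coalesce(1) yields a single part-00000 file, at the cost of
  // funnelling the write through one task
  sc.parallelize(Seq((1, "alice"), (2, "bob")))
    .map { case (uid, name) => s"$uid;$name" }
    .coalesce(1)
    .saveAsTextFile("/data/users_single")

Alternatively, the existing part files could be merged after the fact
with "hdfs dfs -getmerge" and re-uploaded. Would either of these be a
reasonable way to satisfy Kylin's one-file expectation?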