apex-users mailing list archives

From Sanjay Pujare <san...@datatorrent.com>
Subject Re: HiveOutputModule creating extra directories, than specified, while saving data into HDFS
Date Tue, 16 May 2017 23:39:48 GMT
Vivek,

Take a look at HiveOutputModule.populateDAG() (
https://github.com/apache/apex-malhar/blob/master/hive/src/main/java/org/apache/apex/malhar/hive/HiveOutputModule.java
)

This method populates a sub-DAG containing fsRolling (an FSPojoToHiveOperator) and a
hiveStore, using the file path you supplied (/common/data/test/accessCounts).

If you look at the code
in com.datatorrent.contrib.hive.AbstractFSRollingOutputOperator.setup(OperatorContext)
(the superclass of FSPojoToHiveOperator), it does construct a path for
rolling temporary files along the lines you have observed. The final
output, however, should end up in the output path you specified if you wait
long enough for those files to be finalized.
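To illustrate the behavior described above, here is a minimal sketch of how such a
rolling-file working path could be derived from the configured filePath, the YARN
application id, and the physical operator id. This is an illustration only, not the
actual Malhar code; the method and variable names here are assumptions, and the
application id shown is hypothetical.

```java
// Sketch: why directories like
// /common/data/test/accessCounts/<application_id>/10 appear.
// Temporary rolling files are written under a subdirectory derived from
// the configured filePath, the YARN application id, and the operator id.
public class RollingPathSketch {

    // Builds <filePath>/<applicationId>/<operatorId>
    // (illustrative helper, not a Malhar API)
    static String rollingOutputPath(String configuredFilePath,
                                    String applicationId,
                                    int operatorId) {
        return configuredFilePath + "/" + applicationId + "/" + operatorId;
    }

    public static void main(String[] args) {
        // hypothetical YARN application id
        System.out.println(rollingOutputPath(
                "/common/data/test/accessCounts",
                "application_1494963000000_0001",
                10));
        // prints /common/data/test/accessCounts/application_1494963000000_0001/10
    }
}
```

This matches the observed layout: the extra path segments are the application id and
the operator id (10 in your listing), not anything taken from your configuration.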



On Tue, May 16, 2017 at 12:53 PM, bhidevivek <bhide.vivek@gmail.com> wrote:

> Hi All, I am trying to use HiveOutputModule to insert ingested data into a
> Hive external table. The table is already created, with its location set to
> the value of the dt.application.<app_name>.operator.hiveOutput.prop.filePath
> property and accessdate as the partition column. With the configurations
> below in the property file, the HDFS file structure I am expecting is:
>
> /common/data/test/accessCounts
>     |
>     ----- accessdate=2017-05-15
>     |        |
>     |        ------- <fil1>
>     |        ------- <fil2>
>     ----- accessdate=2017-05-16
>              |
>              ------- <fil1>
>              ------- <fil2>
>
> but the actual structure looks like:
>
> /common/data/test/accessCounts/<yarn_application_id_for_apex_ingest_appl>/10
>     |
>     ----- 2017-05-15
>     |        |
>     |        ------- <fil1>
>     |        ------- <fil2>
>     ----- 2017-05-16
>              |
>              ------- <fil1>
>              ------- <fil2>
>
> Questions
> 1. Why are the yarn_application_id and some other extra directories created,
> when they are nowhere specified in the config?
> 2. If I want to achieve the structure above, what other configurations will
> I need to set?
>
> HiveOutputModule Configs
> ==================
>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.filePath</name>
>     <value>/common/data/test/accessCounts</value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.databaseUrl</name>
>     <value><jdbc_url></value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.databaseDriver</name>
>     <value>org.apache.hive.jdbc.HiveDriver</value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.tablename</name>
>     <value><hive table name where records need to be inserted></value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.hivePartitionColumns</name>
>     <value>{accessdate}</value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.password</name>
>     <value><hive connection password></value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.userName</name>
>     <value><hive connection user></value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.hiveColumns</name>
>     <value>{col1,col2,col3,col4}</value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.hiveColumnDataTypes</name>
>     <value>{STRING,STRING,STRING,STRING}</value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.hivePartitionColumnDataTypes</name>
>     <value>{STRING}</value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.expressionsForHiveColumns</name>
>     <value>{"getCol1()","getCol2()","getCol3()","getCol4()"}</value>
> </property>
> <property>
>     <name>dt.application.<app_name>.operator.hiveOutput.prop.expressionsForHivePartitionColumns</name>
>     <value>{"getAccessdate()"}</value>
> </property>
>
>
>
> --
> View this message in context: http://apache-apex-users-list.78494.x6.nabble.com/HiveOutputModule-creating-extra-directories-than-specified-while-saving-data-into-HDFS-tp1620.html
> Sent from the Apache Apex Users list mailing list archive at Nabble.com.
>
