tajo-user mailing list archives

From Hyunsik Choi <hyun...@apache.org>
Subject Re: USING Parquet
Date Sat, 23 Aug 2014 17:27:23 GMT
Hi Chris,

Currently, the Parquet file format does not support the Timestamp, Date,
and Time data types; the Parquet community is still working on them. So
this limitation is not specific to Tajo — other systems cannot store
these types in Parquet either. For now, if you must use the Parquet
format, you need to use the TEXT data type to hold timestamp values. You
can still get the same functionality by applying date/time functions
when you query the data.
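
For example, against your schema it could look roughly like this (just a
sketch; I assume here that Tajo's to_char(timestamp, format) and
to_timestamp(text, format) functions are available in your version —
please check the function list — and the same workaround applies to the
date columns):

CREATE TABLE dfkklocks_hist_internal
(
  validfrom text,   -- was timestamp
  validthru text,   -- was timestamp
  client text,
  ...               -- remaining columns unchanged
) using parquet;

-- store timestamps as formatted text
INSERT INTO dfkklocks_hist_internal
SELECT to_char(validfrom, 'YYYY-MM-DD HH24:MI:SS'),
       to_char(validthru, 'YYYY-MM-DD HH24:MI:SS'),
       client,
       ...
FROM dfkklocks_hist;

-- convert back to timestamp when querying
SELECT to_timestamp(validfrom, 'YYYY-MM-DD HH24:MI:SS') AS validfrom
FROM dfkklocks_hist_internal;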

Thanks,
Hyunsik

On Sun, Aug 24, 2014 at 12:59 AM, Christian Schwabe
<Christian.Schwabe@gmx.com> wrote:
> Hello everyone,
>
> I now wanted to run some tests with large test data (a .csv file > 5 GB).
> Unfortunately it fails again pretty early. I have defined an EXTERNAL TABLE
> as follows:
>
> CREATE EXTERNAL TABLE dfkklocks_hist
> (
>   validfrom timestamp,
>   validthru timestamp,
>   client text,
>   loobj1 text,
>   lotyp text,
>   proid text,
>   lockr text,
>   fdate date,
>   tdate date,
>   gpart text,
>   vkont text,
>   cond_loobj text,
>   actkey text,
>   uname text,
>   adatum date,
>   azeit text,
>   protected text,
>   laufd date,
>   laufi text
> )
> using csv with ('csvfile.delimiter'='~') location 'file:path/to/csv/file';
>
> Then I create a table with the suffix *_internal and the parquet type as
> follows:
>
> CREATE TABLE dfkklocks_hist_internal
> (
>   validfrom timestamp,
>   validthru timestamp,
>   client text,
>   loobj1 text,
>   lotyp text,
>   proid text,
>   lockr text,
>   fdate date,
>   tdate date,
>   gpart text,
>   vkont text,
>   cond_loobj text,
>   actkey text,
>   uname text,
>   adatum date,
>   azeit text,
>   protected text,
>   laufd date,
>   laufi text
> ) using parquet;
>
>
> This csv-file contains records such as these:
> 2014-08-19 21:03:32.78~9999-12-31
> 23:59:59.999~200~0000000000530010000053~06~01~5~2005-12-31~9999-12-31~0010000053~000000000053~~~FREITAG~2006-06-01~125611~~1800-01-01~
>
> Now I would like to insert the content of the csv-file into the table using
> parquet as follows:
> contract> INSERT INTO dfkklocks_hist_internal SELECT * FROM dfkklocks_hist;
> ERROR: Cannot convert Tajo type: TIMESTAMP
> java.lang.RuntimeException: Cannot convert Tajo type: TIMESTAMP
> at
> org.apache.tajo.storage.parquet.TajoSchemaConverter.convertColumn(TajoSchemaConverter.java:191)
> at
> org.apache.tajo.storage.parquet.TajoSchemaConverter.convert(TajoSchemaConverter.java:150)
> at
> org.apache.tajo.storage.parquet.TajoWriteSupport.<init>(TajoWriteSupport.java:54)
> at
> org.apache.tajo.storage.parquet.TajoParquetWriter.<init>(TajoParquetWriter.java:80)
> at
> org.apache.tajo.storage.parquet.ParquetAppender.init(ParquetAppender.java:75)
> at
> org.apache.tajo.engine.planner.physical.StoreTableExec.init(StoreTableExec.java:69)
> at org.apache.tajo.worker.Task.run(Task.java:423)
> at org.apache.tajo.worker.TaskRunner$1.run(TaskRunner.java:425)
> at java.lang.Thread.run(Thread.java:745)
>
> In TajoSchemaConverter.java it looks as if it is not possible to use a
> Tajo timestamp with Parquet. Am I right with that assumption?
> Changing the timestamp value (see the example record above) also did not
> lead to success. At first I assumed that the timestamp was simply not
> valid, but timestamp values such as 1970-00-00 00:00:00.000 or
> 1971-01-01 01:01:01.000 showed no change in behavior.
> Are my conclusions correct so far? Is this an outstanding bug? Am I perhaps
> doing something wrong? Is there any other approach, not yet mentioned here,
> that could get me to my goal?
>
> private Type convertColumn(Column column) {
>     TajoDataTypes.Type type = column.getDataType().getType();
>     switch (type) {
>       case BOOLEAN:
>         return primitive(column.getSimpleName(),
>                          PrimitiveType.PrimitiveTypeName.BOOLEAN);
>       case BIT:
>       case INT2:
>       case INT4:
>         return primitive(column.getSimpleName(),
>                          PrimitiveType.PrimitiveTypeName.INT32);
>       case INT8:
>         return primitive(column.getSimpleName(),
>                          PrimitiveType.PrimitiveTypeName.INT64);
>       case FLOAT4:
>         return primitive(column.getSimpleName(),
>                          PrimitiveType.PrimitiveTypeName.FLOAT);
>       case FLOAT8:
>         return primitive(column.getSimpleName(),
>                          PrimitiveType.PrimitiveTypeName.DOUBLE);
>       case CHAR:
>       case TEXT:
>         return primitive(column.getSimpleName(),
>                          PrimitiveType.PrimitiveTypeName.BINARY,
>                          OriginalType.UTF8);
>       case PROTOBUF:
>         return primitive(column.getSimpleName(),
>                          PrimitiveType.PrimitiveTypeName.BINARY);
>       case BLOB:
>         return primitive(column.getSimpleName(),
>                          PrimitiveType.PrimitiveTypeName.BINARY);
>       case INET4:
>       case INET6:
>         return primitive(column.getSimpleName(),
>                          PrimitiveType.PrimitiveTypeName.BINARY);
>       default:
>         throw new RuntimeException("Cannot convert Tajo type: " + type);
>     }
>   }
>
> I'm really thankful that there is a community like you guys out there that
> helps to work through such errors together.
> Have a nice weekend.
>
> Best regards,
> Chris
