tajo-issues mailing list archives

From "Hyunsik Choi (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TAJO-711) Add Avro storage support
Date Wed, 16 Apr 2014 06:22:16 GMT

    [ https://issues.apache.org/jira/browse/TAJO-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970483#comment-13970483 ]

Hyunsik Choi edited comment on TAJO-711 at 4/16/14 6:21 AM:
------------------------------------------------------------

Excellent! Big +1 for the latest patch. I tested the latest patch in a local cluster, and it works
perfectly. Thank you for your awesome contribution! I'll commit it if there are no additional
comments by tonight.

I have one very trivial suggestion. An instance of a FileScanner, including AvroScanner, can be
created and then closed without the {{FileScanner::init()}} method ever being invoked. I'm sorry
for not mentioning this in the javadoc. In any case, {{FileScanner::close()}} should null-check
its member variables before using them.
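
The null check suggested above might look roughly like the following. This is a minimal sketch, not Tajo's actual code: {{SketchScanner}} and its {{input}} field are hypothetical stand-ins for a scanner whose resources are only assigned in {{init()}}.

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

final class NullSafeCloseDemo {

    /** Hypothetical stand-in for a file scanner; not Tajo's FileScanner API. */
    static final class SketchScanner implements Closeable {
        // Assigned only in init(), so it is still null if close() runs first.
        private InputStream input;
        private boolean closed;

        void init() {
            // Placeholder resource; a real scanner would open its file here.
            input = new ByteArrayInputStream(new byte[0]);
        }

        @Override
        public void close() throws IOException {
            // Guard: close() may be called on a scanner that was never initialized.
            if (input != null) {
                input.close();
                input = null; // also makes a repeated close() a no-op
            }
            closed = true;
        }

        boolean isClosed() {
            return closed;
        }
    }

    public static void main(String[] args) throws IOException {
        SketchScanner neverInitialized = new SketchScanner();
        neverInitialized.close(); // would throw NullPointerException without the guard

        SketchScanner initialized = new SketchScanner();
        initialized.init();
        initialized.close();

        System.out.println("both scanners closed: "
            + (neverInitialized.isClosed() && initialized.isClosed()));
    }
}
```

Setting the field back to null after closing is a design choice in this sketch so that calling {{close()}} twice is also harmless.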

As I mentioned, I tested the patch on a local cluster. First, I prepared the Avro schema
as follows:

{code}
{
  "type": "record",
  "namespace": "org.apache.tajo",
  "name": "table1",
  "fields": [
    { "name": "id", "type": "int" },
    { "name": "name", "type": "string" }
  ]
}
{code}

Then, I created one database and one table as follows:
{code}
default> create database avro2;
Ok

default> \c avro2

avro> create table avro2 (id int, name text) using avro with ('avro.schema.url' = 'file:///home/hyunsik/schema.avsc');
Ok
avro> \d avro2

table name: avro.avro2
table path: hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2
store type: AVRO
number of rows: 0
volume: 0 B
Options: 
  'avro.schema.url'='file:///home/hyunsik/schema.avsc'

schema: 
id  INT4
name  TEXT
{code}

Next, I inserted 6,001,215 rows into the Avro table via an {{INSERT OVERWRITE INTO}} statement
as follows:

{code}
avro> insert overwrite into avro2 (id, name) select l_orderkey::int4, l_returnflag from tpch.lineitem;
Progress: 8%, response time: 0.397 sec
Progress: 17%, response time: 1.2 sec
Progress: 69%, response time: 2.202 sec
Progress: 100%, response time: 2.909 sec
final state: QUERY_SUCCEEDED, response time: 2.909 sec
OK
{code}

I checked the generated files.
{noformat}
[hyunsik@local05 hadoop-2.3.0]$ bin/hadoop dfs -ls hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /home/hyunsik/Code/hadoop-2.3.0/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
14/04/16 14:43:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 23 items
-rw-r--r--   3 hyunsik supergroup    1331444 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000000
-rw-r--r--   3 hyunsik supergroup    1335487 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000001
-rw-r--r--   3 hyunsik supergroup    1335522 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000002
-rw-r--r--   3 hyunsik supergroup    1351444 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000003
-rw-r--r--   3 hyunsik supergroup    1590096 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000004
-rw-r--r--   3 hyunsik supergroup    1590222 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000005
-rw-r--r--   3 hyunsik supergroup    1589538 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000006
-rw-r--r--   3 hyunsik supergroup    1590408 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000007
-rw-r--r--   3 hyunsik supergroup    1590168 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000008
-rw-r--r--   3 hyunsik supergroup    1589226 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000009
-rw-r--r--   3 hyunsik supergroup    1589688 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000010
-rw-r--r--   3 hyunsik supergroup    1589790 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000011
-rw-r--r--   3 hyunsik supergroup    1590048 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000012
-rw-r--r--   3 hyunsik supergroup    1590204 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000013
-rw-r--r--   3 hyunsik supergroup    1590234 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000014
-rw-r--r--   3 hyunsik supergroup    1589562 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000015
-rw-r--r--   3 hyunsik supergroup    1590276 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000016
-rw-r--r--   3 hyunsik supergroup    1590720 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000017
-rw-r--r--   3 hyunsik supergroup    1590198 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000018
-rw-r--r--   3 hyunsik supergroup    1589508 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000019
-rw-r--r--   3 hyunsik supergroup    1590042 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000020
-rw-r--r--   3 hyunsik supergroup    1589814 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000021
-rw-r--r--   3 hyunsik supergroup    1026861 2014-04-16 14:40 hdfs://127.0.0.1:8020/tajo/warehouse/avro/avro2/part-01-000022
{noformat}
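
As a rough sanity check, the part-file sizes in the listing above can be summed and divided by the 6,001,215 inserted rows. The sketch below is illustrative only: the class and method names are hypothetical, and the sizes are copied by hand from the listing.

```java
import java.util.stream.LongStream;

final class PartFileStats {
    // Sizes in bytes of part-01-000000 .. part-01-000022, copied from the listing.
    static final long[] PART_SIZES = {
        1331444L, 1335487L, 1335522L, 1351444L, 1590096L, 1590222L,
        1589538L, 1590408L, 1590168L, 1589226L, 1589688L, 1589790L,
        1590048L, 1590204L, 1590234L, 1589562L, 1590276L, 1590720L,
        1590198L, 1589508L, 1590042L, 1589814L, 1026861L
    };

    // Number of rows inserted into the table, as stated in the comment.
    static final long ROW_COUNT = 6_001_215L;

    /** Total size of all part files in bytes. */
    static long totalBytes() {
        return LongStream.of(PART_SIZES).sum();
    }

    /** Average on-disk bytes per row, including Avro file overhead. */
    static double averageBytesPerRow() {
        return (double) totalBytes() / ROW_COUNT;
    }

    public static void main(String[] args) {
        System.out.printf("files=%d total=%d B avg=%.2f B/row%n",
            PART_SIZES.length, totalBytes(), averageBytesPerRow());
    }
}
```

Running it prints the file count, the total size in bytes, and the average number of bytes per row.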

Then, I tried to execute some simple and distributed queries:

{noformat}
avro> select id from avro2 limit 10;
Progress: 100%, response time: 0.351 sec
final state: QUERY_SUCCEEDED, response time: 0.351 sec
result: 10 rows (80 B)
id
-------------------------------
1860579
1860579
1860579
1860580
1860580
1860580
1860580
1860580
1860580
1860581

avro> select id, name from avro2 order by id asc limit 10;
Progress: 8%, response time: 0.399 sec
Progress: 41%, response time: 1.202 sec
Progress: 100%, response time: 1.574 sec
final state: QUERY_SUCCEEDED, response time: 1.574 sec
result: 10 rows (40 B)
id,  name
-------------------------------
1,  N
1,  N
1,  N
1,  N
1,  N
1,  N
2,  N
3,  R
3,  R
3,  A

avro> select id, name from avro2 order by id desc limit 10;
Progress: 6%, response time: 0.401 sec
Progress: 45%, response time: 1.203 sec
Progress: 100%, response time: 1.551 sec
final state: QUERY_SUCCEEDED, response time: 1.551 sec
result: 10 rows (100 B)
id,  name
-------------------------------
6000000,  N
6000000,  N
5999975,  R
5999975,  A
5999975,  A
5999974,  R
5999974,  R
5999973,  N
5999972,  N
5999972,  N

avro> select count(id), count(name) from avro2;
Progress: 19%, response time: 0.401 sec
Progress: 100%, response time: 0.776 sec
final state: QUERY_SUCCEEDED, response time: 0.776 sec
result: 1 rows (16 B)
?count,  ?count_1
-------------------------------
6001215,  6001215
{noformat}




> Add Avro storage support
> ------------------------
>
>                 Key: TAJO-711
>                 URL: https://issues.apache.org/jira/browse/TAJO-711
>             Project: Tajo
>          Issue Type: New Feature
>            Reporter: David Chen
>            Assignee: David Chen
>         Attachments: TAJO-711.patch, TAJO-711.patch, TAJO-711_140415_rebased.patch, TAJO-711_20140413_20:36:40.patch, TAJO-711_20140413_21:00:34.patch, TAJO-711_20140413_21:46:27.patch, TAJO-711_20140414_11:07:13.patch, TAJO-711_20140415_11:13:43.patch
>
>
> Add {{FileScanner}} and {{FileAppender}} for reading from and writing to Avro.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
