tajo-issues mailing list archives

From "Hyunsik Choi (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TAJO-711) Add Avro storage support
Date Wed, 16 Apr 2014 05:14:17 GMT

    [ https://issues.apache.org/jira/browse/TAJO-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970429#comment-13970429 ]

Hyunsik Choi edited comment on TAJO-711 at 4/16/14 5:13 AM:
------------------------------------------------------------

This is my comment on the concept of a schema-evolving table.

A few days ago, I discussed your idea with Hyoungjun offline. We were very happy to see your
interesting idea. I got some additional suggestions from Hyoungjun, and I have added some
concrete ideas of my own to them.

I'd like to state some assumptions and define some terms before I discuss the idea.

 * A partitioned table has a schema.
  ** Let us call this schema 'parent schema'.
 * Each partition has its own schema.
  ** Let us call this schema 'partition schema'.
 * Let us call this kind of table 'a schema-evolving table'.

 (I know that my naming sense is not good. These are temporary names. I hope that someone
will suggest better ones.)

The rough idea is as follows:

 * Even though a schema is actually an ordered set of fields, we treat a schema as just a
set of fields when we deal with the relationship between the parent schema and partition schemas.
 * The schema of a schema-evolving table must be a superset of all fields in the partition schemas.
 * The field set of each partition schema must be a subset of the parent schema.
 * Fields with the same name in any partition schema or in the parent schema must have the same
data type.
 * Partition schemas can differ from one another.
 * The order of schema fields can differ among partitions. (This is because we treat
the fields as a set.)
 * Fields newly introduced by a new partition are appended to the tail of the parent schema.
   ** This schema maintenance will be performed when 'ALTER TABLE ADD PARTITION' is executed.
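
To make the rules above concrete, here is a minimal sketch in Java of the schema maintenance step. It is only an illustration under assumptions: the class ParentSchema, the method addPartition, and the use of plain strings for data types are hypothetical names chosen for this example, not existing Tajo APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these names come from the Tajo code base.
final class ParentSchema {
    // Field name -> data type name, kept in parent-schema order (insertion order).
    private final Map<String, String> fields = new LinkedHashMap<>();

    ParentSchema(Map<String, String> initialFields) {
        fields.putAll(initialFields);
    }

    /**
     * Merges a partition schema into the parent schema, as might be done when
     * 'ALTER TABLE ADD PARTITION' is executed. The partition schema is treated
     * as a set of fields, so its field order does not matter.
     */
    void addPartition(Map<String, String> partitionSchema) {
        // First pass: same-name fields must have the same data type.
        // Validating before mutating leaves the parent untouched on conflict.
        for (Map.Entry<String, String> field : partitionSchema.entrySet()) {
            String existingType = fields.get(field.getKey());
            if (existingType != null && !existingType.equals(field.getValue())) {
                throw new IllegalArgumentException(
                    "Type conflict for field '" + field.getKey() + "': parent has "
                        + existingType + ", partition has " + field.getValue());
            }
        }
        // Second pass: append fields the parent has not seen yet to its tail.
        for (Map.Entry<String, String> field : partitionSchema.entrySet()) {
            fields.putIfAbsent(field.getKey(), field.getValue());
        }
    }

    /** Returns the parent field names in parent-schema order. */
    List<String> fieldNames() {
        return new ArrayList<>(fields.keySet());
    }
}
```

With a parent schema of (id, name), adding a partition whose schema is (score, id) keeps id and name in place and appends score at the tail, so the parent stays a superset of every partition schema.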

In the planning phase, Tajo will use only the parent schema, and then it will rewrite the projection
plan for each partition if needed. When a field required by a query has no corresponding field
in a certain partition, the field will be a NULL value when that partition is processed. As a
result, processing multiple partitions with different schemas will output tuples with the same
schema via the same projection.
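
The per-partition rewrite described above could look roughly like the following Java sketch. It is an assumption-laden illustration, not Tajo's planner: PartitionProjection, rewrite, and project are hypothetical names, and tuples are modeled as plain lists of objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these names are not part of Tajo.
final class PartitionProjection {
    private PartitionProjection() {}

    /**
     * For each field required by the query (named as in the parent schema),
     * looks up its position in the partition schema. A position of -1 means
     * the partition does not contain the field.
     */
    static int[] rewrite(List<String> requiredFields, List<String> partitionFields) {
        int[] mapping = new int[requiredFields.size()];
        for (int i = 0; i < mapping.length; i++) {
            mapping[i] = partitionFields.indexOf(requiredFields.get(i));
        }
        return mapping;
    }

    /**
     * Applies the rewritten projection to one tuple read from the partition.
     * Fields missing from the partition are emitted as null (SQL NULL).
     */
    static List<Object> project(int[] mapping, List<Object> partitionTuple) {
        List<Object> output = new ArrayList<>(mapping.length);
        for (int index : mapping) {
            output.add(index < 0 ? null : partitionTuple.get(index));
        }
        return output;
    }
}
```

For a query that needs (id, name, score), a partition stored as (name, id) and a partition stored as (score, id, name) both yield three-column tuples in the order (id, name, score); the first partition simply produces null for score.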


> Add Avro storage support
> ------------------------
>
>                 Key: TAJO-711
>                 URL: https://issues.apache.org/jira/browse/TAJO-711
>             Project: Tajo
>          Issue Type: New Feature
>            Reporter: David Chen
>            Assignee: David Chen
>         Attachments: TAJO-711.patch, TAJO-711.patch, TAJO-711_140415_rebased.patch, TAJO-711_20140413_20:36:40.patch,
>                      TAJO-711_20140413_21:00:34.patch, TAJO-711_20140413_21:46:27.patch, TAJO-711_20140414_11:07:13.patch,
>                      TAJO-711_20140415_11:13:43.patch
>
>
> Add {{FileScanner}} and {{FileAppender}} for reading from and writing to Avro.



--
This message was sent by Atlassian JIRA
(v6.2#6252)