tajo-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Tajo Wiki] Update of "Roadmap" by HyunsikChoi
Date Tue, 26 Mar 2013 09:28:36 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tajo Wiki" for change notification.

The "Roadmap" page has been changed by HyunsikChoi:

Moved the roadmap from github wiki.

New page:
= Roadmap =

== Milestone ==
 * 0.2 - first release as an incubating project focused on ASF compliance
 * 0.3 - a more stable API, more robust features, and a rudimentary cost-based optimizer
 * 0.4 - broader SQL support and an improved cost-based optimizer
 * 0.5 - a native columnar execution engine

== Long Term Plan ==
 * Integration with Hadoop ecosystem
  * The Tajo catalog needs to either support HCatalog or be compatible with Hive metadata.
 * The native columnar execution engine
 * Cost-based optimization which also includes a rewrite rule engine and various rewrite rules

== Short/Mid Term Plan ==
 * Improvement of the DAG framework
  * Query is both a finite state machine (FSM) and a DAG representation.
    * It would be good to separate Query into an FSM part and a DAG part.
  * We need an easier interface for editing and building DAGs.
 * RCFile
  * In the current implementation, Tajo's RCFile is not compatible with Hive's because it uses Datum to (de)serialize data. So we will add an RCFile wrapper class that is compatible with Hive's files.
 * ORCFile
  * It looks promising. We need to port ORCFile.
 * Trevni
  * TrevniScanner works well in most cases. However, it does not support null values. We need to handle them.
 * Hadoop security in tajo-rpc
  * tajo-rpc does not support Hadoop security. Since Tajo will be part of the Hadoop ecosystem, we need to apply Hadoop security to tajo-rpc.
 * Intermediate Data Format
  * Tajo uses CSV as the intermediate data format. CSV may cause CPU overhead and is relatively large to transmit over the network. We need to change it.
 * JDBC/ODBC drivers
  * Tajo is a relational data warehouse (DW) system. With such drivers, it can be easily integrated with existing BI and OLAP tools.
 * RESTful API
  * It would be very useful for web-based applications.
 * Proper resource allocation for SubQuery (i.e., Execution Block in PPT)
    * A SubQuery is one step of a multi-step query. For each subquery, QueryMaster launches TaskRunners via Yarn, and the launched TaskRunners are reused within that subquery.
    * Currently, QueryMaster assigns a fixed amount of resources (2G of memory) to every subquery regardless of what it actually needs. We need to improve this so that each subquery is allocated appropriate resources. For example, QueryMaster could assign 1G to a subquery that only scans and 2G to a subquery that includes joins.
 * Error handling of TajoCli
   * TajoCli is a command line interface that uses Jline2. However, its error handling is poor: it frequently halts when trivial exceptions occur.
 * SQL data types
   * Currently, Tajo provides data types (i.e., byte, bool, int, long, float, double, bytes,
and string) based on Java primitive types. Tajo should support SQL standard data types.
 * Local mode
   * Queries are always executed in distributed mode; in other words, Tajo always uses Yarn. However, this is inconvenient for debugging and inefficient on a single machine. We need to implement a local mode.
 * Parallel launch of containers
   * Currently, node containers are launched sequentially (see TaskRunnerLauncherImpl.java). This looks very inefficient. We can improve it by launching them in parallel with an ExecutorService.
 * Output commit
   * In some cases, Tajo is fault tolerant, which requires an output commit mechanism. However, Tajo does not support one yet, and we need this feature.
 * Broadcast join and Limit operator
   * They have been disabled since the Yarn port. We should re-enable them.
 * HbaseScanner/Appender
   * HBase would be a great storage backend for Tajo.
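
For the "Parallel launch of containers" item above, the idea could be sketched roughly as follows. This is only an illustration and not the code in TaskRunnerLauncherImpl.java: `ContainerStarter`, `launchAll`, the container ids, and the thread count are all hypothetical names and values introduced here, and the real launch call would go through Yarn rather than a lambda.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelLaunchSketch {

  /** Hypothetical stand-in for whatever actually starts one container. */
  interface ContainerStarter {
    String start(String containerId) throws Exception;
  }

  /**
   * Submits one launch task per container to a fixed-size thread pool and
   * waits for all of them. Results are returned in the same order as the ids.
   */
  static List<String> launchAll(List<String> containerIds,
                                ContainerStarter starter,
                                int threads)
      throws InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<String>> tasks = new ArrayList<>();
      for (String id : containerIds) {
        tasks.add(() -> starter.start(id));
      }
      List<String> results = new ArrayList<>();
      // invokeAll blocks until every task has completed (or failed).
      for (Future<String> future : pool.invokeAll(tasks)) {
        // get() rethrows a launch failure wrapped in ExecutionException.
        results.add(future.get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<String> ids = Arrays.asList("container-1", "container-2", "container-3");
    // The lambda only simulates a launch; it is an assumption for this sketch.
    List<String> started = launchAll(ids, id -> "started " + id, 2);
    for (String line : started) {
      System.out.println(line);
    }
  }
}
```

A bounded pool is used here so that a large subquery does not create one thread per container; whether a fixed size is the right choice for Tajo is an open design question.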
