carbondata-issues mailing list archives

From "Aniket Adnaik (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CARBONDATA-1072) Streaming Ingestion Feature
Date Sat, 20 May 2017 03:52:04 GMT
Aniket Adnaik created CARBONDATA-1072:
-----------------------------------------

             Summary: Streaming Ingestion Feature 
                 Key: CARBONDATA-1072
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1072
             Project: CarbonData
          Issue Type: New Feature
          Components: core, data-load, data-query, examples, file-format, spark-integration, sql
    Affects Versions: 1.1.0
            Reporter: Aniket Adnaik
             Fix For: 1.2.0


High-level breakdown of work items / implementation phases:
A design document will be attached soon.
 
Phase 1: Spark Structured Streaming with the regular CarbonData format
----------------------------
    This phase will mainly focus on supporting streaming ingestion using
    Spark Structured Streaming.
    1. Write Path Implementation (see the sketch below)
       - Integration with Spark's Structured Streaming framework
         (FileStreamSink, etc.)
       - StreamingOutputWriter (StreamingOutputWriterFactory)
       - Prepare write (schema validation, segment creation,
         streaming file creation, etc.)
       - StreamingRecordWriter (data conversion from Catalyst InternalRow
         to a CarbonData-compatible format, making use of the new load path)
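
A minimal sketch of what the user-facing write path could look like. The "carbondata" sink format name and the option keys are assumptions for illustration, not a committed API; only the Structured Streaming calls themselves are real Spark API.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StreamingIngestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CarbonStreamingIngest")
      .getOrCreate()

    // Any Structured Streaming source works; a socket source keeps the
    // example self-contained.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // FileStreamSink-style integration implies a checkpoint location for
    // offset tracking. Format name and options below are assumptions.
    val query = lines.writeStream
      .format("carbondata")
      .option("dbName", "default")
      .option("tableName", "stream_table")
      .option("checkpointLocation", "/tmp/carbon_ckpt")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    query.awaitTermination()
  }
}
{code}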

     2. Read Path Implementation (some overlap with Phase 2; see the sketch below)
      - Modify getSplits() to read from the streaming segment
      - Read committed info from the metadata to get correct offsets
      - Make use of the Min-Max index if available
      - Use a sequential scan - the data is unsorted, so the B-tree index cannot be used
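
To illustrate the committed-offset idea on the read path: readers should only scan up to the last committed offset so partially flushed rows are never visible. StreamingSplit and the metadata lookup below are hypothetical names, not existing CarbonData classes.

{code:scala}
// Hypothetical split over a streaming file, bounded by the committed offset.
case class StreamingSplit(filePath: String, committedOffset: Long)

object StreamingReadPathSketch {
  // In the real design this would be read from the streaming metadata
  // store; a constant keeps the sketch self-contained.
  private def committedOffset(filePath: String): Long = 4096L

  // What getSplits() would add: one bounded split per streaming file.
  def streamingSplits(streamingFiles: Seq[String]): Seq[StreamingSplit] =
    streamingFiles.map(f => StreamingSplit(f, committedOffset(f)))
}
{code}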

    3. Compaction (see the SQL example below)
     - Minor Compaction
     - Major Compaction
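
CarbonData already exposes compaction through SQL; the streaming work would make streaming segments eligible for the same commands. This assumes a Carbon-enabled SparkSession, and the table name is illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession

object CompactionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CompactStreamTable").getOrCreate()
    // Existing CarbonData compaction DDL; `stream_table` is illustrative.
    spark.sql("ALTER TABLE stream_table COMPACT 'MINOR'")
    spark.sql("ALTER TABLE stream_table COMPACT 'MAJOR'")
  }
}
{code}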

   4. Metadata Management (a possible record shape is sketched below)
     - Streaming metadata store (e.g. offsets, timestamps, etc.)
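
One possible shape for a streaming metadata record, purely illustrative:

{code:scala}
// Hypothetical commit record for the streaming metadata store: one per
// successful micro-batch, tying source offsets to durable file lengths.
case class StreamingCommit(
    batchId: Long,                            // Structured Streaming batch id
    sourceOffsets: Map[String, Long],         // e.g. topic-partition -> offset
    committedFileLengths: Map[String, Long],  // streaming file -> durable length
    commitTimestampMs: Long)                  // commit time, epoch millis
{code}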
   
   5. Failure Recovery (see the hflush sketch below)
      - Rollback on failure
      - Handle asynchronous writes to CarbonData (using hflush)
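
The hflush-based durability handshake could look roughly like this; hflush() is the real HDFS Syncable call, everything else is illustrative:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HflushSketch {
  // Append one batch, hflush it, and return the durable length that the
  // metadata store should record. On recovery, any bytes beyond the last
  // recorded length are rolled back (ignored or truncated).
  def appendBatch(batch: Array[Byte], file: Path): Long = {
    val fs = FileSystem.get(new Configuration())
    val out = if (fs.exists(file)) fs.append(file) else fs.create(file)
    try {
      out.write(batch)
      out.hflush()      // make the bytes visible/durable to new readers
      out.getPos        // committed length for the metadata store
    } finally {
      out.close()
    }
  }
}
{code}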
----------------------------	
Phase 2: Spark Structured Streaming with an appendable CarbonData format
     1. Streaming File Format (see the reader sketch below)
       - Writers use the V3 file format to append columnar, unsorted
         data blocklets
       - Modify readers to read from the appendable streaming file format
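
An illustrative reader loop over such an append-only stream of blocklets, assuming a simple length-prefixed layout for the sake of the sketch (the actual V3 layout will differ):

{code:scala}
import java.io.DataInputStream

object AppendableFormatReaderSketch {
  // The scan stops at the committed length, not at EOF, so a concurrently
  // appending writer never exposes a torn blocklet. The length-prefix
  // layout is an assumption for illustration only.
  def readBlocklets(in: DataInputStream, committedLength: Long): Unit = {
    var pos = 0L
    while (pos < committedLength) {
      val len = in.readInt()
      val blocklet = new Array[Byte](len)
      in.readFully(blocklet)
      pos += 4 + len
      // decode columnar pages from `blocklet` here
    }
  }
}
{code}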
-----------------------------	
Phase 3:
     1. Interoperability Support
       - Functionality with other features/components
       - Concurrent queries with streaming ingestion
       - Concurrent operations with streaming ingestion (e.g. Compaction,
         Alter table, Secondary Index, etc.)
    2. Kafka Connect Ingestion / CarbonData Connector (see the skeleton below)
      - Direct ingestion from Kafka Connect without Spark Structured
        Streaming
      - Separate Kafka connector to receive data through a network port
      - Data commit and offset management
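
A skeleton of what the Kafka Connect side could look like. The SinkTask surface shown is the real Kafka Connect API; all CarbonData behavior is left as placeholder comments.

{code:scala}
import java.util
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.connect.sink.{SinkRecord, SinkTask}

class CarbonSinkTaskSketch extends SinkTask {
  override def version(): String = "0.1.0-SNAPSHOT"

  override def start(props: util.Map[String, String]): Unit = {
    // open a CarbonData streaming writer for the configured table
  }

  override def put(records: util.Collection[SinkRecord]): Unit = {
    // convert each SinkRecord to CarbonData's row format and buffer it
  }

  override def flush(offsets: util.Map[TopicPartition, OffsetAndMetadata]): Unit = {
    // hflush buffered data, then persist `offsets` in the streaming
    // metadata store so data commit and offset management stay in step
  }

  override def stop(): Unit = {
    // close writers and release resources
  }
}
{code}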
-----------------------------
Phase 4: Support for other streaming engines
     - Analysis of streaming APIs/interfaces of other streaming engines
     - Implementation of connectors for different streaming engines (Storm,
       Flink, Flume, etc.); one possible shared contract is sketched below
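
One way to keep per-engine connectors thin is a shared, engine-agnostic writer contract that each adapter delegates to. This trait is entirely hypothetical:

{code:scala}
// Hypothetical engine-agnostic contract: Storm/Flink/Flume adapters
// translate their native sink callbacks into these four calls.
trait CarbonStreamWriter {
  def open(tableIdentifier: String): Unit
  def writeRow(row: Array[AnyRef]): Unit
  def commit(sourceOffsets: Map[String, Long]): Unit  // durability point
  def close(): Unit
}
{code}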

-----------------------------
Phase 5: In-memory streaming table (probable feature)
-----------------------------
   1. In-memory cache for streaming data (see the sketch below)
     - Fault-tolerant in-memory buffering / checkpointing with a WAL
     - Readers read from in-memory tables if available
     - Background threads for writing streaming data, etc.
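
A sketch of the WAL-first buffering idea; serialization and file handling are deliberately simplified:

{code:scala}
import java.io.{DataOutputStream, FileOutputStream}
import scala.collection.mutable.ArrayBuffer

class InMemoryStreamBufferSketch(walPath: String) {
  private val wal = new DataOutputStream(new FileOutputStream(walPath, true))
  private val rows = ArrayBuffer.empty[Array[Byte]]

  // WAL-first: a row is logged before it becomes visible in memory, so a
  // crash can replay the log instead of losing buffered data.
  def append(row: Array[Byte]): Unit = {
    wal.writeInt(row.length)
    wal.write(row)
    wal.flush()
    rows.synchronized { rows += row }
  }

  // Readers scan this snapshot; a background thread would periodically
  // drain the buffer into regular CarbonData files.
  def snapshot(): Seq[Array[Byte]] = rows.synchronized { rows.toVector }
}
{code}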




