carbondata-issues mailing list archives

From "Aniket Adnaik (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CARBONDATA-1072) Streaming Ingestion Feature
Date Sat, 19 Aug 2017 01:14:00 GMT

     [ https://issues.apache.org/jira/browse/CARBONDATA-1072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Adnaik updated CARBONDATA-1072:
--------------------------------------
    Attachment: StreamingIngestionSupportInCarbonData.pdf

> Streaming Ingestion Feature 
> ----------------------------
>
>                 Key: CARBONDATA-1072
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-1072
>             Project: CarbonData
>          Issue Type: New Feature
>          Components: core, data-load, data-query, examples, file-format, spark-integration, sql
>    Affects Versions: NONE
>            Reporter: Aniket Adnaik
>             Fix For: NONE
>
>         Attachments: StreamingIngestionSupportInCarbonData.pdf
>
>
> High-level breakdown of work items / implementation phases:
> Design document will be attached soon.
>  
> Phase 1: Spark Structured Streaming with regular CarbonData format
> ----------------------------
>     This phase will mainly focus on supporting streaming ingestion
>     through Spark Structured Streaming.
>     1. Write Path Implementation
>        - Integration with Spark’s Structured Streaming framework
>          (FileStreamSink etc.)
>        - StreamingOutputWriter (StreamingOutputWriterFactory)
>        - Prepare write (schema validation, segment creation,
>          streaming file creation, etc.)
>        - StreamingRecordWriter (data conversion from Catalyst InternalRow
>          to a CarbonData-compatible format; makes use of the new load path)
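>
>       A minimal usage sketch of what the write path could look like from
>       the user's side, assuming a "carbondata" streaming sink is registered
>       (the format name, options, and paths below are illustrative, not a
>       final API):
>
>         import org.apache.spark.sql.SparkSession
>
>         val spark = SparkSession.builder()
>           .appName("CarbonStreamingIngest")
>           .getOrCreate()
>
>         // Any Structured Streaming source works; a socket source keeps
>         // the example self-contained.
>         val input = spark.readStream
>           .format("socket")
>           .option("host", "localhost")
>           .option("port", 9099)
>           .load()
>
>         // Hypothetical CarbonData sink with FileStreamSink-style
>         // semantics; the checkpoint directory tracks source offsets
>         // and committed batches.
>         val query = input.writeStream
>           .format("carbondata")                      // assumed format name
>           .option("checkpointLocation", "/tmp/ckpt") // offsets + commits
>           .start("/tmp/carbon/streaming_table")
>
>         query.awaitTermination()
>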
>      2. Read Path Implementation (some overlap with Phase 2)
>       - Modify getSplits() to read from the streaming segment
>       - Read committed info from the metadata to get correct offsets
>       - Make use of the min-max index if available
>       - Use a sequential scan; the data is unsorted, so the B-tree
>         index cannot be used
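>
>       A tiny sketch of the min-max pruning idea above: a streaming blocklet
>       is scanned only when the query's filter range overlaps its [min, max]
>       statistics (class and method names are illustrative, not existing
>       CarbonData classes):
>
>         // Blocklets that survive pruning must still be scanned
>         // sequentially, since the rows inside them are unsorted.
>         case class BlockletStats(min: Long, max: Long)
>
>         def pruneByMinMax(blocklets: Seq[BlockletStats],
>                           lo: Long, hi: Long): Seq[BlockletStats] =
>           blocklets.filter(b => b.max >= lo && b.min <= hi)
>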
>     3.	Compaction
>      -	Minor Compaction
>      -	Major Compaction
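>
>      For reference, streaming segments would be folded back into the table
>      through the existing CarbonData compaction DDL; a sketch of how that
>      is invoked (the table name is illustrative):
>
>        import org.apache.spark.sql.SparkSession
>
>        val spark = SparkSession.builder()
>          .appName("CompactStreamingTable")
>          .getOrCreate()
>
>        // MINOR merges small (e.g. streaming) segments; MAJOR merges
>        // segments up to the configured size threshold.
>        spark.sql("ALTER TABLE streaming_table COMPACT 'MINOR'")
>        spark.sql("ALTER TABLE streaming_table COMPACT 'MAJOR'")
>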
>    4.	Metadata Management
>      - Streaming metadata store (e.g. offsets, timestamps, etc.)
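>
>      A possible shape for one streaming metadata record (field names are
>      illustrative, not a final on-disk schema):
>
>        // One entry per committed micro-batch; readers use committedOffset
>        // to know how far into the streaming file it is safe to scan.
>        case class StreamingSegmentMeta(
>          segmentId: String,
>          batchId: Long,         // Structured Streaming batch/epoch id
>          committedOffset: Long, // last durable offset in the streaming file
>          commitTimestamp: Long)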
>    
>    5.	Failure Recovery
>       -	Rollback on failure
>       -	Handle asynchronous writes to CarbonData (using hflush) 
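>
>       A minimal sketch of the hflush durability point, using the standard
>       Hadoop FileSystem API (the path and byte payload are placeholders):
>
>         import org.apache.hadoop.conf.Configuration
>         import org.apache.hadoop.fs.{FileSystem, Path}
>
>         val fs = FileSystem.get(new Configuration())
>         val out = fs.create(new Path("/tmp/carbon/streaming_table/stream-0"))
>
>         val batchBytes: Array[Byte] = Array(1, 2, 3) // stand-in for rows
>         out.write(batchBytes)
>         // Data becomes visible to readers without closing the file; on
>         // failure, recovery rolls back to the last committed offset.
>         out.hflush()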
> -----------------------------
> Phase 2: Spark Structured Streaming with appendable CarbonData format
>      1. Streaming File Format
>      - Writers use the V3 file format to append columnar, unsorted
>        data blocklets
>      - Modify readers to read from the appendable streaming file format
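>
>      A hedged sketch of the append step, with encodeBlocklet standing in
>      for the (hypothetical) V3 blocklet encoder:
>
>        import org.apache.hadoop.conf.Configuration
>        import org.apache.hadoop.fs.{FileSystem, Path}
>
>        // Append one unsorted columnar blocklet to an existing streaming
>        // file and hflush so readers can discover it before close().
>        def appendBlocklet(file: String, blocklet: Array[Byte]): Unit = {
>          val fs = FileSystem.get(new Configuration())
>          val out = fs.append(new Path(file))
>          try {
>            out.write(blocklet)
>            out.hflush()
>          } finally out.close()
>        }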
> -----------------------------
> Phase 3:
>     1. Interoperability Support
>      - Functionality with other features/components
>      - Concurrent queries with streaming ingestion
>      - Concurrent operations with streaming ingestion (e.g. compaction,
>        alter table, secondary index, etc.)
>     2. Kafka Connect Ingestion / CarbonData connector
>      - Direct ingestion from Kafka Connect without Spark Structured
>        Streaming
>      - Separate Kafka connector to receive data through a network port
>      - Data commit and offset management
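>
>      A sketch of the direct-ingestion loop, using the plain Kafka consumer
>      API with manual offset commits so the Kafka offset only advances after
>      the data is durable in CarbonData (writeToCarbon is a hypothetical
>      stand-in for the CarbonData writer):
>
>        import java.time.Duration
>        import java.util.{Collections, Properties}
>        import org.apache.kafka.clients.consumer.KafkaConsumer
>
>        def writeToCarbon(value: String): Unit = println(value) // placeholder
>
>        val props = new Properties()
>        props.put("bootstrap.servers", "localhost:9092")
>        props.put("group.id", "carbon-ingest")
>        props.put("enable.auto.commit", "false") // commit manually
>        props.put("key.deserializer",
>          "org.apache.kafka.common.serialization.StringDeserializer")
>        props.put("value.deserializer",
>          "org.apache.kafka.common.serialization.StringDeserializer")
>
>        val consumer = new KafkaConsumer[String, String](props)
>        consumer.subscribe(Collections.singletonList("events"))
>        while (true) {
>          val records = consumer.poll(Duration.ofMillis(500))
>          records.forEach(r => writeToCarbon(r.value()))
>          consumer.commitSync() // commit only after a durable write
>        }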
> -----------------------------
> Phase 4: Support for other streaming engines
>     - Analysis of streaming APIs/interfaces of other streaming engines
>     - Implementation of connectors for different streaming engines
>       (Storm, Flink, Flume, etc.)
> ----------------------------
> Phase 5: In-memory streaming table (probable feature)
>    1. In-memory cache for streaming data
>      - Fault-tolerant in-memory buffering / checkpointing with WAL
>      - Readers read from in-memory tables if available
>      - Background threads for writing streaming data, etc.
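>
>      A minimal write-ahead-log sketch for the fault-tolerant in-memory
>      buffer: every batch is made durable in the WAL before it becomes
>      readable in memory (class and field names are illustrative):
>
>        import java.io.{DataOutputStream, FileOutputStream}
>        import scala.collection.mutable.ArrayBuffer
>
>        class StreamBuffer(walPath: String) {
>          private val buffer = ArrayBuffer[Array[Byte]]()
>          private val wal =
>            new DataOutputStream(new FileOutputStream(walPath, true))
>
>          def append(batch: Array[Byte]): Unit = {
>            wal.writeInt(batch.length) // length-prefixed WAL record first...
>            wal.write(batch)
>            wal.flush()
>            buffer += batch            // ...then the readable in-memory copy
>          }
>        }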



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
