carbondata-issues mailing list archives

From "Ravindra Pesala (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CARBONDATA-742) Add batch sort to improve the loading performance
Date Fri, 03 Mar 2017 06:02:45 GMT

     [ https://issues.apache.org/jira/browse/CARBONDATA-742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravindra Pesala updated CARBONDATA-742:
---------------------------------------
    Description: 
Current Problem:
The sort step is a major bottleneck because it is a blocking step. It must receive all the data and
write the sort temp files to disk; only after that can the data writer step start.

Solution: 
Make the sort step non-blocking so that the data writer step does not have to wait for it.
Process the data in the sort step in batches sized to the in-memory capacity of the machine.
For example, if the machine can allocate 4 GB for in-memory processing, the sort step can sort
the data in batches of 2 GB and hand each batch to the data writer step. While the data writer
step consumes one batch, the sort step receives and sorts the next. All steps therefore work
continuously, and the sort step performs no disk IO.

The data writer step no longer waits for the whole sort to finish: as soon as the sort step
has sorted a batch in memory, the data writer can start writing it.
This can significantly improve loading performance.
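
The batching idea above can be sketched as a small producer/consumer pipeline. This is only an
illustration in Python, not CarbonData code: the function name batch_sort_pipeline, its parameters,
and the use of a row count instead of a byte budget for the batch size are all assumptions made
for the example. The bounded queue holds one sorted batch while the next one is being sorted,
which mirrors the 4 GB / 2 GB split in the description.

```python
import queue
import threading


def batch_sort_pipeline(rows, batch_size, sort_key=None, max_pending_batches=1):
    """Sort rows one in-memory batch at a time and hand each batch to a writer.

    Hypothetical sketch: batch_size is a row count here, whereas the proposal
    sizes batches by available memory.
    """
    # Bounded queue: the sort step blocks only when the writer falls behind.
    pending = queue.Queue(maxsize=max_pending_batches)
    written = []      # stands in for the output of the data writer step
    done = object()   # sentinel that tells the writer to stop

    def writer():
        while True:
            batch = pending.get()
            if batch is done:
                break
            written.append(batch)  # a real writer would write the batch out here

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()

    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            # Sort in memory only; no sort temp files are written.
            pending.put(sorted(batch, key=sort_key))
            batch = []
    if batch:
        pending.put(sorted(batch, key=sort_key))

    pending.put(done)
    writer_thread.join()
    return written
```

Each batch is sorted on its own, so the output is a sequence of independently sorted runs rather
than one globally sorted run. For example, batch_sort_pipeline([5, 3, 8, 1, 9, 2, 7], 3) yields
three sorted batches.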

Advantages:
Loading performance increases because there is no intermediate IO and the sort step no longer blocks.
Compaction needs no extra effort; the current flow can handle it.

Disadvantages:
The number of driver-side B-trees will increase, so memory usage might increase, but it can be
controlled by the current LRU cache implementation.
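
To illustrate how an LRU cache can bound the memory used by a growing number of index trees, here
is a minimal sketch in Python. It does not reproduce CarbonData's actual cache: the class name
LruIndexCache, its methods, and the byte-size accounting are hypothetical and exist only for this
example.

```python
from collections import OrderedDict


class LruIndexCache:
    """Hypothetical size-bounded cache that evicts the least recently used index."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (index, size_bytes), oldest first

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None  # caller would reload the index from storage
        self._entries.move_to_end(key)  # mark as most recently used
        return entry[0]

    def put(self, key, index, size_bytes):
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        self._entries[key] = (index, size_bytes)
        self.used_bytes += size_bytes
        # Evict the oldest entries until the cache fits again; the newest
        # entry is always kept, even if it alone exceeds the capacity.
        while self.used_bytes > self.capacity_bytes and len(self._entries) > 1:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
```

With such a bound, more batches mean more index entries competing for the same memory budget
rather than unbounded growth; the cost is that evicted indexes must be loaded again when needed.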


> Add batch sort to improve the loading performance
> -------------------------------------------------
>
>                 Key: CARBONDATA-742
>                 URL: https://issues.apache.org/jira/browse/CARBONDATA-742
>             Project: CarbonData
>          Issue Type: Improvement
>            Reporter: Ravindra Pesala



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
