carbondata-issues mailing list archives

From "Jarck (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CARBONDATA-754) order by query's performance is very bad
Date Fri, 10 Mar 2017 02:40:38 GMT
Jarck created CARBONDATA-754:
--------------------------------

             Summary: order by query's performance is very bad
                 Key: CARBONDATA-754
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-754
             Project: CarbonData
          Issue Type: Improvement
          Components: core, spark-integration
            Reporter: Jarck
            Assignee: Jarck


Currently, the performance of an order by query on a dimension is very bad if there is no filter or
the filtered data is still too large.
If I am not wrong, it reads all related data at the carbon scan physical level, decodes the
sort dimension's data, and sorts all of it in the Spark SQL sort physical plan.

I think we can optimize as below:

1. push down sort (+limit) to carbon scan 

2. leverage the fact that a dimension is stored in its natural sort order at the blocklet level
to get sorted data in each partition

3. implement merge-sort/TopN in Spark's sort physical plan
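
To illustrate steps 2 and 3, here is a minimal sketch in plain Python. It is not CarbonData or
Spark code. It assumes each partition already yields its values sorted by the dimension, and
the function name and sample data are hypothetical. With a limit N, each partition only needs
to contribute its first N values, and a k-way merge can stop after N results.

```python
import heapq
from itertools import islice


def top_n_from_sorted_partitions(partitions, limit):
    """Return the first `limit` values across partitions that are each sorted ascending.

    Hypothetical helper for illustration only; it is not part of CarbonData or Spark.
    """
    # Each partition can contribute at most `limit` values to the final result,
    # so the rest of every partition never needs to be read.
    truncated = [islice(iter(p), limit) for p in partitions]
    # heapq.merge performs a lazy k-way merge of already-sorted inputs.
    merged = heapq.merge(*truncated)
    # Stop as soon as `limit` values have been produced.
    return list(islice(merged, limit))


# Made-up sample data: three partitions, each sorted by the dimension.
partitions = [
    [1, 4, 9],
    [2, 3, 10],
    [5, 6, 7],
]
result = top_n_from_sorted_partitions(partitions, 4)
```

Because the merge is lazy, the work depends on the limit and the number of partitions rather
than on the total data size, which is the reason for pushing sort + limit down to the scan.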

Actually, I have optimized "order by only 1 dimension + limit" based on branch 0.2. The
performance is much better.
Sorting by 1 dimension + limit 10000 on 100 million rows takes less than 1 second to
get and print the result.





1. push down






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
