griffin-dev mailing list archives

From "Azhar (Jira)" <j...@apache.org>
Subject [jira] [Updated] (GRIFFIN-335) Hive Connector: Ability to Use "group by" clause
Date Sun, 12 Jul 2020 03:56:00 GMT

     [ https://issues.apache.org/jira/browse/GRIFFIN-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Azhar updated GRIFFIN-335:
--------------------------
    Description: 
*Background:*

Refer to [https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-334 |https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-332] and
https://issues.apache.org/jira/browse/GRIFFIN-333.

 If we have the ability to select specific columns, it opens the door to SQL-based
aggregation, further reducing the volume of data read from Hive sources.

*Proposed Improvement:*
 So, I propose a feature that allows the Hive connector to use SQL-based aggregations.

 

Let's say we have source and target tables that have data like below.

src:
{code:java}
------------------------
|employee_id   |country|
------------------------
|1             | NZ    |
|2             | DE    |
|3             | DE    |
|4             | NZ    |
|5             | DE    |
....
....
------------------------
{code}
tgt:
{code:java}
------------------------
|total_employee|country|
------------------------
|10            | NZ    |
|11            | DE    |
------------------------
{code}
Then we can perform an `accuracy` check [ `"rule":"src.total_employee = tgt.total_employee and
src.country = tgt.country"` ] directly, as shown below, using the `columns` and `groupby` clauses
for the source table:
{code:java}
      {
         "name":"src",
         "connector":{
            "type":"hive",
            "config":{
               "database":"mydatabase",
               "table.name":"mytable",
               "columns": "count(*) total_employee, country",
               "groupby": "country",
               "where":""
            }
         }
      }
{code}
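
To make the proposal concrete, here is a minimal Python sketch of how a connector might turn the `columns`, `groupby`, and `where` settings into a single Hive query. The helper `build_hive_query` is hypothetical and is not Griffin's actual implementation; it only assumes that a missing or empty setting means the corresponding clause is omitted, and that a missing `columns` value falls back to `*`.

```python
def build_hive_query(config):
    """Assemble a SELECT statement from a hive connector config dict.

    Hypothetical helper: the key names mirror the proposal above, but
    this is an illustration, not Griffin's connector code.
    """
    # Assumption: fall back to all columns when "columns" is absent or blank.
    columns = (config.get("columns") or "").strip() or "*"
    table = f'{config["database"]}.{config["table.name"]}'
    query = f"SELECT {columns} FROM {table}"

    # Assumption: an empty string means "no clause", as with "where": "".
    where = (config.get("where") or "").strip()
    if where:
        query += f" WHERE {where}"

    groupby = (config.get("groupby") or "").strip()
    if groupby:
        query += f" GROUP BY {groupby}"
    return query


config = {
    "database": "mydatabase",
    "table.name": "mytable",
    "columns": "count(*) total_employee, country",
    "groupby": "country",
    "where": "",
}
print(build_hive_query(config))
# SELECT count(*) total_employee, country FROM mydatabase.mytable GROUP BY country
```

With a query like this, only one row per country would leave Hive, and the aliased `total_employee` column would be available to the accuracy rule on the source side.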

  was:
*Background:*

Refer to [https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-334 |https://issues.apache.org/jira/projects/GRIFFIN/issues/GRIFFIN-332] and
https://issues.apache.org/jira/browse/GRIFFIN-333.

 If we have the ability to select specific columns, it opens the door to SQL-based
aggregation, further reducing the volume of data read from Hive sources.

*Proposed Improvement:*
So, I propose a feature that allows the Hive connector to use SQL-based aggregations.

 

Let's say we have source and target tables that have data like below.

src:
{code:java}
------------------------
|employee_id   |country|
------------------------
|1             | NZ    |
|2             | DE    |
|3             | DE    |
|4             | NZ    |
|5             | DE    |
....
....
------------------------
{code}
tgt:
{code:java}
------------------------
|total_employee|country|
------------------------
|10            | NZ    |
|11            | DE    |
------------------------
{code}
Then we can perform an `accuracy` check directly, as shown below, using the `columns` and `groupby` clauses
for the source table:
{code:java}
      {
         "name":"src",
         "connector":{
            "type":"hive",
            "config":{
               "database":"mydatabase",
               "table.name":"mytable",
               "columns": "count(*) total_employee, country",
               "groupby": "country",
               "where":""
            }
         }
      }
{code}


> Hive Connector: Ability to Use "group by" clause
> ------------------------------------------------
>
>                 Key: GRIFFIN-335
>                 URL: https://issues.apache.org/jira/browse/GRIFFIN-335
>             Project: Griffin
>          Issue Type: Improvement
>          Components: accuracy-batch
>    Affects Versions: 0.6.0
>            Reporter: Azhar
>            Priority: Major
>              Labels: columns, groupby, hive



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
