hive-issues mailing list archives

From "Simanchal Das (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-14159) sorting of tuple array using multiple field[s]
Date Thu, 08 Sep 2016 03:46:20 GMT

     [ https://issues.apache.org/jira/browse/HIVE-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simanchal Das updated HIVE-14159:
---------------------------------
    Description: 
Problem Statement:

When working with complex data structures such as Avro, we often encounter arrays that
contain multiple tuples, where each tuple has a struct schema.

Suppose the struct schema is as follows:
{noformat}
{
	"name": "employee",
	"type": [{
		"type": "record",
		"name": "Employee",
		"namespace": "com.company.Employee",
		"fields": [{
			"name": "empId",
			"type": "int"
		}, {
			"name": "empName",
			"type": "string"
		}, {
			"name": "age",
			"type": "int"
		}, {
			"name": "salary",
			"type": "double"
		}]
	}]
}

{noformat}
When we run a Hive query, the complex array then looks like an array of Employee objects.
{noformat}
Example: 
	//(array<struct<empId,empName,age,salary>>)
	Array[Employee(100,Foo,20,20990),Employee(500,Boo,30,50990),Employee(700,Harry,25,40990),Employee(100,Tom,35,70990)]

{noformat}
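To make the shape of this data concrete outside Hive, the example array can be modelled as a list of named tuples. This is only an illustrative Python sketch: the `Employee` class and the `employees` list are assumptions that mirror the schema above, not anything Hive provides. It also shows why a plain sort is not enough here, since default tuple ordering compares fields by position (empId first) and gives no way to pick salary or empName as the key.

```python
from typing import NamedTuple


class Employee(NamedTuple):
    """Hypothetical Python mirror of the Employee record in the schema above."""
    empId: int
    empName: str
    age: int
    salary: float


# The example array from the description, as a list of Employee tuples.
employees = [
    Employee(100, "Foo", 20, 20990.0),
    Employee(500, "Boo", 30, 50990.0),
    Employee(700, "Harry", 25, 40990.0),
    Employee(100, "Tom", 35, 70990.0),
]

# Default tuple ordering is positional: empId, then empName, then age, then salary.
positional = sorted(employees)
names_in_positional_order = [e.empName for e in positional]
```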
In day-to-day business use cases we frequently need to sort such a tuple array by one or
more specific fields (for example empId, empName, or salary) in ASC or DESC order.


Proposal:

I have developed a UDF, 'sort_array_by', which sorts a tuple array by one or more fields
in ASC or DESC order as specified by the user; the default is ascending order.
{noformat}
Example:
	1. SELECT sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"salary","ASC");
	output: array[struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(500,Boo,30,50990),struct(100,Tom,35,70990)]

	2. SELECT sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,80990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"empName","salary","ASC");
	output: array[struct(500,Boo,30,50990),struct(500,Boo,30,80990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]

	3. SELECT sort_array_by(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"empName","salary","age","ASC");
	output: array[struct(500,Boo,30,50990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]
{noformat}
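The semantics described in the proposal can be emulated in a few lines of Python. This is a minimal sketch and not the UDF's actual implementation (the attached patches are not reproduced here): the Python function `sort_array_by`, the `Employee` class, and the sample data are all illustrative. It assumes that the optional trailing "ASC"/"DESC" argument applies to every listed field and that the order defaults to ascending when it is omitted.

```python
from typing import List, NamedTuple


class Employee(NamedTuple):
    """Hypothetical Python mirror of the Employee struct."""
    empId: int
    empName: str
    age: int
    salary: float


def sort_array_by(rows: List[Employee], *args: str) -> List[Employee]:
    """Sort rows by the named fields, in the order the fields are listed.

    The last argument may be "ASC" or "DESC" (assumed to apply to all
    fields); when it is omitted, ascending order is used.
    """
    fields = list(args)
    descending = False
    if fields and fields[-1].upper() in ("ASC", "DESC"):
        descending = fields.pop().upper() == "DESC"
    if not fields:
        raise ValueError("at least one field name is required")
    # Build a composite key so earlier fields take precedence over later ones.
    return sorted(
        rows,
        key=lambda row: tuple(getattr(row, name) for name in fields),
        reverse=descending,
    )


employees = [
    Employee(100, "Foo", 20, 20990.0),
    Employee(500, "Boo", 30, 50990.0),
    Employee(700, "Harry", 25, 40990.0),
    Employee(100, "Tom", 35, 70990.0),
]

by_salary = [e.empName for e in sort_array_by(employees, "salary", "ASC")]
by_name_then_salary = [e.empName for e in sort_array_by(employees, "empName", "salary")]
by_salary_desc = [e.empName for e in sort_array_by(employees, "salary", "DESC")]
```

With the sample data, `by_salary` follows the same order as the output of example 1 above.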

  was:
Problem Statement:

When working with complex data structures such as Avro, we often encounter arrays that
contain multiple tuples, where each tuple has a struct schema.

Suppose the struct schema is as follows:
{noformat}
{
	"name": "employee",
	"type": [{
		"type": "record",
		"name": "Employee",
		"namespace": "com.company.Employee",
		"fields": [{
			"name": "empId",
			"type": "int"
		}, {
			"name": "empName",
			"type": "string"
		}, {
			"name": "age",
			"type": "int"
		}, {
			"name": "salary",
			"type": "double"
		}]
	}]
}

{noformat}
When we run a Hive query, the complex array then looks like an array of Employee objects.
{noformat}
Example: 
	//(array<struct<empId,empName,age,salary>>)
	Array[Employee(100,Foo,20,20990),Employee(500,Boo,30,50990),Employee(700,Harry,25,40990),Employee(100,Tom,35,70990)]

{noformat}
In day-to-day business use cases we frequently need to sort such a tuple array by one or
more specific fields (for example empId, empName, or salary) in ASC or DESC order.


Proposal:

I have developed a UDF, 'sort_array_by', which sorts a tuple array by one or more fields
in ASC or DESC order as specified by the user; the default is ascending order.
{noformat}
Example:
	1. SELECT sort_array_field(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"salary","ASC");
	output: array[struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(500,Boo,30,50990),struct(100,Tom,35,70990)]

	2. SELECT sort_array_field(array[struct(100,Foo,20,20990),struct(500,Boo,30,80990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"empName","salary","ASC");
	output: array[struct(500,Boo,30,50990),struct(500,Boo,30,80990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]

	3. SELECT sort_array_field(array[struct(100,Foo,20,20990),struct(500,Boo,30,50990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)],"empName","salary","age","ASC");
	output: array[struct(500,Boo,30,50990),struct(100,Foo,20,20990),struct(700,Harry,25,40990),struct(100,Tom,35,70990)]
{noformat}


> sorting of tuple array using multiple field[s]
> ----------------------------------------------
>
>                 Key: HIVE-14159
>                 URL: https://issues.apache.org/jira/browse/HIVE-14159
>             Project: Hive
>          Issue Type: Improvement
>          Components: UDF
>            Reporter: Simanchal Das
>            Assignee: Simanchal Das
>              Labels: TODOC2.2, patch
>             Fix For: 2.2.0
>
>         Attachments: HIVE-14159.1.patch, HIVE-14159.2.patch, HIVE-14159.3.patch, HIVE-14159.4.patch
>
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
