hive-dev mailing list archives

From "wangmeng (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-7296) big data approximate processing at a very low cost based on hive sql
Date Fri, 27 Jun 2014 06:25:25 GMT

     [ https://issues.apache.org/jira/browse/HIVE-7296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wangmeng updated HIVE-7296:
---------------------------

    Description: 
For big data analysis, we often need to run the following kinds of queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a collection, such as
unique visitors (UV).

The current Hive query:
select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element is repeated, such as the number
of site visits by a user.

Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): such as the top 100 shops.

Hive query: select count(1), name from TestTable group by name; (this also needs a UDF)

4. Range query: for example, find the number of users aged between 20 and 30.

Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a given user name already registered?
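
The membership query in item 5 has no Hive query attached. As a rough illustration of how
it could be answered approximately, here is a minimal Bloom filter sketch in Python. This is
not part of Hive or of this proposal: the class name, the bit-array size, the number of
hashes, and the sample user names are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with k hash probes per item (illustrative only)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions from two 64-bit halves of one SHA-256 digest
        # (double hashing); any well-mixed hash function would do here.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely never added"; True means "probably added".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical sample data standing in for the registered user names.
registered = BloomFilter()
for name in ("wangmeng", "alice", "bob"):
    registered.add(name)
```

Here registered.might_contain("wangmeng") is always True, while an unregistered name is
reported as present only with a small false-positive probability. The filter uses a fixed
8 KB of memory no matter how many names are added; the false-positive rate grows as it fills.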

Given how Hive executes these queries, they consume a large amount of memory and take a
long time.

However, in many cases we do not need exact results, and a small error can be tolerated.
In such cases, approximate processing can greatly improve time and space efficiency.
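
To make the trade-off concrete, here is a minimal Count-Min sketch in Python for the
frequency estimation case (item 2). It is only an illustration of the general technique,
not an existing Hive feature or the proposed design; the class name, the width and depth
values, and the sample visit data are assumptions.

```python
import hashlib


class CountMinSketch:
    """depth x width counter table; estimates never undercount (illustrative only)."""

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item: str, row: int) -> int:
        # Salt the hash with the row number to get one hash function per row.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item: str) -> int:
        # Collisions can only inflate a counter, so the minimum over all rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


# Hypothetical sample data: one entry per site visit.
visits = ["wangmeng"] * 30 + ["alice"] * 5 + ["bob"] * 2
sketch = CountMinSketch()
for name in visits:
    sketch.add(name)
```

sketch.estimate("wangmeng") returns a value that is at least the true count of 30 and never
below it. The table size is fixed by width and depth, so memory does not grow with the
number of distinct names, at the price of possible overestimates when names collide.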

Based on some theoretical analysis materials, I would very much like to work on these new
features, if possible.

So, is there anything I can do? Many thanks.


  was:
For big data analysis, we often need to run the following kinds of queries and statistics:

1. Cardinality estimation: count the number of distinct elements in a collection, such as
unique visitors (UV).

The current Hive query:
select count(distinct id) from TestTable;

2. Frequency estimation: estimate how many times an element is repeated, such as the number
of site visits by a user.

Hive query: select count(1) from TestTable where name = 'wangmeng';

3. Heavy hitters (top-k elements): such as the top 100 shops.

Hive query: select count(1), name from TestTable group by name; (this also needs a UDF)

4. Range query: for example, find the number of users aged between 20 and 30.

Hive query: select count(1) from TestTable where age > 20 and age < 30;

5. Membership query: for example, is a given user name already registered?

Given how Hive executes these queries, they consume a large amount of memory and take a
long time.

However, in many cases we do not need exact results, and a small error can be tolerated.
In such cases, approximate processing can greatly improve time and space efficiency.

Based on some theoretical analysis materials, I would very much like to work on these new
features, if possible.

I am familiar with Hive and Hadoop, and I have implemented an efficient storage format
based on Hive (https://github.com/sjtufighter/----Data---Storage--).

So, is there anything I can do? Many thanks.



> big data approximate processing  at a very  low cost  based on hive sql 
> ------------------------------------------------------------------------
>
>                 Key: HIVE-7296
>                 URL: https://issues.apache.org/jira/browse/HIVE-7296
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: wangmeng



--
This message was sent by Atlassian JIRA
(v6.2#6252)
