hive-user mailing list archives

From Ramasubramanian Narayanan <ramasubramanian.naraya...@gmail.com>
Subject HIVE or PIG - For building DQ framework
Date Thu, 06 Jul 2017 10:31:43 GMT
Hi All,

Please help me with the questions below.

*Use Case :*
I am trying to develop a framework for data profiling and data quality (DQ) checks.
The data is stored in Hive tables in RCFile format.
No joins are involved; I am only considering DQ checks that can be done on a single table.

*Need suggestion :*
I am thinking of using either Pig or Hive for data quality checks and
profiling, and would like your suggestions. I have listed a few high-level
points that came to mind.

*Performance *:
- Will Hive or Pig perform better? In Pig, I can load the data set into a
relation (variable) once and perform many operations on it. Will that
improve performance?
- In Hive, I can put almost 70% of the checks into a single query, such as
null count, row count, distinct count, duplicate count (total count -
distinct count), length, etc. Even in this case, will Pig or Hive perform better?
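
To make the single-query idea above concrete, here is a minimal sketch of what such a combined check might look like. It is only an illustration: SQLite (via Python's sqlite3 module) stands in for Hive so the example can run anywhere, and the table name `customers` and column name `name` are hypothetical. The aggregate expressions themselves are plain SQL of the kind Hive also accepts, but the query has not been run against Hive or against RCFile data.

```python
import sqlite3

# Hypothetical table/column names; SQLite is used here only as a stand-in for Hive.
# All checks are computed in one pass over the table.
PROFILE_SQL = """
SELECT
    COUNT(*)                                      AS total_count,
    SUM(CASE WHEN name IS NULL THEN 1 ELSE 0 END) AS null_count,
    COUNT(DISTINCT name)                          AS distinct_count,
    COUNT(*) - COUNT(DISTINCT name)               AS duplicate_count,
    MAX(LENGTH(name))                             AS max_length
FROM customers
"""

KEYS = ["total_count", "null_count", "distinct_count",
        "duplicate_count", "max_length"]


def profile(values):
    """Load values into an in-memory table and run all checks in one query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE customers (name TEXT)")
        conn.executemany("INSERT INTO customers VALUES (?)",
                         [(v,) for v in values])
        row = conn.execute(PROFILE_SQL).fetchone()
    finally:
        conn.close()
    return dict(zip(KEYS, row))


if __name__ == "__main__":
    print(profile(["alice", "bob", "alice", None, "carol"]))
```

Note that COUNT(DISTINCT ...) ignores NULLs, so with the "total count - distinct count" definition used above, rows with a NULL value are included in the duplicate count. Whether that is the desired behaviour is a design choice for the framework.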

*Coding *:
- Though Hive is easier to code in than Pig, which one is more suitable for
performing data quality checks and profiling?

*Open source tools:*
- Please suggest any open source tools, built on Java or other
technologies, that can be integrated with Hadoop without any installation.



regards,
Rams
