hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sanjay Subramanian <sanjaysubraman...@yahoo.com>
Subject Using Hive as a file comparison and grep-ping tool
Date Mon, 20 Apr 2015 17:41:22 GMT
hey guys
As data wranglers and programmers we often need quick tools. One such tool I need almost everyday
is one that greps a file based on contents of another file. One can write this in perl, python
but since I am already using hadoop ecosystem extensively, I said why not do this in Hive
? 
Perhaps you guys already know this and have better solutions....nevertheless :-) here goes...

Best regards
sanjay(Hive super-fan)
I just posted this on my bloghttps://bigdatalatte.wordpress.com/2015/04/20/using-hive-as-a-file-comparison-and-grep-ping-tool/
In case the blog URL does not work for any reason, here is the logic
Using Hive as a file comparison and grep-ping tool==================================================1.
Logon to your linux terminal where u run Hive queries from
2. Create a database called "myutils" in Hive   Create two hive tables myutils.file1 and
myutils.file2 in Hive    - each of these tables will have a partition called "fn"    
----> fn is short for "filename"    - each of these tables will have just one column
called "ln" ----> ln is short for "line"   An easy script to help do that would be as
follows     for r in 1 2 ; do hive -e "CREATE DATABASE IF NOT EXISTS myutils; USE myutils;
DROP TABLE IF EXISTS file${r}; CREATE EXTERNAL TABLE IF NOT EXISTS file${r}(ln STRING) PARTITIONED
BY (fn STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"; done
3. Create a permanent base location folder in HDFS   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/ 
 hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/   USECASE 1 : ===========Search
if a bunch of IP addresses exist in another file containing (larger) bunch of IPs
[1] registeredIPs.txt    10.456.34.90    123.675.654.1    21.87.657.456    234.109.34.234 
      visitorIPs.txt    10.456.34.90    12.367.54.23    218.7.657.456    23.4.109.3 
 [2] Output which IPs in File1 are present in File2
[3] Put each file in a separate HDFS location        hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/registeredIPs.txt 
  hdfs dfs -put VisitorIPs.txt  /workspace/myutils/filecomparator/file1/visitorIPs.txt
    hdfs dfs -put registeredIPs.txt  /workspace/myutils/filecomparator/file1/registeredIPs.txt 
  hdfs dfs -put visitorIPs.txt  /workspace/myutils/filecomparator/file1/visitorIPs.txt
[4] Add partition to  myutils.file1    For simplicity keep the partition names identical
to the file names themselves      hive -e "USE myutils; ALTER TABLE file1 ADD PARTITION(ln='registeredIPs.txt')
LOCATION '/workspace/myutils/filecomparator/file1/registeredIPs.txt'"
    hive -e "USE myutils; ALTER TABLE file2 ADD PARTITION(ln='visitorIPs.txt') LOCATION
'/workspace/myutils/filecomparator/file2/visitorIPs.txt'"    [5] Check that partitions
can be accesd by Hive
    # This should give u the same answer as    # wc -l registeredIPs.txt    hive -e
"select count(*) from myutils.file1 where fn='registeredIPs.txt'"
    # This should give u the same answer as    # wc -l visitorIPs.txt    hive -e "select
count(*) from myutils.file2 where fn='visitorIPs.txt'"
[6] Count the number of IPs in registeredIPs.txt that are in visitorIPs.txt
# This dumps to a local file systemhive -e "SELECT f1.ln FROM (SELECT ln FROM utils.file1
WHERE fn='registeredIPs.txt') f1  JOIN (select ln from myutils.file2 WHERE fn='visitorIPs.txt')
f2  ON trim(f1.ln) = trim(f2.ln)" > ./registered_in_visitors_list.txt
# This dumps to a new "internally-managed-by-hive" table # Make sure u already dont have
some valuable hive table called "myutils.registered_in_visitors_list" - else this will overwrite
that hive table with the results of this hive query hive -e "USE myutils; DROP TABLE IF EXITS
registered_in_visitors_list; CREATE TABLE if not exists registered_in_visitors_list AS SELECT
f1.ln FROM (select ln FROM utils.file1 WHERE fn='registeredIPs.txt') f1  JOIN (SELECT ln
FROM myutils.file2 WHERE fn='visitorIPs.txt') f2  ON trim(f1.ln) = trim(f2.ln)"
# This dumps to a directory on HDFS# Make sure u already dont have some valuable directory
called "registered_in_visitors_list" - else this will overwrite that director and all its
contents with the results of this hive query hive -e "INSERT OVERWRITE DIRECTORY '/workspace/myutils/filecomparator/registered_in_visitors_list'
SELECT f1.ln FROM (select ln FROM utils.file1 WHERE fn='registeredIPs.txt') f1  JOIN (SELECT
ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2  ON trim(f1.ln) = trim(f2.ln)"


 
  
Mime
View raw message