From: Sanjay Subramanian
To: user@hive.apache.org
Date: Mon, 20 Apr 2015 17:41:22 +0000 (UTC)
Subject: Using Hive as a file comparison and grep-ping tool
hey guys

As data wranglers and programmers we often need quick tools. One such tool I need almost every day is one that greps a file based on the contents of another file. One could write this in Perl or Python, but since I am already using the Hadoop ecosystem extensively, I thought: why not do it in Hive?
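For small local files, this "grep one file by the contents of another" job is also a grep one-liner; a quick sketch with made-up sample files (not part of the Hive workflow below):

```shell
# Two throwaway sample files (hypothetical data, for illustration only)
printf '10.0.0.1\n10.0.0.2\n' > /tmp/registered.txt
printf '10.0.0.2\n10.0.0.3\n' > /tmp/visitors.txt

# -F: fixed strings, -x: whole-line match, -f: read patterns from a file.
# Prints the lines of visitors.txt that appear verbatim in registered.txt.
grep -F -x -f /tmp/registered.txt /tmp/visitors.txt
# prints: 10.0.0.2
```

This stops scaling once the pattern file no longer fits comfortably in memory, which is where the Hive approach earns its keep.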

Perhaps you guys already know this and have better solutions... nevertheless :-) here goes...

Best regards

sanjay
(Hive super-fan)

I just posted this on my blog:
https://bigdatalatte.wordpress.com/2015/04/20/using-hive-as-a-file-comparison-and-grep-ping-tool/

In case the blog URL does not work for any reason, here is the logic

Using Hive as a file comparison and grep-ping tool
==================================================
1. Log on to the Linux terminal from which you run Hive queries

2. Create a database called "myutils" in Hive
   Create two Hive tables, myutils.file1 and myutils.file2
   - each of these tables will have a partition column called "fn" ----> fn is short for "filename"
   - each of these tables will have just one data column called "ln" ----> ln is short for "line"
   An easy script to do that would be as follows

    for r in 1 2 ; do hive -e "CREATE DATABASE IF NOT EXISTS myutils; USE myutils; DROP TABLE IF EXISTS file${r}; CREATE EXTERNAL TABLE IF NOT EXISTS file${r}(ln STRING) PARTITIONED BY (fn STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';"; done
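Unrolled, the loop issues the same DDL for each table; shown here for file1 only:

```sql
CREATE DATABASE IF NOT EXISTS myutils;
USE myutils;
DROP TABLE IF EXISTS file1;
CREATE EXTERNAL TABLE IF NOT EXISTS file1 (ln STRING)
PARTITIONED BY (fn STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```

EXTERNAL matters here: dropping the table later removes only the metadata, not the files sitting in HDFS.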

3. Create a permanent base location folder in HDFS
   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/
   hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/

USECASE 1:
===========
Search whether a bunch of IP addresses exists in another file containing a (larger) bunch of IPs

[1] registeredIPs.txt
    10.456.34.90
    123.675.654.1
    21.87.657.456
    234.109.34.234

    visitorIPs.txt
    10.456.34.90
    12.367.54.23
    218.7.657.456
    23.4.109.3
[2] Output which IPs in File1 are present in File2
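For reference, the same "which IPs appear in both files" question can be answered locally with coreutils when the files are small (hypothetical sample data; `comm` requires sorted input):

```shell
# Hypothetical sample files, just to show the shape of the output
printf '1.1.1.1\n2.2.2.2\n3.3.3.3\n' > /tmp/registeredIPs.txt
printf '4.4.4.4\n2.2.2.2\n' > /tmp/visitorIPs.txt

# comm needs sorted input; -12 suppresses the lines unique to either file,
# leaving only the lines common to both.
sort /tmp/registeredIPs.txt > /tmp/registered.sorted
sort /tmp/visitorIPs.txt > /tmp/visitors.sorted
comm -12 /tmp/registered.sorted /tmp/visitors.sorted
# prints: 2.2.2.2
```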

[3] Put each file in a separate HDFS location

    hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file1/registeredIPs.txt
    hdfs dfs -mkdir -p /workspace/myutils/filecomparator/file2/visitorIPs.txt

    hdfs dfs -put registeredIPs.txt /workspace/myutils/filecomparator/file1/registeredIPs.txt
    hdfs dfs -put visitorIPs.txt /workspace/myutils/filecomparator/file2/visitorIPs.txt

[4] Add a partition to myutils.file1 and myutils.file2
    For simplicity keep the partition names identical to the file names themselves

    hive -e "USE myutils; ALTER TABLE file1 ADD PARTITION(fn='registeredIPs.txt') LOCATION '/workspace/myutils/filecomparator/file1/registeredIPs.txt'"

    hive -e "USE myutils; ALTER TABLE file2 ADD PARTITION(fn='visitorIPs.txt') LOCATION '/workspace/myutils/filecomparator/file2/visitorIPs.txt'"
    
[5] Check that the partitions can be accessed by Hive

    # This should give you the same answer as
    # wc -l registeredIPs.txt
    hive -e "SELECT count(*) FROM myutils.file1 WHERE fn='registeredIPs.txt'"

    # This should give you the same answer as
    # wc -l visitorIPs.txt
    hive -e "SELECT count(*) FROM myutils.file2 WHERE fn='visitorIPs.txt'"

[6] Find the IPs in registeredIPs.txt that are also in visitorIPs.txt

# This dumps to the local file system
hive -e "SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)" > ./registered_in_visitors_list.txt

# This dumps to a new "internally-managed-by-Hive" table
# Make sure you don't already have a valuable Hive table called "myutils.registered_in_visitors_list" - otherwise this will replace that table with the results of this query
hive -e "USE myutils; DROP TABLE IF EXISTS registered_in_visitors_list; CREATE TABLE IF NOT EXISTS registered_in_visitors_list AS SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"

# This dumps to a directory on HDFS
# Make sure you don't already have a valuable directory called "registered_in_visitors_list" - otherwise this will overwrite that directory and all its contents with the results of this query
hive -e "INSERT OVERWRITE DIRECTORY '/workspace/myutils/filecomparator/registered_in_visitors_list' SELECT f1.ln FROM (SELECT ln FROM myutils.file1 WHERE fn='registeredIPs.txt') f1 JOIN (SELECT ln FROM myutils.file2 WHERE fn='visitorIPs.txt') f2 ON trim(f1.ln) = trim(f2.ln)"
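If you want to sanity-check the join logic without a cluster, the trim-and-match semantics of the query can be mimicked locally with awk (hypothetical sample files; a quick stand-in, not part of the Hive workflow):

```shell
# Hypothetical sample data; note the stray whitespace around the first IP
printf ' 1.1.1.1 \n2.2.2.2\n' > /tmp/reg.txt
printf '1.1.1.1\n3.3.3.3\n' > /tmp/vis.txt

# First pass (visitors): remember each trimmed line.
# Second pass (registered): print lines whose trimmed form was seen,
# mirroring ON trim(f1.ln) = trim(f2.ln) in the Hive query.
awk '{ t = $0; gsub(/^[ \t]+|[ \t]+$/, "", t) }
     NR == FNR { seen[t] = 1; next }
     t in seen' /tmp/vis.txt /tmp/reg.txt
```

Like `SELECT f1.ln`, this prints the registered-side line untrimmed, so the match for ` 1.1.1.1 ` keeps its surrounding whitespace.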


 
 