From user-return-77541-archive-asf-public=cust-asf.ponee.io@spark.apache.org Sun Mar 17 09:59:12 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by mx-eu-01.ponee.io (Postfix) with SMTP id 6AEDF180629 for ; Sun, 17 Mar 2019 10:59:11 +0100 (CET) Received: (qmail 17945 invoked by uid 500); 17 Mar 2019 09:59:09 -0000 Mailing-List: contact user-help@spark.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list user@spark.apache.org Received: (qmail 17934 invoked by uid 99); 17 Mar 2019 09:59:09 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 17 Mar 2019 09:59:09 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id ACAF0C273D for ; Sun, 17 Mar 2019 09:59:08 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 1.8 X-Spam-Level: * X-Spam-Status: No, score=1.8 tagged_above=-999 required=6.31 tests=[DKIMWL_WL_MED=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, HTML_MESSAGE=2, MIME_QP_LONG_LINE=0.001, RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id JUBL4uUwOK4Z for ; Sun, 17 Mar 2019 09:59:06 +0000 (UTC) Received: from mail-wm1-f45.google.com (mail-wm1-f45.google.com [209.85.128.45]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTPS id 3564A5F54E for ; Sun, 17 Mar 2019 09:50:53 +0000 (UTC) Received: by mail-wm1-f45.google.com with SMTP id u10so7286511wmj.5 for ; Sun, 17 Mar 2019 02:50:53 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=X+6Je3rRoAg5Ies4GiHmkBHKMQrX0YwaIMA3hDfgmr0=; b=DRGVdalQqR5VrZASgq/b/JFUSQwZRWlJVtl9HBTPmuq735dVTdATTV3nEOr/IYYVJ9 KvIfOdeXUW2e8YrZu1Ua8o4KNt1IlvB6hozv7RlQSqe3x/No/Ck5FRSvFqZStsKgAaEW B7SUIMoI4X+V/eF4/XIAA3QiYEBRCpxBAJS7caEx3283aSzZB3C17w5VEyITVu6MzLxg XUoG9W6N9b9HHhSRdXtpP5zQRMw5ywb0sH3Z3BXcKr8bTRDGj5V+Lbh/iiYW+1fYGdPN 8JUEQ3mOS3hNg/U0EQ7WfCzJctcitsFif7SY5/r8GBnmPs9C6zharbd6laY4fztOBnbV gNEg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=X+6Je3rRoAg5Ies4GiHmkBHKMQrX0YwaIMA3hDfgmr0=; b=lIJ2DGGdy5Q7sk+FbwPM0Mfv3lPhFBXd8+lWr5TWSk2sAyE7huqlhSiO8YgdVT9RFR o4ZD+vYdiT8Wp0ZxarutYe9gMltmpP6G4BzA2+JQUPhThdP4izR2pkhePeQuBO7JjCr1 lofEiog4dq5/H7kRajg+T5cSwEwBe/5iooPwyQXfkjuoeNedX4r+DVm/GbCHQE7dom+q SeRQEygkiG/lmhRQKNlczk40RpLg4RVX3Li28s39g1tzOej99zWNicejfdl2RMbeUOjN 5YoZ650bYQvSaTBgM6k0Bock8iVJiPHTf60w1BX9L4sBqVO+ce7xCAFr66xmpGumaFW8 ATnw== X-Gm-Message-State: APjAAAURYt66xnjgt1QfftjGYccx2/aigiP5QVBrqMGsq1n89AowGV7i +Y2JOkNQP2q9PJ1T6oDT1hc= X-Google-Smtp-Source: APXvYqw2mkyKvOump/HSt66Bt/21qSDwZ0UAKCfYftBap2h/VqWok/mIMGlDiJxiFp8Cm/Ql5pQkfQ== X-Received: by 2002:a1c:9ed5:: with SMTP id h204mr4108822wme.35.1552816246965; Sun, 17 Mar 2019 02:50:46 -0700 (PDT) Received: from ?IPv6:2a01:598:b005:a1fc:2c87:800b:5522:2033? ([2a01:598:b005:a1fc:2c87:800b:5522:2033]) by smtp.gmail.com with ESMTPSA id d10sm19873522wrh.83.2019.03.17.02.50.45 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sun, 17 Mar 2019 02:50:45 -0700 (PDT) Content-Type: multipart/alternative; boundary=Apple-Mail-FD8FEDE9-2E56-4BF6-8548-2AD5B94BCA17 Mime-Version: 1.0 (1.0) Subject: Re: Masking username in Spark with regexp_replace and reverse functions From: =?utf-8?Q?J=C3=B6rn_Franke?= X-Mailer: iPhone Mail (16D57) In-Reply-To: Date: Sun, 17 Mar 2019 10:50:43 +0100 Cc: "user @spark" Content-Transfer-Encoding: 7bit Message-Id: References: To: Mich Talebzadeh --Apple-Mail-FD8FEDE9-2E56-4BF6-8548-2AD5B94BCA17 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: quoted-printable For the approach below you have to check for collisions, ie different name l= ead to same masked value. You could hash it. However in order to avoid that one can just try different= hashes you need to include in each name a different random factor.=20 However, the anonymization problem is bigger, because based on other fields a= nd correlation, the individual might still be identifiable. > Am 16.03.2019 um 18:39 schrieb Mich Talebzadeh = : >=20 > Hi, >=20 > I am looking at Description column of a bank statement (CSV download) that= has the following format >=20 > scala> account_table.printSchema > root > |-- TransactionDate: date (nullable =3D true) > |-- TransactionType: string (nullable =3D true) > |-- Description: string (nullable =3D true) > |-- Value: double (nullable =3D true) > |-- Balance: double (nullable =3D true) > |-- AccountName: string (nullable =3D true) > |-- AccountNumber: string (nullable =3D true) >=20 > The column description for BACS payments contains the name of the individu= al who paid into the third party account. I need to mask the name but cannot= simply use a literal as below for all contents of descriptions column! >=20 > f1.withColumn("Description", lit("*** Masked ***")).select('Description.as= ("Who paid") >=20 > So I try the following combination >=20 > f1.select(trim(substring(substring_index('Description, ",", 1),2,50)).as("= name in clear"), > reverse(regexp_replace(regexp_replace(regexp_replace(substring(regexp_repl= ace('Description, "^['A-Z]", "XX"),2,6),"[A-F]","X")," ","X"),"[,]","R")).as= ("Masked")).show > +------------------+------+ > | in clear|Masked| > +------------------+------+ > | FATAH SABAH|HXTXXX| > | C HIGGINSON|GIHXXX| > | SOLTA A|XTLOSX| > +------------------+------+ >=20 > This seems to work as it not only masks the name but also makes it consist= ent for all names (in other words, the same username gets the same mask). >=20 > Are there any better alternatives? >=20 > Thanks >=20 > Dr Mich Talebzadeh > =20 > LinkedIn https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2gBxianrbJd6= zP6AcPCCdOABUrV8Pw > =20 > http://talebzadehmich.wordpress.com >=20 > Disclaimer: Use it at your own risk. Any and all responsibility for any lo= ss, damage or destruction of data or any other property which may arise from= relying on this email's technical content is explicitly disclaimed. The aut= hor will in no case be liable for any monetary damages arising from such los= s, damage or destruction. > =20 --Apple-Mail-FD8FEDE9-2E56-4BF6-8548-2AD5B94BCA17 Content-Type: text/html; charset=utf-8 Content-Transfer-Encoding: quoted-printable
For= the approach below you have to check for collisions, ie different name lead= to same masked value.

You c= ould hash it. However in order to avoid that one can just try different hash= es you need to include in each name a different random factor. 

However, the anonymization problem= is bigger, because based on other fields and correlation, the individual mi= ght still be identifiable.

Am 16.03.2019 um 18:39 s= chrieb Mich Talebzadeh <mich= .talebzadeh@gmail.com>:

Hi,

I am looking at Description column of a= bank statement (CSV download) that has the following format

<= /div>
scala&g= t; account_table.printSchema
root
 |-- TransactionDate: date (nul= lable =3D true)
 |-- TransactionType: string (nullable =3D true)
=  |-- Description: string (nullable =3D true)
 |-- Value: double= (nullable =3D true)
 |-- Balance: double (nullable =3D true)
&nb= sp;|-- AccountName: string (nullable =3D true)
 |-- AccountNumber: s= tring (nullable =3D true)


The column descrip= tion for BACS payments contains the name of the individual who paid into the= third party account. I need to mask the name but cannot simply use a litera= l as below for all contents of descriptions column!

f1.withColumn("Descrip= tion", lit("*** Masked ***")).select('Description.as("Who paid")

So I try the fol= lowing combination

f1.select(trim(substring(substring_index('Descrip= tion, ",", 1),2,50)).as("name in clear"),
reverse(regexp_replace(regexp_rep= lace(regexp_replace(substring(regexp_replace('Description, "^['A-Z]", "XX"),= 2,6),"[A-F]","X")," ","X"),"[,]","R")).as("Masked")).show
+------------------+------+
|   =        in clear|Masked|
+------------------= +------+
|       FATAH SABAH|HXTXXX|
|&n= bsp;      C HIGGINSON|GIHXXX|
|   = ;        SOLTA A|XTLOSX|
+------------= ------+------+

This seems to work as it not only masks t= he name but also makes it consistent for all names (in other words, the same= username gets the same mask).

Are there any better= alternatives?

Thanks

Dr Mich Talebzadeh

 

LinkedIn  <= font color=3D"#0000ff">https://www.linkedin.com/profile/view?id=3DAAEAAAAWh2= gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://t= alebzadehmich.wordpress.com


Disclaimer: Use it at y= our own risk. Any and all responsibility for any los= s, damage or destruction of data or any other property which may arise from relying on this email's&n= bsp;technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from s= uch loss, damage or destruction.

 

= --Apple-Mail-FD8FEDE9-2E56-4BF6-8548-2AD5B94BCA17--