spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From JB Data31 <jbdat...@gmail.com>
Subject Re: Masking username in Spark with regexp_replace and reverse functions
Date Sun, 17 Mar 2019 09:32:06 GMT
Hi,

Why don't add a random regexp in regexp substitution, i.e.
https://onlinerandomtools.com/generate-random-data-from-regexp

@*JB*Δ <http://jbigdata.fr/jbigdata/index.html>



Le sam. 16 mars 2019 à 18:39, Mich Talebzadeh <mich.talebzadeh@gmail.com> a
écrit :

> Hi,
>
> I am looking at Description column of a bank statement (CSV download) that
> has the following format
>
> scala> account_table.printSchema
> root
>  |-- TransactionDate: date (nullable = true)
>  |-- TransactionType: string (nullable = true)
>  |-- Description: string (nullable = true)
>  |-- Value: double (nullable = true)
>  |-- Balance: double (nullable = true)
>  |-- AccountName: string (nullable = true)
>  |-- AccountNumber: string (nullable = true)
>
> The column description for BACS payments contains the name of the
> individual who paid into the third party account. I need to mask the name
> but cannot simply use a literal as below for all contents of descriptions
> column!
>
> f1.withColumn("Description", lit("*** Masked
> ***")).select('Description.as("Who paid")
>
> So I try the following combination
>
> f1.select(trim(substring(substring_index('Description, ",",
> 1),2,50)).as("name in clear"),
> reverse(regexp_replace(regexp_replace(regexp_replace(substring(regexp_replace('Description,
> "^['A-Z]", "XX"),2,6),"[A-F]","X")," ","X"),"[,]","R")).as("Masked")).show
> +------------------+------+
> |          in clear|Masked|
> +------------------+------+
> |       FATAH SABAH|HXTXXX|
> |       C HIGGINSON|GIHXXX|
> |           SOLTA A|XTLOSX|
> +------------------+------+
>
> This seems to work as it not only masks the name but also makes it
> consistent for all names (in other words, the same username gets the same
> mask).
>
> Are there any better alternatives?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>

Mime
View raw message