hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ajay Chander <hadoopde...@gmail.com>
Subject Re: De-identification_in Hive
Date Thu, 17 Mar 2016 16:02:37 GMT
Thanks for your time Mich! I will try this one out.

On Thursday, March 17, 2016, Mich Talebzadeh <mich.talebzadeh@gmail.com>
wrote:

> Then probably the easiest option would be in INSERT/SELECT from external
> table to target table and make that column NULL
>
> Check the VAT column here that I made it NULL
>
> DROP TABLE IF EXISTS stg_t2;
> CREATE EXTERNAL TABLE stg_t2 (
>  INVOICENUMBER string
> ,PAYMENTDATE string
> ,NET string
> ,VAT string
> ,TOTAL string
> )
> COMMENT 'from csv file from excel sheet xxxx'
> ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
> STORED AS TEXTFILE
> LOCATION '/data/stg/table2'
> TBLPROPERTIES ("skip.header.line.count"="1")
> ;
> --3)
> DROP TABLE IF EXISTS t2;
> CREATE TABLE t2 (
>  INVOICENUMBER          INT
> ,PAYMENTDATE            timestamp
> ,NET                    DECIMAL(20,2)
> ,VAT                    DECIMAL(20,2)
> ,TOTAL                  DECIMAL(20,2)
> )
> COMMENT 'from csv file from excel sheet xxxx'
> CLUSTERED BY (INVOICENUMBER) INTO 256 BUCKETS
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="ZLIB",
> "transactional"="true")
> ;
> --4) Put data in target table. do the conversion and ignore empty rows
> INSERT INTO TABLE t2
> SELECT
>           INVOICENUMBER
>         , CAST(UNIX_TIMESTAMP(paymentdate,'DD/MM/YYYY')*1000 as timestamp)
>         , CAST(REGEXP_REPLACE(net,'[^\\d\\.]','') AS DECIMAL(20,2))
>         , NULL
>         , CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2))
> FROM
> stg_t2
> WHERE
> --        INVOICENUMBER > 0 AND
>         CAST(REGEXP_REPLACE(total,'[^\\d\\.]','') AS DECIMAL(20,2)) > 0.0
> -- Exclude empty rows
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 17 March 2016 at 15:32, Ajay Chander <hadoopdev18@gmail.com
> <javascript:_e(%7B%7D,'cvml','hadoopdev18@gmail.com');>> wrote:
>
>> Mich, I am okay with replacing the columns data with some characters
>> like asterisk. Thanks
>>
>>
>> On Thursday, March 17, 2016, Mich Talebzadeh <mich.talebzadeh@gmail.com
>> <javascript:_e(%7B%7D,'cvml','mich.talebzadeh@gmail.com');>> wrote:
>>
>>> Hi Ajay,
>>>
>>> Do you want to be able to unmask it (at any time) or just have it
>>> totally scrambled (for example replace the column with random characters)
>>> in Hive?
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 17 March 2016 at 15:14, Ajay Chander <hadoopdev18@gmail.com> wrote:
>>>
>>>> Mich thbaks for looking into this. I have a 'csvfile.txt ' on hdfs. I
>>>> have created an external table 'xyz' to load that data into it. One of the
>>>> columns data 'ssn' needs to be masked. Is there any built in function is
>>>> give that I could use?
>>>>
>>>>
>>>> On Thursday, March 17, 2016, Mich Talebzadeh <mich.talebzadeh@gmail.com>
>>>> wrote:
>>>>
>>>>> Are you loading your CSV file from an External table into Hive table.?
>>>>>
>>>>> Basically you want to scramble that column before putting into Hive
>>>>> table?
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>>
>>>>> On 17 March 2016 at 14:37, Ajay Chander <hadoopdev18@gmail.com>
wrote:
>>>>>
>>>>>> Tustin, Is there anyway I can deidentify it in hive ?
>>>>>>
>>>>>>
>>>>>> On Thursday, March 17, 2016, Marcin Tustin <mtustin@handybook.com>
>>>>>> wrote:
>>>>>>
>>>>>>> This is a classic transform-load problem. You'll want to anonymise
>>>>>>> it once before making it available for analysis.
>>>>>>>
>>>>>>> On Thursday, March 17, 2016, Ajay Chander <hadoopdev18@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Everyone,
>>>>>>>>
>>>>>>>> I have a csv.file which has some sensitive data in a particular
>>>>>>>> column in it.  Now I have to create a table in hive and load
the data into
>>>>>>>> it. But when loading the data I have to make sure that the
data is masked.
>>>>>>>> Is there any built in function is used ch supports this or
do I have to
>>>>>>>> write UDF ? Any suggestions are appreciated. Thanks
>>>>>>>
>>>>>>>
>>>>>>> Want to work at Handy? Check out our culture deck and open roles
>>>>>>> <http://www.handy.com/careers>
>>>>>>> Latest news <http://www.handy.com/press> at Handy
>>>>>>> Handy just raised $50m
>>>>>>> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/>
led
>>>>>>> by Fidelity
>>>>>>>
>>>>>>>
>>>>>
>>>
>

Mime
View raw message