From: "Matthias Boehm"
To: dev@systemml.incubator.apache.org
Reply-To: dev@systemml.incubator.apache.org
Subject: Re: Fw: Questions/query about recode / transform in systemML
Date: Tue, 24 May 2016 00:18:35 -0700
Message-Id: <20160524071844.7721BBE03E@b03ledav005.gho.boulder.ibm.com>
In-Reply-To: <20160524052329.F406B6A054@b03ledav003.gho.boulder.ibm.com>
Mailing-List: contact dev-help@systemml.incubator.apache.org; run by ezmlm
Archived-At: Tue, 24 May 2016 07:18:54 -0000

Thanks for the question, Alok. A couple of comments:

Q1) Independent of the number of columns, we will always do two passes, one to compute the recode maps (distinct values) and one to apply the recode maps to your data.
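
A minimal sketch of the two-pass idea in Python, not SystemML's actual code: the function names, the 1-based code assignment, and the in-memory row list are all assumptions made for illustration. The point is that one scan can update the recode maps of every recoded column at once, so the number of passes stays at two regardless of the column count.

```python
from typing import Dict, List, Sequence

Row = Sequence[str]


def build_recode_maps(rows: List[Row], recode_cols: List[int]) -> Dict[int, Dict[str, int]]:
    """Pass 1: one scan that collects distinct values for ALL recoded columns."""
    maps: Dict[int, Dict[str, int]] = {c: {} for c in recode_cols}
    for row in rows:
        for c in recode_cols:
            value = row[c].strip()
            # Assign the next 1-based code on first sight (ID order is an assumption).
            maps[c].setdefault(value, len(maps[c]) + 1)
    return maps


def apply_recode_maps(rows: List[Row], recode_cols: List[int],
                      maps: Dict[int, Dict[str, int]]) -> List[list]:
    """Pass 2: one scan that replaces each recoded value by its code."""
    encoded = []
    for row in rows:
        out = list(row)
        for c in recode_cols:
            out[c] = maps[c][row[c].strip()]
        encoded.append(out)
    return encoded


# Example data taken from the question quoted further down in this thread.
rows = [
    ("1", "sanJose", "CA"),
    ("2", "santaClara", "CA"),
    ("3", "sanJose", "CA"),
    ("4", "alameda", "CA"),
    ("5", "minnepolis", "MN"),
]
maps = build_recode_maps(rows, recode_cols=[1, 2])   # scan 1
encoded = apply_recode_maps(rows, [1, 2], maps)      # scan 2
```

Adding a third recoded column would only add work inside each scan; the data would still be read twice.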

Q2) For the distributed case, we currently have only broadcast-based transform apply operators. This means it will run out of memory or fail with errors if the recode maps do not fit into MR tasks or Spark's broadcast buffers (2GB, because recode maps are not partitioned). However, note that we're currently in the process of adding native support for frames (see SYSTEMML-554). As part of it, we'll also change transform to exploit the distributed frame representations (SYSTEMML-569), which will already remove some of the existing restrictions. Further fully distributed transform operators are certainly possible too (via join-based plans).
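
To illustrate the difference between the two strategies, here is a rough single-process sketch in Python. It is not SystemML, MR, or Spark code; `broadcast_apply` and `join_apply` are hypothetical names, and plain lists stand in for distributed partitions. The broadcast variant needs the whole recode map in each task's memory, while the join variant only streams two inputs sorted by value, which is why a join-based plan could in principle handle maps that do not fit on one node.

```python
from typing import Dict, List, Tuple


def broadcast_apply(partition: List[str], recode_map: Dict[str, int]) -> List[int]:
    """Broadcast-style apply: every task holds the ENTIRE recode map in memory."""
    return [recode_map[v] for v in partition]


def join_apply(values: List[Tuple[int, str]],
               map_entries: List[Tuple[str, int]]) -> List[Tuple[int, int]]:
    """Join-style apply: sort both sides by value and merge them as streams.

    values holds (row_id, value) pairs; map_entries holds distinct (value, code) pairs.
    """
    by_value = sorted(values, key=lambda rv: rv[1])
    entries = iter(sorted(map_entries))
    out = []
    key, code = next(entries, (None, None))
    for row_id, value in by_value:
        # Advance the map side until it reaches the current data value.
        while key is not None and key < value:
            key, code = next(entries, (None, None))
        if key != value:
            raise KeyError(value)
        out.append((row_id, code))
    return sorted(out)  # restore the original row order


column = ["sanJose", "santaClara", "sanJose", "alameda", "minnepolis"]
recode_map = {"sanJose": 1, "santaClara": 2, "alameda": 3, "minnepolis": 4}

codes_broadcast = broadcast_apply(column, recode_map)
codes_join = [code for _, code in
              join_apply(list(enumerate(column)), list(recode_map.items()))]
```

Both variants produce the same codes; they differ only in what each task has to keep in memory.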

Regards,
Matthias


From: Alok Singh/San Francisco/IBM@IBMUS
To: dev@systemml.incubator.apache.org
Date: 05/23/2016 10:32 PM
Subject: Fw: Questions/query about recode / transform in systemML





Hi,

Sending this to the dev list as per Matthias's suggestion.

Alok

----- Forwarded by Alok Singh/San Francisco/IBM on 05/23/2016 10:04 PM -----

From: Matthias Boehm/Almaden/IBM
To: Alok Singh/San Francisco/IBM@IBMUS
Cc: Arvind Surve/San Jose/IBM@IBMUS
Date: 05/23/2016 09:02 PM
Subject: Re: Questions/query about recode / transform in systemML

Hi Alok,

would you mind posting this question on our dev mailing list so that
other people also benefit from it? Thanks.

Regards,
Matthias



From: Alok Singh/San Francisco/IBM
To: Matthias Boehm/Almaden/IBM@IBMUS, Arvind Surve/San Jose/IBM@IBMUS
Date: 05/23/2016 07:19 PM
Subject: Questions/query about recode / transform in systemML

Hi Matthias and Arvind.

I have some questions about the internals of SystemML transform and how
the scans over the data happen.


Question 1

Let's consider an example dataframe as follows (the first line is the schema):

userID, county, state
================
1, sanJose, CA
2, santaClara, CA
3, sanJose, CA
4, alameda, CA
5, minnepolis, MN


We can see that the unique values for county are {sanJose, santaClara, alameda, minnepolis}
and for state are {CA, MN}.
Following the example spec in the docs at
http://apache.github.io/incubator-systemml/files/dml-language-reference/data.spec.json

suppose the user passes in the spec file as
"recode": ["county", "state"]

Then the question is how many passes SystemML will make over the dataframe,
i.e., in general the recode algorithm would be:

for column in columns:
    step 1) find the unique values of the column
    step 2) apply the recode values to the column


So does this mean we would need 2*count(columns) passes over the dataframe?

If not, how does SystemML internally avoid doing 2*count(columns) passes?

Question 2

Let's consider another dataframe as follows (the first line is the schema):

random_string
===========
col1
dsfsdf
xcvxcv
sdf
etc
foo
Dummy

We can see that the number of unique values for this dataframe will be almost
the same as the number of rows.
What if the number of rows is 10 trillion and the number of unique values for
column random_string is also 10 trillion?
In that case, the full set of unique values will not fit on one node. How does
SystemML handle that case?
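
As a back-of-the-envelope check of this scenario, the following Python snippet estimates the size of such a recode map. The average string size and code size are pure assumptions chosen for illustration, not measured SystemML numbers; the 2GB figure is the broadcast limit mentioned in the reply at the top of this thread.

```python
# All sizes below are illustrative assumptions, not measured SystemML numbers.
DISTINCT_VALUES = 10 * 10**12          # 10 trillion distinct strings
AVG_KEY_BYTES = 16                     # assumed average string size
CODE_BYTES = 8                         # assumed size of one recode ID
BROADCAST_LIMIT_BYTES = 2 * 1024**3    # the 2GB broadcast limit from the reply

map_bytes = DISTINCT_VALUES * (AVG_KEY_BYTES + CODE_BYTES)
ratio = map_bytes / BROADCAST_LIMIT_BYTES

print(f"recode map: ~{map_bytes / 1024**4:.0f} TiB")
print(f"exceeds the 2GB limit by a factor of ~{ratio:,.0f}")
```

Even with these modest per-entry sizes, the map is several orders of magnitude larger than what a broadcast-based apply could hold.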

Thanks for the inputs
Alok




