Date: Tue, 11 Jul 2017 01:23:02 +0000 (UTC)
From: "Fu Shanshan (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-21359) frequency discretizer


    [ https://issues.apache.org/jira/browse/SPARK-21359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081467#comment-16081467 ]

Fu Shanshan commented on SPARK-21359:
-------------------------------------

But why, in this example:

Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2), (5, 1.0), (6, 9.1), (7, 10.1), (8, 1.1), (9, 16.0), (10, 20.0), (11, 20.0))

QuantileDiscretizer result:

+---+----+------+
| id|hour|result|
+---+----+------+
|  0|18.0|   3.0|
|  1|19.0|   3.0|
|  2| 8.0|   1.0|
|  3| 5.0|   1.0|
|  4| 2.2|   1.0|
|  5| 1.0|   0.0|
|  6| 9.1|   2.0|
|  7|10.1|   2.0|
|  8| 1.1|   0.0|
|  9|16.0|   2.0|
| 10|20.0|   3.0|
| 11|20.0|   3.0|
+---+----+------+

does the number 18 belong to bin 3? I first thought this was because the discretizer makes equal-width bins, so the bin array is (0, 5, 10, 15, 20) and 18 falls in the last bin. By my own calculation, though, 18 should be in bin 2: under the equal-frequency definition the bin array is (-inf, 5.0, 10.1, 19, inf or 20), so 18 lands in bin 2 instead of the last bin.

I am not sure whether I have misunderstood this question. Thank you for your patience.

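To make the comparison above concrete, here is a minimal sketch in plain Python (not the Spark API) that bucketizes the twelve `hour` values under three sets of bin edges. The helper names `quantile_splits` and `bucketize` are hypothetical. Two rules are assumptions made for illustration: each inner split is the sorted value at rank ceil(n * k / numBuckets), and a bucket holds values in [lo, hi) with the last bucket also including its upper edge. Spark computes its splits with an approximate quantile algorithm, so its behaviour on other data may differ. Under these assumptions the splits come out as (-inf, 2.2, 9.1, 18.0, inf), which reproduces the `result` column in the table, while the equal-width edges (0, 5, 10, 15, 20) do not.

```python
import math
from bisect import bisect_right

# The (id, hour) pairs from the comment, hour values only, in id order.
hours = [18.0, 19.0, 8.0, 5.0, 2.2, 1.0, 9.1, 10.1, 1.1, 16.0, 20.0, 20.0]

# The "result" column from the table in the comment, in id order.
table_result = [3, 3, 1, 1, 1, 0, 2, 2, 0, 2, 3, 3]


def quantile_splits(values, num_buckets):
    """Equal-frequency split points (hypothetical helper).

    Assumption: the k-th inner split is the sorted value at
    1-based rank ceil(n * k / num_buckets).
    """
    ordered = sorted(values)
    n = len(ordered)
    inner = [ordered[math.ceil(n * k / num_buckets) - 1]
             for k in range(1, num_buckets)]
    return [-math.inf] + inner + [math.inf]


def bucketize(value, splits):
    """Index of the bucket [lo, hi) containing value (hypothetical helper).

    The last bucket also includes its upper edge. Values below the
    first edge are not handled in this sketch.
    """
    index = bisect_right(splits, value) - 1
    return min(index, len(splits) - 2)


# Equal-frequency edges derived from the data under the rank assumption.
freq_splits = quantile_splits(hours, 4)
freq_result = [bucketize(h, freq_splits) for h in hours]

# Equal-width edges guessed in the comment.
width_splits = [0.0, 5.0, 10.0, 15.0, 20.0]
width_result = [bucketize(h, width_splits) for h in hours]

# Equal-frequency edges proposed in the comment.
proposed_splits = [-math.inf, 5.0, 10.1, 19.0, math.inf]
proposed_result = [bucketize(h, proposed_splits) for h in hours]

print("frequency splits:", freq_splits)
print("matches table (frequency):", freq_result == table_result)
print("matches table (width):", width_result == table_result)
print("bin of 18.0 under proposed edges:", bucketize(18.0, proposed_splits))
```

With lower-inclusive buckets, 18.0 is itself a split point and therefore opens the last bin, which is one way to read the table. The edges proposed in the comment also give an equal-frequency partition under the same bucket rule, but they place 18.0 in bin 2.
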
> frequency discretizer
> ---------------------
>
>                 Key: SPARK-21359
>                 URL: https://issues.apache.org/jira/browse/SPARK-21359
>             Project: Spark
>          Issue Type: New JIRA Project
>          Components: ML
>    Affects Versions: 2.1.1
>            Reporter: Fu Shanshan
>
> Typically data is discretized into partitions of K equal lengths/width (equal intervals) or K% of the total data (equal frequencies)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)