From: iyerr3
To: dev@madlib.incubator.apache.org
Reply-To: dev@madlib.incubator.apache.org
Subject: [GitHub] incubator-madlib pull request: Path: Return results for each match
Content-Type: text/plain
Message-Id: <20160324230158.94BCDDFB79@git1-us-west.apache.org>
Date: Thu, 24 Mar 2016 23:01:58 +0000 (UTC)

Github user iyerr3 commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57401060

    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -118,140 +120,175 @@ def path(schema_madlib, source_table, output_table, partition_expr,
         # string produced by concatenating the symbols. The exact rows that
         # produce the match are identified by correlating the matched string
         # indices with another array containing row ids.
    -    #
    -    # matched_partitions: For each partition (group), concatenate all symbols
    -    # into a single string (sym_str). Keep corresponding ids in an array in the
    -    # same order as the symbols. This is performed only for partitions
    -    # that contain a match.
    -    # build_multiple_matched_rows:
    -    #   q1: Split sym_str into an array containing the lengths of the
    -    #       strings between the matches.
    -    #   q2: Store lengths of matches into an array
    -    #   q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    -    #       Also right shift the matches array.
    -    #   q4: Compute the cumulative sum of the arrays.
    +
    +    # matched_partitions: For each partition, concatenate all symbols
    +    # into a single string (sym_str). Keep corresponding ids in an
    +    # array (match_to_row_id) in the same order as the symbols.
    +    # This is performed only for partitions that contain a match.
    +    match_id_name = "__madlib_path_match_id__" if "match_id" in all_input_cols else "match_id"
    +    symbol_name = "__madlib_path_symbol__" if "symbol" in all_input_cols else "symbol"
         plpy.execute("""
             CREATE TEMP TABLE {matched_partitions} AS
             SELECT
                 {p_col_name_str},
    -            array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') as sym_str,
    -            array_agg({id_col_name} ORDER BY {order_expr}) as matched_ids
    +            array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '') as sym_str,
    +            array_agg({id_col_name} ORDER BY {order_expr}) as {match_to_row_id}
             FROM {input_with_id}
    +        WHERE {short_sym_name_str} is NOT NULL
             GROUP BY {p_col_name_str}
    -        HAVING array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}), '') ~* '{pattern_expr}'
    +        HAVING array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}), '')
    +            ~* '{new_pattern_expr}'
             """.format(**locals()))
    +
    +    # length_of_matches: For each partition in matched_partitions:
    +    #   - find all matches and compute the lengths of each match.
    +    #   - output these lengths in an array (matches), along with
    +    #     an array of corresponding rank of each match (match_indices).
    +    length_of_matches = unique_string("match_length_view")
    +    plpy.execute("""
    +        CREATE VIEW {length_of_matches} AS
    +        SELECT
    +            {p_col_name_str},
    +            {match_to_row_id},
    +            array_agg(length(matches) ORDER BY match_index) AS matches,
    +            array_agg(match_index ORDER BY match_index) AS match_indices
    +        FROM (
    +            SELECT
    +                {p_col_name_str},
    +                {match_to_row_id},
    --- End diff --

    yep - we don't need both views, since only 1 is being used downstream.
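
For anyone following the thread, the approach described in the diff comments (concatenate a partition's symbols into sym_str, match the pattern, then correlate the match positions with the row-id array) can be sketched in plain Python. This is only an illustration, not the MADlib implementation: the helper name match_rows and the sample rows are hypothetical, and it assumes every symbol is a single character, so that an index into sym_str is also an index into the row-id array. re.IGNORECASE is used to mirror the case-insensitive ~* operator in the SQL.

```python
import re


def match_rows(rows, pattern):
    """Return (row_ids, match_length) for each non-overlapping match.

    rows: list of (row_id, symbol) pairs for ONE partition, already sorted
    by the ordering expression. Hypothetical helper for illustration only;
    assumes each symbol is exactly one character.
    """
    # Drop rows without a symbol, like "WHERE ... is NOT NULL" in the diff.
    kept = [(row_id, sym) for row_id, sym in rows if sym is not None]

    # sym_str and match_to_row_id are built in the same order, so position i
    # in the string corresponds to match_to_row_id[i].
    sym_str = "".join(sym for _, sym in kept)
    match_to_row_id = [row_id for row_id, _ in kept]

    results = []
    for m in re.finditer(pattern, sym_str, flags=re.IGNORECASE):
        start, end = m.span()
        if end > start:  # ignore empty matches
            results.append((match_to_row_id[start:end], end - start))
    return results


rows = [(10, "A"), (11, "B"), (12, None), (13, "B"),
        (14, "C"), (15, "A"), (16, "B")]
print(match_rows(rows, "AB+"))
# prints [([10, 11, 13], 3), ([15, 16], 2)]
```

Each tuple pairs the row ids behind one match with that match's length, which is roughly the information the length_of_matches view carries in its matches and match_indices arrays.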