madlib-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From mktal <...@git.apache.org>
Subject [GitHub] incubator-madlib pull request: Path: Return results for each match
Date Thu, 24 Mar 2016 18:43:12 GMT
Github user mktal commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/29#discussion_r57367421
  
    --- Diff: src/ports/postgres/modules/utilities/path.py_in ---
    @@ -118,140 +120,175 @@ def path(schema_madlib, source_table, output_table, partition_expr,
             #   string produced by concatenating the symbols. The exact rows that
             #   produce the match are identified by correlating the matched string
             #   indices with another array containing row ids.
    -        #
    -        #   matched_partitions: For each partition (group), concatenate all symbols
    -        #       into a single string (sym_str). Keep corresponding ids in an array in
the
    -        #       same order as the symbols. This is performed only for partitions
    -        #       that contain a match.
    -        #   build_multiple_matched_rows:
    -        #       q1: Split sym_str into an array containing the lengths of the
    -        #           strings between the matches.
    -        #       q2: Store lengths of matches into an array
    -        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    -        #           Also right shift the matches array.
    -        #       q4: Compute the cumulative sum of the arrays.
    +
    +        # matched_partitions: For each partition, concatenate all symbols
    +        #       into a single string (sym_str). Keep corresponding ids in an
    +        #       array (match_to_row_id) in the same order as the symbols.
    +        #       This is performed only for partitions that contain a match.
    +        match_id_name = "__madlib_path_match_id__" if "match_id" in all_input_cols else
"match_id"
    +        symbol_name = "__madlib_path_symbol__" if "symbol" in all_input_cols else "symbol"
             plpy.execute("""
                 CREATE TEMP TABLE {matched_partitions} AS
                     SELECT
                         {p_col_name_str},
    -                    array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}),
'') as sym_str,
    -                    array_agg({id_col_name} ORDER BY {order_expr}) as matched_ids
    +                    array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}),
'') as sym_str,
    +                    array_agg({id_col_name} ORDER BY {order_expr}) as {match_to_row_id}
                     FROM {input_with_id}
    +                WHERE {short_sym_name_str} is NOT NULL
                     GROUP BY {p_col_name_str}
    -                HAVING array_to_string(array_agg({symbol_name_str} ORDER BY {order_expr}),
'') ~* '{pattern_expr}'
    +                HAVING array_to_string(array_agg({short_sym_name_str} ORDER BY {order_expr}),
'')
    +                                ~* '{new_pattern_expr}'
                 """.format(**locals()))
    +
    +        # length_of_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of each match.
    +        #       - output these lengths in an array (matches), along with
    +        #         an array of corresponding rank of each match (match_indices).
    +        length_of_matches = unique_string("match_length_view")
    +        plpy.execute("""
    +            CREATE VIEW {length_of_matches} AS
    +            SELECT
    +                {p_col_name_str},
    +                {match_to_row_id},
    +                array_agg(length(matches) ORDER BY match_index) AS matches,
    +                array_agg(match_index ORDER BY match_index) AS match_indices
    +            FROM (
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    generate_series(1, num_matches) AS match_index,
    +                    (regexp_matches(
    +                           sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        sym_str,
    +                        count(matches) AS num_matches
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            sym_str,
    +                            (regexp_matches(
    +                                   sym_str, '(?i)({new_pattern_expr})', 'g'))[1] AS matches
    +                        FROM
    +                           {matched_partitions}
    +                        ) t1
    +                    GROUP BY {p_col_name_str}, {match_to_row_id}, sym_str
    +                    ) t2
    +                ) subq2
    +            GROUP BY {p_col_name_str}, {match_to_row_id}
    +            """.format(**locals()))
    +
    +        # length_of_between_matches: For each partition in matched_partitions:
    +        #       - find all matches and compute the lengths of text between the matches.
    +        #       - output these lengths in an array (match_splits)
    +        length_of_between_matches = unique_string('match_split_view')
    +        plpy.execute("""
    +                CREATE VIEW {length_of_between_matches} AS
    +                SELECT
    +                    {p_col_name_str},
    +                    {match_to_row_id},
    +                    array_agg(length(match_splits) ORDER BY match_index) AS match_splits
    +                FROM (
    +                    SELECT
    +                        {p_col_name_str},
    +                        {match_to_row_id},
    +                        generate_series(1, ARRAY_UPPER(match_splits, 1)) AS match_index,
    +                        unnest(match_splits) AS match_splits
    +                    FROM (
    +                        SELECT
    +                            {p_col_name_str},
    +                            {match_to_row_id},
    +                            regexp_split_to_array(sym_str, '(?i){new_pattern_expr}')
AS match_splits
    +                        FROM
    +                            {matched_partitions}
    +                        ) ssubq1
    +                    ) subq1
    +                GROUP BY {p_col_name_str}, {match_to_row_id}
    +                """.format(**locals()))
    +        #   build_multiple_matched_rows:
    +        #       q1: Lengths of the strings between the matches (length_of_between_matches)
    +        #       q2: Lengths of matches (length_of_matches)
    +        #       q3: Merge q1 and q2 and unnest the arrays (ensuring same length).
    +        #           Since q2 represents the following match to each split in q1,
    +        #           right shifting q2 (prev_matches) gives the preceeding match to each
split.
    +        #       q4: Compute the cumulative sum of the arrays.
    +
             build_multiple_matched_rows = """
                 CREATE {table_or_view} {matched_rows} AS
    -            SELECT {all_input_cols_str}
    +            SELECT {all_input_cols_str},
    +                   {long_sym_name_str} AS {symbol_name},
    +                   {match_id} AS {match_id_name}
                 FROM
                     {input_with_id} as source,
                     (
    +                    -- get the actual row ids using the start/end values
                         SELECT
    -                        unnest(matched_ids[l:r]) AS matched_ids
    +                        unnest({match_to_row_id}[s:e]) AS {match_to_row_id},
    +                        match_index AS {match_id}
                         FROM
                         (
    +                     -- unnest to get tuples where each tuple gives the start/end of
a match
                          SELECT
    -                        matched_ids,
    -                        unnest(left_range) AS l,
    -                        unnest(right_range) AS r
    +                        {match_to_row_id},
    +                        unnest(match_start) AS s,
    +                        unnest(match_end) AS e,
    +                        unnest(match_indices) as match_index
                          FROM (
    +                        -- use cumulative sum to get the start and end of each match
    +                        -- (counting from start of symbol string)
                             SELECT
                                 {p_col_name_str},
    -                            matched_ids,
    -                            {m}.array_cum_sum({m}.array_add(match_splits, prev_matches))
AS left_range,
    -                            {m}.array_cum_sum({m}.array_add(match_splits, matches)) AS
right_range
    +                            {match_to_row_id},
    +                            {m}.array_cum_sum({m}.array_add(match_splits, prev_matches))
AS match_start,
    +                            {m}.array_cum_sum({m}.array_add(match_splits, matches)) AS
match_end,
    --- End diff --
    
    +1


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

Mime
View raw message