crunch-dev mailing list archives

From "Ryan Brush (JIRA)" <j...@apache.org>
Subject [jira] [Created] (CRUNCH-336) Optimized filters and joins via Parquet RecordFilters
Date Fri, 31 Jan 2014 15:58:09 GMT
Ryan Brush created CRUNCH-336:
---------------------------------

             Summary: Optimized filters and joins via Parquet RecordFilters
                 Key: CRUNCH-336
                 URL: https://issues.apache.org/jira/browse/CRUNCH-336
             Project: Crunch
          Issue Type: Improvement
            Reporter: Ryan Brush


Logging this to track some ideas from an offline discussion with [~jwills] and [~mkwhitacre].
There's an opportunity to significantly speed up a couple of access patterns:

1. Process only a subset of data from a Parquet file identified by a single column
2. Perform a bloom filter join between two datasets, where the joined item is a Parquet column
in the larger data set.

Optimizing item 1 simply involves using a RecordFilter to narrow down the data loaded from
the AvroParquetInputFormat.
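
As a rough illustration of why this helps, here is a minimal sketch in plain Java. It does not
use the Parquet API: the two arrays are a hypothetical stand-in for a columnar file, and
ColumnFilterSketch and scan are made-up names. The point is only that the predicate runs
against a single column, and a full record is assembled just for the rows that pass.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a columnar file: one array per column, equal lengths.
final class ColumnFilterSketch {

    // Test the predicate against the filter column only; build a full record
    // (here a String[] of id and payload) just for the rows that match.
    static List<String[]> scan(String[] idColumn, String[] payloadColumn,
                               Predicate<String> idPredicate) {
        List<String[]> records = new ArrayList<>();
        for (int row = 0; row < idColumn.length; row++) {
            if (idPredicate.test(idColumn[row])) {
                // Only matching rows pay the cost of touching the other column.
                records.add(new String[] {idColumn[row], payloadColumn[row]});
            }
        }
        return records;
    }
}
```

In a real job the predicate would be expressed as a Parquet RecordFilter and handed to the
input format, so the skipping happens inside the reader rather than in user code.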

Optimizing item 2 is more involved. In a nutshell, we discussed doing a bloom filter join,
but using the bloom filter to implement the Parquet RecordFilter on the specific column. In
cases where we join on columns and only select a small subset of the larger dataset,
this would skip the IO and deserialization cost for all items that don't match the join.
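
The idea can be sketched in plain Java, under some loud assumptions: this is not Crunch or
Parquet code, the Bloom class is a toy filter written for this example (fixed size, probes
derived from String.hashCode()), and the key/payload arrays are a hypothetical columnar view
of the larger dataset. The filter is built from the smaller dataset's keys and then used as
the predicate on the join column, so non-matching rows are dropped before their payload is
read. An exact lookup afterwards removes any false positives.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

final class BloomJoinSketch {

    // Toy Bloom filter: `probes` bit positions per key via double hashing.
    static final class Bloom {
        private final BitSet bits;
        private final int size;
        private final int probes;

        Bloom(int size, int probes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.probes = probes;
        }

        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = Integer.reverse(h1) | 1; // forced odd so the step is never zero
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String key) {
            for (int i = 0; i < probes; i++) bits.set(index(key, i));
        }

        // May report true for a key that was never added, never false for one that was.
        boolean mightContain(String key) {
            for (int i = 0; i < probes; i++) {
                if (!bits.get(index(key, i))) return false;
            }
            return true;
        }
    }

    // small: key -> value from the smaller dataset.
    // keyColumn / payloadColumn: columns of the larger dataset (hypothetical layout).
    static List<String> join(Map<String, String> small,
                             String[] keyColumn, String[] payloadColumn) {
        Bloom bloom = new Bloom(1 << 16, 3);
        for (String key : small.keySet()) bloom.add(key);

        List<String> joined = new ArrayList<>();
        for (int row = 0; row < keyColumn.length; row++) {
            String key = keyColumn[row];
            if (!bloom.mightContain(key)) continue; // payload column never touched
            String left = small.get(key);           // exact check drops false positives
            if (left != null) joined.add(key + ":" + left + ":" + payloadColumn[row]);
        }
        return joined;
    }
}
```

In the sketch both sides live in memory, so the exact lookup is trivial. In an actual
pipeline the bloom filter would need to be built first and shipped to the tasks reading
the larger dataset.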

It's not obvious to me how we'd achieve this cleanly, since it involves multiple pieces
(configuring the inputs in conjunction with a specific join strategy). In many cases the bloom
filter join alone will achieve sufficient performance, but I'm logging this potential
optimization for reference.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
