crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <>
Subject [jira] [Commented] (CRUNCH-336) Optimized filters and joins via Parquet RecordFilters
Date Fri, 31 Jan 2014 16:30:10 GMT


Micah Whitacre commented on CRUNCH-336:

CRUNCH-299 was logged to track something similar to "item 1", but by interpreting a FilterFn
as a RecordFilter.  I'm inclined to agree with [~jwills] and [~gabriel.reid] that we should
just expose those options at source creation.

> Optimized filters and joins via Parquet RecordFilters
> -----------------------------------------------------
>                 Key: CRUNCH-336
>                 URL:
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Ryan Brush
> Logging this to track some ideas from an offline discussion with [~jwills] and [~mkwhitacre].
There's an opportunity to significantly speed up a couple access patterns:
> 1. Process only a subset of data from a Parquet file identified by a single column
> 2. Perform a bloom filter join between two datasets, where the joined item is a Parquet
column in the larger data set.
> Optimizing item 1 simply involves using a RecordFilter to narrow down the data loaded
from the AvroParquetInputFormat.
> Optimizing item 2 is more involved. In a nutshell, we discussed doing a bloom filter
join, but using the bloom filter to implement the Parquet RecordFilter on the specific column.
In cases where we join on columns and select only a small subset of the larger dataset,
this would skip the IO and deserialization cost for all items that didn't match the join.
> It's not obvious to me how we'd achieve this cleanly, since it involves multiple pieces
(configuring inputs in conjunction with a specific join strategy). In many cases the bloom
filter join alone will achieve sufficient performance, but I'm logging this potential optimization
for reference.
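To make item 1 concrete: because Parquet stores data column by column, a RecordFilter can
evaluate a predicate against one cheap column and skip assembling the rest of the record
entirely. The sketch below is not the Parquet or Crunch API — the class and field names
(`ColumnarBatch`, `ids`, `payloads`) are hypothetical — it just illustrates why a
single-column predicate avoids the deserialization cost of non-matching rows:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Records stored column-by-column, as in Parquet: each column lives in its own array,
// so one column can be scanned without touching the others.
class ColumnarBatch {
    final String[] ids;       // cheap-to-read column the filter is applied to
    final byte[][] payloads;  // expensive data we want to avoid deserializing

    ColumnarBatch(String[] ids, byte[][] payloads) {
        this.ids = ids;
        this.payloads = payloads;
    }

    /** Scan only the id column; materialize payloads only for matching rows. */
    List<String> select(Predicate<String> idFilter) {
        List<String> out = new ArrayList<>();
        for (int row = 0; row < ids.length; row++) {
            if (idFilter.test(ids[row])) {
                // "Deserialization" happens only for rows that pass the predicate.
                out.add(new String(payloads[row]));
            }
        }
        return out;
    }
}
```

In the real implementation the predicate would be supplied to AvroParquetInputFormat as a
RecordFilter (or, per the comment above, as an option on the source at creation time), so
the skip happens inside the input format rather than in user code.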
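For item 2, the idea is to build a bloom filter from the smaller dataset's join keys and
use it as the column predicate on the larger dataset, so records that cannot possibly join
are skipped before they are deserialized. A minimal self-contained bloom filter sketch
(the hash derivation here is an illustrative placeholder, not what Crunch or Parquet would
ship):

```java
import java.util.BitSet;

/** Minimal Bloom filter over string keys: no false negatives, rare false positives. */
class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    BloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive k bit positions from two base hashes (double-hashing scheme).
    private int position(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }

    /** Called once per join key on the smaller side of the join. */
    void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(key, i));
        }
    }

    /**
     * The predicate a RecordFilter would apply to the join column of the larger
     * dataset: false means "definitely no match, skip the record"; true means
     * "possibly a match, deserialize and let the join decide".
     */
    boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because a bloom filter never yields false negatives, using `mightContain` as the record
filter is safe: every true join match survives the filter, and the occasional false
positive is simply discarded by the join itself.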

This message was sent by Atlassian JIRA
