crunch-dev mailing list archives

From "Micah Whitacre (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CRUNCH-336) Optimized filters and joins via Parquet RecordFilters
Date Fri, 31 Jan 2014 16:30:10 GMT

    [ https://issues.apache.org/jira/browse/CRUNCH-336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887884#comment-13887884 ]

Micah Whitacre commented on CRUNCH-336:
---------------------------------------

CRUNCH-299 was logged to track something similar to "item 1", but by interpreting a FilterFn
as a RecordFilter.  I'm inclined to agree with [~jwills] and [~gabriel.reid] that we should
just expose those options at source creation.

> Optimized filters and joins via Parquet RecordFilters
> -----------------------------------------------------
>
>                 Key: CRUNCH-336
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-336
>             Project: Crunch
>          Issue Type: Improvement
>            Reporter: Ryan Brush
>
> Logging this to track some ideas from an offline discussion with [~jwills] and [~mkwhitacre].
> There's an opportunity to significantly speed up a couple of access patterns:
> 1. Process only a subset of data from a Parquet file identified by a single column
> 2. Perform a bloom filter join between two datasets, where the joined item is a Parquet
>    column in the larger data set.
> Optimizing item 1 simply involves using a RecordFilter to narrow down the data loaded
> from the AvroParquetInputFormat.
> Optimizing item 2 is more involved. In a nutshell, we discussed doing a bloom filter
> join, but using the bloom filter to implement the Parquet RecordFilter on the specific column.
> In cases where we join on columns and only select a small subset of the larger dataset,
> this would skip IO and deserialization cost for all items that didn't match the join.
> It's not obvious to me how we'd achieve this cleanly, since it involves multiple pieces
> (configuring inputs in conjunction with a specific join strategy). In many cases the bloom
> filter join alone will achieve sufficient performance, but I'm logging this potential optimization
> for reference.
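
For reference, the idea in item 2 can be sketched in plain Java. This is only an illustration under stated assumptions, not Crunch or Parquet code: SimpleBloomFilter, columnFilterFor, and scan are hypothetical names, and the java.util.function.Predicate stands in for whatever column-level filter the source would actually accept. The sketch builds a bloom filter from the join keys of the smaller dataset and consults it on the join column of the larger dataset before a row is kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative only: none of these names come from Crunch or Parquet. */
class BloomRecordFilterSketch {

    /** A tiny bloom filter over String keys, using double hashing. */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        void add(String key) {
            for (int i = 0; i < numHashes; i++) {
                bits.set(index(key, i));
            }
        }

        /** False means definitely absent; true means possibly present. */
        boolean mightContain(String key) {
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(index(key, i))) {
                    return false;
                }
            }
            return true;
        }

        private int index(String key, int i) {
            int h1 = key.hashCode();
            int h2 = (h1 >>> 16) | 1; // force the second hash to be odd
            return Math.floorMod(h1 + i * h2, numBits);
        }
    }

    /** Builds a join-column predicate from the keys of the smaller dataset. */
    static Predicate<String> columnFilterFor(Iterable<String> smallSideKeys,
                                             int numBits, int numHashes) {
        SimpleBloomFilter bloom = new SimpleBloomFilter(numBits, numHashes);
        for (String key : smallSideKeys) {
            bloom.add(key);
        }
        return bloom::mightContain;
    }

    /** Stand-in for a scan that tests the join column before keeping a row. */
    static List<String[]> scan(List<String[]> rows, int joinColumn,
                               Predicate<String> filter) {
        List<String[]> kept = new ArrayList<>();
        for (String[] row : rows) {
            if (filter.test(row[joinColumn])) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Predicate<String> filter =
            columnFilterFor(Arrays.asList("u1", "u7"), 1024, 3);
        List<String[]> large = Arrays.asList(
            new String[] {"u1", "click"},
            new String[] {"u2", "view"},
            new String[] {"u7", "click"});
        // Rows whose key is in the small side always pass; others usually do not.
        System.out.println(scan(large, 0, filter).size() + " rows passed the filter");
    }
}
```

A bloom filter can return false positives but never false negatives, so rows that pass the filter still have to go through the real join; the filter only reduces how many rows reach it. In the scenario described above the saving would come from applying the predicate before a record is read and deserialized, which this in-memory sketch does not model.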



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
