spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael Segel <msegel_had...@hotmail.com>
Subject Indexing w spark joins?
Date Mon, 17 Oct 2016 16:49:26 GMT
Hi,

Apologies if I’ve asked this question before but I didn’t see it in the list and I’m
certain that my last surviving brain cell has gone on strike over my attempt to reduce my
caffeine intake…

Posting this to both user and dev because I think the question / topic jumps in to both camps.


Again since I’m a relative newbie on spark… I may be missing something so apologies up
front…


With respect to Spark SQL,  in pre 2.0.x,  there were only hash joins?  In post 2.0.x you
have hash, semi-hash , and sorted list merge.

For the sake of simplicity… lets forget about cross product joins…

Has anyone looked at how we could use inverted tables to improve query performance?

The issue is that when you have a data sewer (lake) , what happens when your use case query
is orthogonal to how your data is stored? This means full table scans.
By using secondary indexes, we can reduce this albeit at a cost of increasing your storage
footprint by the size of the index.

Are there any JIRAs open that discuss this?

Indexes to assist in terms of ‘predicate push downs’ (using the index when a field in
a where clause is indexed) rather than performing a full table scan.
Indexes to assist in the actual join if the join column is on an indexed column?

In the first, using an inverted table to produce a sort ordered set of row keys that you would
then use in the join process (same as if you produced the subset based on the filter.)

To put this in perspective… here’s a dummy use case…

CCCis (CCC) is the middle man in the insurance industry. They have a piece of software that
sits in the repair shop (e.g Joe’s Auto Body) and works with multiple insurance carriers.
The primary key in their data is going to be Insurance Company | Claim ID.  This makes it
very easy to find a specific claim for further processing.

Now lets say I want to do some analysis on determining the average cost of repairing a front
end collision of a Volvo S80?
Or
Break down the number and types of accidents by car manufacturer , model and color.  (Then
see if there is any correlation between car color and # and type of accidents)


As you can see, all of these queries are orthogonal to my storage.  So I need to create secondary
indexes to help sift thru the data efficiently.

Does this make sense?

Please Note: I did some work for CCC back in the late 90’s. Any resemblance to their big
data efforts is purely coincidence  and you can replace CCC with Allstate, Progressive, StateFarm
or some other auto insurance company …

Thx

-Mike


Mime
View raw message