samza-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Julian Hyde <jul...@hydromatic.net>
Subject Re: Joining Avro records
Date Thu, 09 Apr 2015 19:35:24 GMT
Much of this is about mapping from logical fields (i.e. the fields you can reference in SQL)
down to the Avro representation; I’m no expert on that mapping, so I’ll focus on the SQL
stuff.

First, SQL doesn’t allow a record to have two fields of the same name, so you wouldn’t
be allowed to have two “name” fields. When you do a join, you might need to alias output
columns:

select stream orders.id, products.id as productId
from orders
join products on orders.id = products.id;

Second, JOIN isn’t the only SQL operator that combines records; GROUP BY also combines records.
JOIN combines records from different streams, and they usually have different types (i.e.
different numbers/types of fields), whereas GROUP BY combines records from the same stream.
Use whichever best suits your purpose.

select stream zipcode, floor(rowtime to hour), array_agg(orderid) as orderIds
from orders
group by zipcode, floor(rowtime to hour)

(array_agg is an aggregate function, recently added to the SQL standard, that gathers input
values into an array. See http://www.craigkerstiens.com/2013/04/17/array-agg/.)

Output:

{ zipcode: “94705”, rowtime: “2015-04-09 11:00:00”, orderIds: [123, 156, 1056] },
{ zipcode: “94117”, rowtime: “2015-04-09 11:00:00”, orderIds: [45, 777] },
{ zipcode: “94705”, rowtime: “2015-04-09 12:00:00”, orderIds: [55] }

Julian


On Apr 9, 2015, at 12:07 PM, Yi Pan <nickpan47@gmail.com> wrote:

> Hi, Roger,
> 
> Good question on that. I am actually not aware of any "automatic" way of
> doing this in Avro. I have tried to add generic Schema and Data interface
> in samza-sql branch to address the morphing of the schemas from input
> streams to the output streams. The basic idea is to have wrapper Schema and
> Data classes on-top-of the deserialized objects to access the data fields
> according to the schema w/o changing and copying the actual data fields.
> Hence, when there is a need to morph the input data schemas into a new
> output data schema, we just need an implementation of the new output data
> Schema class that can read the corresponding data fields from the input
> data and write them out in the output schema. An interface function
> transform() is added in the Schema class for this exact purpose. Currently,
> it only takes one input data and one example of "projection" transformation
> can be found in the implementation of AvroSchema class. A join case as you
> presented may well be a reason to have an implementation of "join" with
> multiple input data.
> 
> All the above solution is still experimental and please feel free to
> provide your feedback and comments on that. If we agree that this solution
> is good and suit for a broader use case, it can be considered to be used
> outside the "SQL" context as well.
> 
> Best regards!
> 
> -Yi
> 
> On Thu, Apr 9, 2015 at 8:55 AM, Roger Hoover <roger.hoover@gmail.com> wrote:
> 
>> Hi Milinda and others,
>> 
>> This is an Avro question but since you guys are working on Avro support for
>> stream SQL, I thought I'd ask you for help.
>> 
>> If I have a two records of type A and B as below and want to join them
>> similar to "SELECT *" in SQL to produce a record of type AB, is there an
>> simple way to do this with Avro without writing code to copy each field
>> individually?
>> 
>> I appreciate any help.
>> 
>> Thanks,
>> 
>> Roger
>> 
>> {
>>  "name": "A",
>>  "type": "record",
>>  "namespace": "fubar",
>>  "fields": [{"name": "a", "type" : "int"}]
>> }
>> 
>> {
>>  "name": "B",
>>  "type": "record",
>>  "namespace": "fubar",
>>  "fields": [{"name": "b", "type" : "int"}]
>> }
>> 
>> {
>>  "name": "AB",
>>  "type": "record",
>>  "namespace": "fubar",
>>  "fields": [{"name": "a", "type" : "int"}, {"name": "b", "type" : "int"}]
>> }
>> 


Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message