incubator-drill-user mailing list archives

From Eriksson Magnus <magnus_p.eriks...@scania.com>
Subject Re: meeting notes 10/22/13
Date Wed, 23 Oct 2013 04:12:19 GMT
Sent to the wrong email.

Best regards
Magnus in Sweden
For every hard problem there is at least one solution that is simple, easy to understand and
wrong...

Lisen Mu wrote:
Thanks Jason! And thanks for everyone's time!

* Push from leaves
Thanks for Jacques' suggestion. Indeed, in our current implementation we
need to take special care with RecordBatches that have multiple inputs, and
they need more memory when executing, Join specifically. Your suggestion
about prefetching reminds me of two improvements on this:

1. I can add a bounded queue at each edge in the DAG to serve as a data
buffer. When the queue is full, it stalls the operator that is putting data
into it. This should solve the huge memory problem (see the sketch after
this list).
2. With 1, I think the execution could be more Storm-like: each RecordBatch
is driven by its input, via a thread pool. This way we can better
parallelize CPU as well as IO.
3. The downside of 1 & 2 is more execution overhead on the CPU. However, we
will modify our push implementation and see what we can get. Thanks!
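
To illustrate 1 & 2, a minimal Java sketch (class and names are
hypothetical, not our actual RecordBatch code): put() on the bounded queue
blocks when the edge is full, which stalls the upstream operator, while a
thread pool drives both sides:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class BoundedEdge {
        private static final int EOF = -1;  // sentinel marking end of stream

        public static void main(String[] args) throws InterruptedException {
            // The bounded queue is the "edge" in the DAG; capacity 4 batches.
            BlockingQueue<Integer> edge = new ArrayBlockingQueue<>(4);
            ExecutorService pool = Executors.newFixedThreadPool(2);

            // Upstream operator: pushes batches from a leaf. put() blocks
            // when the queue is full, so a slow consumer stalls this producer.
            pool.submit(() -> {
                try {
                    for (int batch = 0; batch < 16; batch++) {
                        edge.put(batch);
                    }
                    edge.put(EOF);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // Downstream operator: driven by its input, running on the pool.
            pool.submit(() -> {
                try {
                    int batch;
                    while ((batch = edge.take()) != EOF) {
                        Thread.sleep(10);  // simulate per-batch work
                        System.out.println("consumed batch " + batch);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            pool.shutdown();
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }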

* For stream processing
One of our implementation goals is to use the same set of RecordBatch
implementations for both standard pull exec and push exec, and possibly
stream processing. Resource-wise, it could also be memory-consuming if we
want to join a (relatively) static dimension table with a streaming fact
table, because the dimension table must fit entirely into memory.
This is still far off the radar for now; just thinking: we could
dynamically add/remove arbitrary queries in the graph; data streams in and
the graph is updated; whenever we need a result, we just access the
corresponding node in the graph; the graph context resets periodically at
time-window boundaries, while the topology remains.
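
To make the memory point concrete, a toy sketch (hypothetical names, not
our operator code): the dimension side is materialized as an in-memory hash
table, which is what must fit in memory, while fact rows stream through one
at a time:

    import java.util.HashMap;
    import java.util.Map;

    public class StreamDimensionJoin {
        public static void main(String[] args) {
            // The (relatively) static dimension table is loaded entirely
            // into memory; this map is what must fit, regardless of fact
            // volume.
            Map<Integer, String> dimension = new HashMap<>();
            dimension.put(1, "books");
            dimension.put(2, "music");

            // Fact rows arrive as an unbounded stream; each one just probes
            // the in-memory table, so the streaming side needs no buffering.
            int[] streamingFactKeys = {2, 1, 2, 3};
            for (int key : streamingFactKeys) {
                String dim = dimension.get(key);
                if (dim != null) {
                    System.out.println("fact(" + key + ") -> " + dim);
                }
            }
        }
    }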

* For approximation
As Jason said, our current work on sampling makes many assumptions about
our data distribution. I can hardly imagine how this part of the code could
be useful to others, except for the CountDistinct. However, I will try to
sort out something to share if I can get something general.
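
As a starting point, a rough, self-contained sketch of the HyperLogLog idea
behind an approximate CountDistinct (the precision and demo values are
arbitrary, and the hash mix is MurmurHash3's 64-bit finalizer used as a
stand-in hash function):

    public class HllDemo {
        private final int p;          // precision: number of index bits
        private final int m;          // register count, m = 2^p
        private final byte[] registers;

        HllDemo(int p) {
            this.p = p;
            this.m = 1 << p;
            this.registers = new byte[m];
        }

        void add(long hash) {
            int idx = (int) (hash >>> (64 - p)); // top p bits pick a register
            long rest = hash << p;               // remaining 64-p bits
            int rank = Math.min(Long.numberOfLeadingZeros(rest) + 1,
                                64 - p + 1);     // first 1-bit position
            if (rank > registers[idx]) {
                registers[idx] = (byte) rank;    // keep the maximum rank seen
            }
        }

        long estimate() {
            double sum = 0;
            int zeros = 0;
            for (byte r : registers) {
                sum += Math.scalb(1.0, -r);      // 2^-r
                if (r == 0) zeros++;
            }
            double alpha = 0.7213 / (1 + 1.079 / m); // bias term, m >= 128
            double e = alpha * m * m / sum;
            if (e <= 2.5 * m && zeros > 0) {
                e = m * Math.log((double) m / zeros); // small-range correction
            }
            return Math.round(e);
        }

        // MurmurHash3 64-bit finalizer, used as the stand-in hash.
        static long mix64(long z) {
            z = (z ^ (z >>> 33)) * 0xff51afd7ed558ccdL;
            z = (z ^ (z >>> 33)) * 0xc4ceb9fe1a85ec53L;
            return z ^ (z >>> 33);
        }

        public static void main(String[] args) {
            HllDemo hll = new HllDemo(14);       // 16384 one-byte registers
            for (long i = 0; i < 1_000_000; i++) {
                hll.add(mix64(i));               // one million distinct values
            }
            System.out.println("estimated distinct: " + hll.estimate());
        }
    }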





On Wed, Oct 23, 2013 at 12:49 AM, Jason Altekruse
<altekrusejason@gmail.com> wrote:

> Hello All,
>
> Here are the notes from today's hangout. Michael, can you copy them into the
> google doc?
>
> participants: Jacques, Michael Hausenblas, Lisen Mu, Yash Sharma, Jinfeng,
> Jason Altekruse, Harri, Steven Phillips, Timothy Chen, Julian Hyde
>
> New employee at MapR: Jinfeng
>     - couple more in the next month
>
> Jacques:
>     - merged limit
>     - clarify VVs
>         - never access internal state of VV when it is invalid
>     - release notes
>
> Steven:
>     - ordered partitioner
>         - abstract out distributed cache interface
>     - continue to work on spooling to disk
> Jason:
>     - semi-blocking
>         - look at sort and ordered hash partitioner
>
> Yash
>     - name of functions
>         - separate class for operators and functions for more clarity
>             - different operators have their own class files
>
> Lisen
>     - fork of Drill
>         - data pushed from leaves rather than pulled from root
>         - we have been thinking about this same problem
>             - don't want to wait for IO all the time
>             - pre-fetch rather than push
>             - in a join you might get pushed a huge amount of data when you
> aren't ready for it
>             - stream processing
>                 - alternative concept around foreman
>                 - not quite right for streams
>                 - resource allocation
>                     - not as much for resource requirements
>         - HyperLogLog
>             - space saving
>             - acceptable - not precise
>         - data assembly - business logic
>             - approximations will be important to drill
>             - no serious thinking about sampling
>             - certain types of scanners should support sampling
>                 - hard with some without reading all data anyway
>                 - Hbase might be easier to do a scan
>             - doing it with their own business logic and statistics
>                 - hard to generalize
>
> Harri
>     - not much for updates
>     - pick up with amazon ec2 docs
>         - had problem where we need 8 gigs
>         - cannot get it running on free micro instance
>         - got it working by removing the direct memory flag in the POM
>         - tim - out of memory exception right away
>             - was this with or without changing the option for direct
> memory?
>
> Tim
>     - wir patch in
>     - amp labs big data benchmark
>         - having numbers for performance evaluation
>         - set up on their repo for drill datasets
>         - installing HDFS to all of the nodes
>         - doesn't look too complicated
>     - cannot submit sql in distributed mode because of bad optimizer
>     - recent review board patches
>         - describe code more completely
>         - hard to review without docs
>         - Julian - single powerpoint slide per operator
>         - google doc? like the logical plan doc
>
>
> Ben
>     - code gen portion of merging receiver
>     - no blockers
>         - getting to code review soon
>
> Julian
>     - joined hortonworks
>     - working on optiq
>     - helping hive, but also working on Drill
>     - making optiq everything it can be
>     - splitting JDBC into thin client
>         - thinking about it, no implementation yet
>         - right now pushing sorts down to Mongo
>     - jacques - session next week on JDBC?
>     - roadmap on optiq
>         - commit logs tell some of the story
>         - roadmap would be helpful
>         - will put out call for optiq users like drill
>         - put together feature list for next release(s)
>         - next 6 months: wants to be agile, but also to be more predictable
>         - Jinfeng will be working with optimizer and optiq
>
