drill-dev mailing list archives

From Aman Sinha <asi...@maprtech.com>
Subject Re: Centralizing batch size bounds for testing and tuning
Date Sat, 10 Jan 2015 01:45:35 GMT
Yes, in the past we have talked about this primarily in the context of unit
testing...certainly we want to be able to write tests with small input data
and exercise the batch boundaries.  Performance tuning is an added benefit
once we can do some performance characterization.
We might want to think about whether the batch size should be a static
value or something that is determined once the 'fast schema' is known for
the leaf operators.  The batch size could be a function of the row width...
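
As a rough illustration of a batch size derived from row width, here is a
minimal Java sketch. It is not Drill code: the class name, the per-batch
memory budget, and the row-count cap are all hypothetical, and the row width
is assumed to be an estimate available once the schema is known.

```java
// Hypothetical sketch: choose a row count per batch from an estimated row
// width and a per-batch memory budget. None of these names exist in Drill.
final class BatchSizing {
    static final int MIN_ROWS = 1;      // always make progress, even on very wide rows
    static final int MAX_ROWS = 65536;  // assumed upper cap on rows per batch

    static int rowsPerBatch(long batchMemoryBudgetBytes, int estimatedRowWidthBytes) {
        if (batchMemoryBudgetBytes <= 0 || estimatedRowWidthBytes <= 0) {
            throw new IllegalArgumentException("budget and row width must be positive");
        }
        // Number of rows that fit in the budget, clamped to [MIN_ROWS, MAX_ROWS].
        long rows = batchMemoryBudgetBytes / estimatedRowWidthBytes;
        return (int) Math.max(MIN_ROWS, Math.min(MAX_ROWS, rows));
    }
}
```

With a 1 MiB budget and 256-byte rows this yields 4096 rows per batch;
narrower rows yield larger batches up to the cap, and wider rows yield
smaller ones.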

Aman

On Fri, Jan 9, 2015 at 5:09 PM, Jason Altekruse <altekrusejason@gmail.com>
wrote:

> Hello Drillers,
>
> Currently each of the physical operators in Drill has its own way of
> specifying how many records it will try to produce in a single batch. For
> some operators like project, the outgoing batch will be the same size as
> the incoming, in the case of a projection with no evaluations. If the size
> of the data is changing in a projection, such as converting a numeric type
> to varchar, we cannot guess how much memory will be needed in the outgoing
> buffer, so we may have to cut off the first batch once we run out of space
> and separately handle the overflowing data.
>
> In other operators, where the incoming streams cause the spawning of new
> outgoing records, we cannot guess the outgoing batch size; we just need
> to keep producing rows and cutting off batches as we run out of space.
> Rather than hit exceptions in all cases, many of the operators have a
> loop termination based on some expected number of rows in a batch, which
> is generally around 4096. The record readers also define such limits.
>
> I believe standardizing this value and making it configurable may be useful
> for both debugging and tuning Drill. We have often found bugs around batch
> boundary conditions, which often necessitates generating larger test cases
> to reproduce problems and create unit tests once the issues are fixed. If
> we could lower this value, we may be able to write more concise tests that
> easily demonstrate the boundary conditions in smaller input files and
> test definitions.
>
> This could also be useful for tuning Drill. While we may not want to make
> this option available in production, we could use it in the meantime to
> drive efforts to identify best values in different scenarios when we
> stretch the limits of Drill. After a brief discussion with Steven he said
> that in some of his testing he was able to see some performance gains
> increasing the value from 4000 to 32k. This isn't a strong argument in
> itself for pushing up the default, as it will increase memory requirements
> and will likely hurt us in multi-user environments and environments
> running many concurrent queries. In these cases we may need to
> automatically throttle back the batch size to reduce overall memory usage
> of any particular operation.
>
> Making this change would touch a fairly large number of files, but I think
> the possible benefits could justify it. I just wanted to collect thoughts
> from the community.
>
> - Jason
>
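
To make the proposal in the quoted message concrete, the following Java
sketch shows one way a single, configurable row bound could drive batch
cut-off, so that a test can lower it and hit batch boundaries with a handful
of rows. It is an assumption-laden illustration, not Drill's implementation:
the class, the system property name "example.batch.max_rows", and the
list-of-lists batch representation are all hypothetical. Only the 4096
default comes from the message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a centralized batch bound; not Drill's actual API.
final class BatchBounds {
    static final String PROPERTY = "example.batch.max_rows"; // made-up property name
    static final int DEFAULT_MAX_ROWS = 4096;                // typical value cited in the thread

    // Single place where every operator would look up the bound.
    static int maxRows() {
        return Integer.getInteger(PROPERTY, DEFAULT_MAX_ROWS);
    }

    // Cut an incoming stream of rows into batches of at most maxRows rows.
    static <T> List<List<T>> toBatches(Iterable<T> rows, int maxRows) {
        if (maxRows <= 0) {
            throw new IllegalArgumentException("maxRows must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T row : rows) {
            current.add(row);
            if (current.size() == maxRows) { // loop termination on the row bound
                batches.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            batches.add(current); // final, possibly short, batch
        }
        return batches;
    }
}
```

With the bound lowered to 2, five input rows already produce three batches
(2, 2, and 1 rows), which is the kind of boundary case that would otherwise
need thousands of rows to reproduce.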
