flink-user mailing list archives

From Marcus Leich <marcus.le...@tu-berlin.de>
Subject Re: ScannerTimeout over long running process
Date Fri, 28 Nov 2014 19:33:58 GMT
Hi guys,

sorry for the delay, I meant to catch up on this issue earlier.
I’ve two points that I would like to bring up here:

1. (Less important) What causes the delays?
I think you should definitely investigate what causes the issue. The quickest thing to do
on your own is to try to keep a log of execution times of code that is directly chained to
the data source (that would be at least the mapper). I know doing that by hand is tedious,
but tools for that won’t be available in the very near future.
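
A minimal sketch of such a hand-rolled timing log, in plain Java. The class name `TimedFunction` and the warning threshold are made up for illustration, and `java.util.function.Function` only stands in for whatever user function is chained to the source (in a real job this would be the map function itself):

```java
import java.util.function.Function;

/** Hypothetical wrapper that times every call of a user function. */
final class TimedFunction<I, O> implements Function<I, O> {
    private final Function<I, O> inner;
    private final long warnThresholdNanos;
    private long calls = 0;
    private long totalNanos = 0;
    private long maxNanos = 0;

    TimedFunction(Function<I, O> inner, long warnThresholdMillis) {
        this.inner = inner;
        this.warnThresholdNanos = warnThresholdMillis * 1_000_000L;
    }

    @Override
    public O apply(I input) {
        long start = System.nanoTime();
        try {
            return inner.apply(input);
        } finally {
            // Record the duration even if the wrapped function throws.
            long elapsed = System.nanoTime() - start;
            calls++;
            totalNanos += elapsed;
            maxNanos = Math.max(maxNanos, elapsed);
            if (elapsed > warnThresholdNanos) {
                System.err.println("WARN: call " + calls + " took "
                        + (elapsed / 1_000_000L) + " ms");
            }
        }
    }

    long calls() { return calls; }
    long totalNanos() { return totalNanos; }
    long maxNanos() { return maxNanos; }
}
```

Note that this only measures time spent inside the wrapped function. A long gap between two calls that does not show up here would point to a pause further downstream rather than in the function itself.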

2. (Much more important) The delays should not be an issue!
HBase can restart a scan at any point within a region at fairly low cost, as long as you know
the key from which you want to start reading.
So, the idea would be to catch exactly the kinds of timeouts you are experiencing (maybe log
a warning) and directly create a new scanner that is configured to start at the position of
the last successfully retrieved tuple.
This approach means we would need to keep a copy of the key of the most recent tuple returned
by each scanner in the input format. Of course, that comes with a certain cost, but my guess
would be that HBase keys are usually not overly large, so performance should not drop significantly.
I have an unstable and outdated implementation of that approach somewhere in an old Stratosphere
branch and I could try to polish it up so Flavio can try it out.
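
To make the idea concrete, here is a rough sketch in plain Java. It is not the Stratosphere implementation mentioned above and it deliberately does not use the HBase client API: `RowScanner`, `ScanTimeout` and `ResumingScanner` are hypothetical stand-ins for the real scanner, its timeout exception and the input format logic. It assumes the start key of a scan is inclusive, so the scan is resumed at the last returned key plus a trailing zero byte, which is the smallest key that sorts strictly after it.

```java
import java.util.Arrays;
import java.util.function.Function;

/** Hypothetical wrapper that reopens a scan after a timeout. */
final class ResumingScanner {

    /** Stand-in for the exception thrown when a scanner has timed out. */
    public static class ScanTimeout extends RuntimeException {}

    /** Stand-in for a scanner: next row key, or null at the end of the range. */
    public interface RowScanner {
        byte[] next();
    }

    private final Function<byte[], RowScanner> openAt; // opens a scan at an inclusive start key
    private final int maxRetries;
    private RowScanner current;
    private byte[] resumeKey; // where to start again if the current scanner dies
    private int restarts = 0;

    ResumingScanner(byte[] startKey, Function<byte[], RowScanner> openAt) {
        this.openAt = openAt;
        this.maxRetries = 3; // arbitrary bound so a broken source cannot loop forever
        this.resumeKey = startKey.clone();
        this.current = openAt.apply(startKey);
    }

    /** Smallest key that sorts strictly after the given key. */
    static byte[] successor(byte[] key) {
        return Arrays.copyOf(key, key.length + 1); // pads with a zero byte
    }

    byte[] next() {
        int attempts = 0;
        while (true) {
            try {
                byte[] key = current.next();
                if (key != null) {
                    // Remember the position right after the row we hand out.
                    resumeKey = successor(key);
                }
                return key;
            } catch (ScanTimeout e) {
                if (++attempts > maxRetries) {
                    throw e;
                }
                System.err.println("WARN: scanner timed out, reopening (attempt "
                        + attempts + ")");
                restarts++;
                current = openAt.apply(resumeKey);
            }
        }
    }

    int restarts() { return restarts; }
}
```

The only extra state per scanner is one copy of the last key, which is the cost mentioned above.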

tl;dr
If you can’t prevent the timeout, embrace it and simply start a new scan from where you
left off.

Best,
Marcus

> On 27 Nov 2014, at 20:16, Flavio Pompermaier <pompermaier@okkam.it> wrote:
> 
> Thanks Stephan for the support! Unfortunately we are not able to understand what lineage
> of operators cause this problem..
> in our case we set the scan timeout to 15 minutes so I think we can exclude garbage collection
> thus, probably, this is caused by the first option (unfortunately HBase cannot block scans
> indefinitely..).
> 
> What can we do to debug this problem? can you give us more detail or links to the internals
> of such situations? it is not very clear to me the relation between buffers, actions and pauses
> between two consecutive nextRecord() on the same split of the inputFormat..

