maven-dev mailing list archives

From Kristian Rosenvold <kristian.rosenv...@gmail.com>
Subject Files.walkFileTree performance vs plexus DirectoryScanner
Date Fri, 06 Mar 2015 07:19:43 GMT
2015-03-05 21:34 GMT+01:00 Igor Fedorenko <igor@ifedorenko.com>:

>
> On 2015-03-05 14:12, Kristian Rosenvold wrote:
>
>> Actually Files.walkFileTree is just about the only NIO 7 feature we're
>> not using. Anyone have any specific pointers/experience that actually
>> show this being faster than the current strategy ?
>>
>
> I ran some tests about a year ago on a large 200K files source tree and
> back then convinced myself Files.walkFileTree was noticeably faster.
> That test was on linux. Today I tried to reproduce the same results on
> osx and couldn't; walking a directory tree with new and old-style io performs
> about the same. Sorry for the misinformation.
>
This is a pretty cool discussion by itself, so I figured I'd share my
thoughts on the subject:

The plexus directory scanner is pretty damn fast. I have done various POCs
over the years to try to make it faster and/or compare performance to other
methods. I also have several multithreaded implementations lying around in
various github repositories. It is trivial to make a multithreaded scanner
that is 2-3x faster than the single threaded one for most use cases.
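
To make the multithreaded idea concrete, here is a minimal sketch of a fork/join
based scan. It is not the plexus DirectoryScanner and not any of the
implementations mentioned above; the names ParallelScan and DirTask are
hypothetical, and includes/excludes handling is left out entirely. Whether it
is actually faster depends on the tree and the file system.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical sketch: a fork/join directory walk, not the plexus scanner.
final class ParallelScan {

    // One task per directory; subdirectories are forked so they can be
    // listed concurrently by other worker threads.
    static final class DirTask extends RecursiveTask<List<File>> {
        private final File dir;

        DirTask(File dir) {
            this.dir = dir;
        }

        @Override
        protected List<File> compute() {
            List<File> files = new ArrayList<>();
            List<DirTask> subTasks = new ArrayList<>();
            File[] children = dir.listFiles(); // null if unreadable or not a directory
            if (children == null) {
                return files;
            }
            for (File child : children) {
                if (child.isDirectory()) {
                    DirTask task = new DirTask(child);
                    task.fork();
                    subTasks.add(task);
                } else {
                    files.add(child);
                }
            }
            // Collect the results of the forked subdirectory tasks.
            for (DirTask task : subTasks) {
                files.addAll(task.join());
            }
            return files;
        }
    }

    // Returns all non-directory entries below root, in no particular order.
    static List<File> scan(File root) {
        return ForkJoinPool.commonPool().invoke(new DirTask(root));
    }
}
```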

The thing is, unless you are working on nexus/archiver or some other impl with
*huge* structures, multithreading the scanning itself is basically an
extremely tiny optimization; directory scanning is really a very tiny
fraction of overall processing, and the OS is really good at caching this
stuff once it's been done once for the module. And for most modules we're
really only talking about a few milliseconds anyway.

So why would you want to parallelize this stuff? One thing that comes to
mind is to have the scanner "drive" the rest of the algorithm, also in
parallel. As an example, have the directory scanner hand off each file for
zip compression inside the jar *immediately* after it is scanned. In such a
design one could see the threads of the directory scanner seeping into the
actual algorithm of the plugin/code that does the work, where it would
parallelize the internal plugin implementation. Unfortunately this is not
entirely trivial (you may read this as "damn hard") with existing
layerings, and I doubt there'd be too many places we could put this to good
use.
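
A rough sketch of the hand-off idea, under some loud assumptions: the walk
itself stays single threaded and uses Files.walkFileTree, the per-file work is
a CRC32 checksum standing in for the real compression step, and HandOffScan,
checksumAll and crc are hypothetical names. It only shows the shape of
"submit each file as soon as it is visited", not how a real plugin would be
layered.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.CRC32;

// Hypothetical sketch: the scanner "drives" the per-file work.
final class HandOffScan {

    // Walks root on the calling thread and submits each regular file to the
    // pool as soon as it is visited, so work starts before the walk finishes.
    static Map<Path, Long> checksumAll(Path root, int threads)
            throws IOException, InterruptedException, ExecutionException {
        final ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            final Map<Path, Future<Long>> pending = new LinkedHashMap<>();
            Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                    if (attrs.isRegularFile()) {
                        pending.put(file, pool.submit(() -> crc(file)));
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
            // Wait for the workers and collect results in visit order.
            Map<Path, Long> result = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<Long>> e : pending.entrySet()) {
                result.put(e.getKey(), e.getValue().get());
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for the real per-file work (assumption: checksum, not zip).
    static long crc(Path file) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(Files.readAllBytes(file));
        return crc.getValue();
    }
}
```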

The second use case which is far more interesting is to avoid repeated IO,
which is where plexus-io shines but both DirectoryScanner and
Files.walkFileTree fail miserably. A *lot* of file operations cause
OS-level IO activity (file.exists, file.isFile, file.isDirectory come to
mind). With a scanner that returns file objects or strings, these
operations have to be repeated through layers and layers again. Looking at
the performance of maven code there is a *lot* of repeated IO regarding the
same files as they seep through the layers and the same state is being
constructed again and again and again and again and again. Plexus io
provides a fixed view of the file system as it was at the time of scanning
and attributes are never re-read.
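
To illustrate what a "fixed view" means, here is a small sketch that captures
the attributes once at scan time and hands an immutable record to the layers
above, so they never need to call exists/isFile/isDirectory again. This is not
the plexus-io API; SnapshotScan and ScannedEntry are hypothetical names, and
the sketch simply reuses the BasicFileAttributes that Files.walkFileTree
passes to its visitor.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: attributes are read once and then frozen.
final class SnapshotScan {

    // Immutable view of one entry as it was at scan time.
    static final class ScannedEntry {
        final Path path;
        final boolean directory;
        final long size;
        final long lastModifiedMillis;

        ScannedEntry(Path path, BasicFileAttributes attrs) {
            this.path = path;
            this.directory = attrs.isDirectory();
            this.size = attrs.size();
            this.lastModifiedMillis = attrs.lastModifiedTime().toMillis();
        }
    }

    // Returns directories and files below (and including) root, each with
    // the attributes captured during the walk.
    static List<ScannedEntry> scan(Path root) throws IOException {
        final List<ScannedEntry> entries = new ArrayList<>();
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) {
                entries.add(new ScannedEntry(dir, attrs));
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                entries.add(new ScannedEntry(file, attrs));
                return FileVisitResult.CONTINUE;
            }
        });
        return entries;
    }
}
```

Code further up the stack would then consult entry.directory or entry.size
instead of going back to the file system for the same answer.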

Something like a multithreaded plexus-io based on a multithreaded scanner
would probably have quite a significant impact on performance. I actually
have this in a github repo too :) (A dirscanner that is "functionally"
implemented and accepts a Consumer<File> that receives the output.)
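
For readers who have not seen the style, a scanner that pushes its output into
a Consumer<File> can be as small as the sketch below. This is only an
illustration of the shape, not the code from the repository mentioned above;
ConsumerScan is a hypothetical name and the walk here is single threaded.

```java
import java.io.File;
import java.util.function.Consumer;

// Hypothetical sketch: the scanner returns nothing and pushes files to a sink.
final class ConsumerScan {

    // Recursively walks dir and passes every non-directory entry to sink.
    static void scan(File dir, Consumer<File> sink) {
        File[] children = dir.listFiles(); // null if unreadable or not a directory
        if (children == null) {
            return;
        }
        for (File child : children) {
            if (child.isDirectory()) {
                scan(child, sink);
            } else {
                sink.accept(child);
            }
        }
    }
}
```

A caller decides what happens to each file, for example
ConsumerScan.scan(new File("src"), f -> System.out.println(f)); the same sink
could just as well put files on a queue for worker threads.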

Kristian
