ant-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nicolas Lalevée <nicolas.lale...@hibnet.org>
Subject Re: A performance issue in 'restrict'
Date Sat, 19 Jun 2010 13:05:22 GMT

Le 15 juin 2010 à 18:38, Matt Benson a écrit :

> On 6/15/10, Nicolas Lalevée <nicolas.lalevee@hibnet.org> wrote:
>> On Monday 14 June 2010 20:01:08 Matt Benson wrote:
>>> Hi, Nicolas:
>>> 
>>> [SNIP]
>>> 
>>>> I looked into the code, it seems
>>>> org.apache.tools.ant.types.resources.Restrict is the culprit. It uses
>>>> BaseResourceCollectionWrapper which loads the entire underlying resource
>>>> collection.
>>>> I searched for the use of BaseResourceCollectionWrapper in Ant and I
>>>> think that this performance issue may also affect
>>>> org.apache.tools.ant.types.resources.SizeLimitCollection and
>>>> org.apache.tools.ant.types.resources.Tokens.
>>>> 
>>>> I started to think about a fix but it seems non trivial. If I'm not
>>>> mistaken we would have to implement an iterator which should deal with
>>>> an
>>>> optionnal cache and concurrency.
>>> 
>>> While it would be possible to reimplement Restrict's internal wrapper
>>> such that it would dynamically calculate the included resources while
>>> iterating, I don't see any way to do the same when the *size* of the
>>> collection is requested...
>> 
>> Yep if size is called then the whole collection has to be "fetched".
>> I just hope that in my use case where I just want the first element of the
>> restrict, the whole resource collection won't be fetched. And as I keep
>> looking into the code, it seems that the implementation of
>> org.apache.tools.ant.types.resources.First does that. So size() shouldn't be
>> an issue here.
>> 
>>>> I am also worried about the isFilesystemOnly implementation in Restrict.
>>>> Should it return true if actually selected resources are files, or
>>>> return
>>>> true if the entire set of candidates are files ? In the first case it
>>>> would make the iterator useless.
>>> 
>>> Typically the implementation of #isFilesystemOnly() in
>>> BaseResourceCollectionWrapper and BaseResourceCollectionContainer is a
>>> quick check whether the underlying collection/s is/are known to be
>>> entirely filesystem-based.  If not, each resource is checked in turn
>>> until one is found that is *not* filesystem-based.  The reason for the
>>> distinction is to provide a means of recognizing whether it should be
>>> possible to handle a given resourcecollection using a task that can
>>> only work with files.
>> 
>> So I am suggesting a little change on the implementation: removing the check
>> isFilesystemOnly on the underlying resource collection; just keep the
>> iteration on the actual resources of the underlying collection.
>> I don't think it will change the semantic of the function, right?
>> About performance, in my use case it should behave better as the <first>
>> will
>> only load one resource in the collection.
>> I think there could be a performance impact where a resource collection is
>> known to be file system only, and it is used with a task that cannot support
>> that. The resources will be loaded but not actually used as the build will
>> fail.
>> So it seems a choice between:
>> * useless iteration on some resources which can have high latency
>> * useless iteration on some resources which should be on a filsytem and
>> trigger a failed build.
>> 
>> I prefer the second choice. I might not be impartial though ;)
>> 
> 
> I feel like if we get stuck on #isFilesystemOnly(), we're going to
> lose sight of the important issue here.  This kind of diagnostic
> method must, by definition, ultimately survey each resource, but
> checking against the nested collection/s is a valuable optimization.
> Consider the case where you may wrap a <first> in a <sort> in a
> <restrict> of some other resourcecollection.  If each of these has to
> fully iterate, that amounts to greater latency than attempting to
> delegate that call all the way back to the innermost nested
> collection, does it not?

When we use the iterator on the top resource collection, we will only get the necessary resources.
<first> will only fetch one resource. Because of <sort>, actually every resource
will be loaded, but it is only due to the sort function, which algorithm force use to fetch
every resource even if we want only one. So in that case, yes, asking for the underlying collection
if it is filesystem only is an optimization. This is the second case I pointed above. The
optimization is valid only on structurally filesystem only resource collection.

Maybe the optimization should be more precise when querying the underlying resource collection.
Today there is a check if the underlying collection actually contains filesystem only resources.
Whereas it could just check that the underlying resource collection is structurally a filesystem
resource collection; and if not then the top most resource collection will iterate on the
resources.
The improvement should then to add a new function isFileSystemOnlyCollection which should
not check actual resources but check only its own "type". ResourceCollection being an interface
we cannot change it without breaking backward compatibility. So we are back to the above two
"bad" choices.


> 
> The bigger issue is whether we can figure out a way to increase
> throughput by delegating iterator calls.  The problem is that
> BaseResourceCollectionWrapper and BaseResourceCollectionContainer both
> implement the cache attribute by caching a collection of resources.
> We could explore making the implementation of the cached collection
> more efficient.  Once again, my understanding of your wider purpose
> here is, given:
> 
> <first>
>  <restrict>
>    <someotherresourcecollection />
>  </restrict>
> </first>
> 
> You would like for the innermost collection to iterate only until
> restrict accepts one of its resources.  Perhaps the original
> implementation is naive in its simplicity.  It has been several years
> now since the bulk of that code was added to Ant;

There is always room for some optimization; I prefer a lot seeing not-yet-improved code that
already buggy code ;)

> if we can figure out
> how to optimize it now I would be perfectly happy with that, but I
> don't think #isFilesystemOnly() is germane to that efficiency
> discussion.

you are right. I wondered about isFilesystemOnly because if a task happens to call that function,
then a smart and lazy iterator would be useless.

I'll try to implement that iterator.

Nicolas


> 
> -Matt
> 
>> Nicolas
>> 
>> 
>>> 
>>> -Matt
>>> 
>>>> Side note: I didn't checked that piece of script in as our Hudson
>>>> instance doesn't have Ant 1.8 installed. Not yet. I'll ask for it.
>>>> 
>>>> Nicolas
>>>> 
>>>> [1] http://ant.apache.org/manual/Types/resources.html#resourcelist
>>>> 
>>>> PS: I started recently to intensively use loadresource and all the
>>>> resource related stuff in my build scripts, it is amazingly porwerful !
>>>> It reminds me when I was 'pipelining' in cocoon ;)
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@ant.apache.org
>>>> For additional commands, e-mail: dev-help@ant.apache.org
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@ant.apache.org
>>> For additional commands, e-mail: dev-help@ant.apache.org
>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@ant.apache.org
>> For additional commands, e-mail: dev-help@ant.apache.org
>> 
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@ant.apache.org
> For additional commands, e-mail: dev-help@ant.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@ant.apache.org
For additional commands, e-mail: dev-help@ant.apache.org


Mime
View raw message