commons-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Oswaldo Olivo (JIRA)" <j...@apache.org>
Subject [jira] [Created] (COLLECTIONS-547) Performance issue in SetUtils::isEqualSet
Date Wed, 21 Jan 2015 21:10:35 GMT
Oswaldo Olivo created COLLECTIONS-547:
-----------------------------------------

             Summary: Performance issue in SetUtils::isEqualSet
                 Key: COLLECTIONS-547
                 URL: https://issues.apache.org/jira/browse/COLLECTIONS-547
             Project: Commons Collections
          Issue Type: Bug
          Components: Set
    Affects Versions: 4.1
         Environment: Ubuntu 14.04
            Reporter: Oswaldo Olivo
            Priority: Minor


There seems to be a performance problem with the function isEqualSet of 
SetUtils when the first parameter is of a collection type that has a slow containsAll/contains
method.

The following is the code of the function:

{code}
    /**
     * Tests two sets for equality as per the <code>equals()</code> contract
     * in {@link java.util.Set#equals(java.lang.Object)}.
     * <p>
     * This method is useful for implementing <code>Set</code> when you cannot
     * extend AbstractSet. The method takes Collection instances to enable other
     * collection types to use the Set implementation algorithm.
     * <p>
     * The relevant text (slightly paraphrased as this is a static method) is:
     * <blockquote>
     * <p>Two sets are considered equal if they have
     * the same size, and every member of the first set is contained in
     * the second. This ensures that the <tt>equals</tt> method works
     * properly across different implementations of the <tt>Set</tt>
     * interface.</p>
     *
     * <p>
     * This implementation first checks if the two sets are the same object:
     * if so it returns <tt>true</tt>.  Then, it checks if the two sets are
     * identical in size; if not, it returns false. If so, it returns
     * <tt>a.containsAll((Collection) b)</tt>.</p>
     * </blockquote>
     *
     * @see java.util.Set
     * @param set1  the first set, may be null
     * @param set2  the second set, may be null
     * @return whether the sets are equal by value comparison
     */
    public static boolean isEqualSet(final Collection<?> set1, final Collection<?>
set2) {
        if (set1 == set2) {
            return true;
        }
        if (set1 == null || set2 == null || set1.size() != set2.size()) {
            return false;
        }

        return set1.containsAll(set2);
    }

{code}

The problem is that in the last return statement, the function relies on the 
containsAll method of the class of the set1, which can be any type of collection.

The following test harness shows the performance degradation between 
using a list and using a set as a first parameter, across different collection sizes.

{code}
    public static void 	setUtilsisEqualSetTest(boolean original) {

	int N=500000;
	ArrayList<Integer> firstArrayList=new ArrayList<Integer>();
	ArrayList<Integer> secondArrayList=new ArrayList<Integer>();
	for(int i=0;i<N;++i) {

	    firstArrayList.add(new Integer(i));
	    secondArrayList.add(new Integer((N-1)-i));
	    
	}

	SetUtils.isEqualSet(original?firstArrayList:(new HashSet<Integer>(firstArrayList)),secondArrayList);
	
{code}

In the following table "Original" corresponds to the time taken by 
the existing implementation of SetUtils::isEqualSet, "Repaired" to the time 
taken by the function invoked with the first parameter converted into a 
set, and "Speed-up" to the speed-up obtained after the repair.


       N	Original(s)	Repaired(s)	Speed-up(X)

      10	          1.01		    1.02		     0.99
     100	          1.02		    1.02		     1
    1000	          1.04		    1.04		     1
   10000	          1.15		    1.09		     1.05
  100000	          9.33		    1.11		     8.40
  500000	        > 300	            1.26		     > 238.09


One way to deal with this would be to call the CollectionUtils::containsAll method instead
of Collection::containsAll, since it has a linear time implementation instead of quadratic,
and can handle the types of isEqualSet.

Another solution would be to include a warning to the user in the documentation that the first
parameter should have a fast containment method when possible.

 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message