uima-user mailing list archives

From Kameron Cole <kameronc...@us.ibm.com>
Subject Re: can't remove duplicate Annotations with Java Set Collection
Date Mon, 17 Nov 2014 21:49:45 GMT

Input text:

------------------------------

bird, cat, bush, cat

----------------------------

Create the Annotations:

-------------------------------
docText = aJCas.getDocumentText();

		 int index = docText.indexOf("cat");
		 while(index >= 0) {
			 int begin = index;
				int end = begin+3;
				Animal animal = new Animal(aJCas);
				animal.setBegin(begin);
				animal.setEnd(end);
				animal.addToIndexes();

		    index = docText.indexOf("cat", index+1);
		 }

		 index = docText.indexOf("bird");
		 while(index >= 0) {
			 int begin = index;
				int end = begin+4;
				Animal animal = new Animal(aJCas);
				animal.setBegin(begin);
				animal.setEnd(end);
				animal.addToIndexes();

		    index = docText.indexOf("bird", index+1);
		 }

		 index = docText.indexOf("bush");
		 while(index >= 0) {
			 int begin = index;
				int end = begin+4;
				Vegetable animal = new Vegetable(aJCas);
				animal.setBegin(begin);
				animal.setEnd(end);
				animal.addToIndexes();

		    index = docText.indexOf("bush", index+1);
		 }
------------------------------------------------------
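
The three loops above follow the same pattern, so the offset search can be factored out. This is only a sketch in plain Java with no UIMA dependency: the class name SpanFinder and the method findSpans are hypothetical, and the place where an Animal or Vegetable would be created is marked with a comment.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanFinder {

    /** Returns a {begin, end} pair for every occurrence of word in text. */
    static List<int[]> findSpans(String text, String word) {
        List<int[]> spans = new ArrayList<>();
        int index = text.indexOf(word);
        while (index >= 0) {
            spans.add(new int[] { index, index + word.length() });
            // Search for the same word again, starting one character further on.
            index = text.indexOf(word, index + 1);
        }
        return spans;
    }

    public static void main(String[] args) {
        String docText = "bird, cat, bush, cat";
        for (String word : new String[] { "cat", "bird", "bush" }) {
            for (int[] span : findSpans(docText, word)) {
                // In the annotator, this is where the Animal or Vegetable
                // would be created with begin = span[0] and end = span[1].
                System.out.println(word + " " + span[0] + "-" + span[1]);
            }
        }
    }
}
```

Because the search word is passed in once, the follow-up search cannot accidentally look for a different word than the first search did.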

Kameron Arthur Cole
Watson Content Analytics Applications and Support
email: kameroncole@us.ibm.com | Tel: 305-389-8512

From:	Marshall Schor <msa@schor.com>
To:	user@uima.apache.org
Date:	11/17/2014 04:35 PM
Subject:	Re: can't remove duplicate Annotations with Java Set Collection



Hi,

Two Feature Structures are considered "equal" in the sense used by HashSet
if fs1.equals(fs2).  The definition of "equals" for feature structures is:
they are equal if they refer to the same underlying CAS and the same "spot"
in the CAS Heap.

How did you create the Annotations that you think are "equal" in the
HashSet sense?

Here's an example of two annotations which are "equal" in the UIMA sorted
index sense, but unequal in the HashSet sense.

    // Each line creates an instance of Annotation in myJCas,
    // with begin = 0 and end = 4.
    Annotation fs1 = new Annotation(myJCas, 0, 4);
    Annotation fs2 = new Annotation(myJCas, 0, 4);

These will be "equal" in the UIMA sense (the same kind of annotation, in the
same CAS, with the same feature values), but they are two distinct feature
structures, so HashSet will consider them to be unequal.
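
The effect can be reproduced without UIMA. In the sketch below, Span is a hypothetical stand-in for an annotation, not a UIMA class. It does not override equals or hashCode, so HashSet compares instances by object identity. That is an assumption chosen to mimic the "same spot in the same CAS" rule described above; it is not how UIMA implements equality.

```java
import java.util.HashSet;
import java.util.Set;

public class IdentityEqualityDemo {

    /** Hypothetical stand-in for an annotation; equals/hashCode are NOT overridden. */
    static final class Span {
        final int begin;
        final int end;

        Span(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Adds every span to a HashSet and reports how many the set kept. */
    static int distinctCount(Span... spans) {
        Set<Span> set = new HashSet<>();
        for (Span s : spans) {
            set.add(s);
        }
        return set.size();
    }

    public static void main(String[] args) {
        Span fs1 = new Span(0, 4);
        Span fs2 = new Span(0, 4); // same values, different object

        // Two separate objects with identical values: the set keeps both.
        System.out.println("fs1 and fs2: " + distinctCount(fs1, fs2));
        // The same object added twice: the set keeps one.
        System.out.println("fs1 twice:   " + distinctCount(fs1, fs1));
    }
}
```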

Could this be what is happening in your case?  Please respond so we can see
if there's another straightforward solution that does what you're looking
for.

-Marshall
on 11/17/2014 2:59 PM, Kameron Cole wrote:
> Hello,
>
> I am trying to get rid of duplicates in the FSIndex.  I thought a very
> clever way to do this would be to just push them into a Set Collection in
> Java, which does not allow duplicates. This is very (very) standard Java:
>
> ArrayList al = new ArrayList();
> // add elements to al, including duplicates
> HashSet hs = new HashSet();
> hs.addAll(al);
> al.clear();
> al.addAll(hs);
>
> This list will contain no duplicates.
>
> However, I am not getting this to work in my UIMA code:
>
>
> AnnotationIndex<Annotation> idx = aJCas.getAnnotationIndex();
>
> System.out.println("Index size is: "+idx.size());
>
> ArrayList<Annotation> tempList = new ArrayList<Annotation>(idx.size());
>
> 		 FSIterator it  = idx.iterator();
>
> //load the Annotations into a temporary list.  includes duplicates
>
> 		 while(it.hasNext())
> 		 {
>
> 		 		 tempList.add((Annotation) it.next());
>
> 		 }
>
> Iterator tempIt = tempList.iterator();
>
> // remove all Annotations from the index.  this works fine
>
> 		 		 while(tempIt.hasNext()){
> 		 		 		 ((Annotation) tempIt.next()).removeFromIndexes(aJCas);
> 		 		 }
>
> // push tempList into HashSet
>
> 		 HashSet<Annotation> hs = new HashSet<Annotation>();
>
> 		 hs.addAll(tempList);
>
> // this should not allow duplicates
>
> // size should be less than the size of the FSIndex by the number of
> // duplicates.  It is not.  This is the main problem.
> System.out.println("HS length: "+hs.size());
>
> tempList.clear();
>
> 		 tempList.addAll(hs);
>
> 		 System.out.println("templist length: "+tempList.size());
>
>
> // this should now be the clean list
> Iterator<Annotation> it2 = tempList.iterator();
>
>
> 		 		 while(it2.hasNext()){
> 		 		 		 it2.next().addToIndexes(aJCas);
> 		 		 }
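
One possible way to get the de-duplication the quoted code is after is to key each item on the values that should count as "the same", rather than relying on equals(). The sketch below is plain Java: Span, dedup and sample are hypothetical names, and Span only stands in for an annotation. With real annotations the key could presumably be built from getType().getName(), getBegin(), getEnd() or getCoveredText(), but that mapping is an assumption, not something confirmed in this thread. Note that in the sample text the two "cat" spans have different offsets, so a key on type plus offsets keeps both, while a key on type plus covered text merges them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class DedupByKey {

    /** Hypothetical stand-in for an annotation: type name, offsets, covered text. */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        final String text;

        Span(String type, int begin, int end, String docText) {
            this.type = type;
            this.begin = begin;
            this.end = end;
            this.text = docText.substring(begin, end);
        }
    }

    /** Keeps the first item seen for each key, preserving the original order. */
    static <T> List<T> dedup(List<T> items, Function<T, String> keyOf) {
        Map<String, T> firstSeen = new LinkedHashMap<>();
        for (T item : items) {
            firstSeen.putIfAbsent(keyOf.apply(item), item);
        }
        return new ArrayList<>(firstSeen.values());
    }

    /** The spans the annotator above would produce for the sample input. */
    static List<Span> sample() {
        String doc = "bird, cat, bush, cat";
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("Animal", 0, 4, doc));
        spans.add(new Span("Animal", 6, 9, doc));
        spans.add(new Span("Vegetable", 11, 15, doc));
        spans.add(new Span("Animal", 17, 20, doc));
        return spans;
    }

    public static void main(String[] args) {
        List<Span> byOffsets = dedup(sample(), s -> s.type + ":" + s.begin + ":" + s.end);
        List<Span> byText = dedup(sample(), s -> s.type + ":" + s.text);
        System.out.println("keyed on offsets: " + byOffsets.size());
        System.out.println("keyed on text:    " + byText.size());
    }
}
```

In the annotator, the items dropped by dedup would be the ones left out when the cleaned list is added back to the indexes.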

