uima-user mailing list archives

From Marshall Schor <...@schor.com>
Subject Re: can't remove duplicate Annotations with Java Set Collection
Date Tue, 18 Nov 2014 14:37:36 GMT
Hi Kameron,

Based on this code snippet, the two "cat" annotations you create are "different"
under the HashSet definition, because they correspond to two distinct UIMA
Annotations.  You could, for instance, update one of them and not the other;
that is the sense in which they are distinct.  In the case below, the two "cat"
annotations also have different begin and end offsets.
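
To make the distinction concrete, here is a minimal sketch.  It does not use
UIMA at all: Span is a hypothetical stand-in for an Annotation (so the snippet
runs without UIMA on the classpath), and it simply inherits the identity-based
equals() and hashCode() from Object.  Under that assumption, a HashSet keeps
both "cat" entries:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a UIMA Annotation; NOT the real class.
// It inherits Object's identity-based equals() and hashCode().
class Span {
  final String documentText;
  final int begin;
  final int end;

  Span(String documentText, int begin, int end) {
    this.documentText = documentText;
    this.begin = begin;
    this.end = end;
  }

  String getCoveredText() {
    return documentText.substring(begin, end);
  }
}

class IdentityDemo {
  // Creates one Span per occurrence of word and returns how many a HashSet keeps.
  static int distinctSpans(String text, String word) {
    Set<Span> set = new HashSet<>();
    int index = text.indexOf(word);
    while (index >= 0) {
      set.add(new Span(text, index, index + word.length()));
      index = text.indexOf(word, index + 1);
    }
    return set.size();
  }

  public static void main(String[] args) {
    // Two occurrences of "cat" give two distinct objects, so this prints 2.
    System.out.println(distinctSpans("bird, cat, bush, cat", "cat"));
  }
}
```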

I'm guessing that your goal was to have one of the two cat annotations
dropped.

You could do that with your hash set approach, if you defined "equal" to mean
that just the covered text of the annotations is equal.

Here's one way to do this:  Create a "cover object" (a wrapper) for your
annotations that contains a reference to the annotation and defines equals and
hashCode (you have to define these together, so that equal objects always have
equal hash codes).  The easy way to do this is using Eclipse - define a
new class: e.g.

import org.apache.uima.jcas.tcas.Annotation;

public class MyAnnotationWithSpecialEquals {
  public final Annotation annotation;   // the covered annotation

  public MyAnnotationWithSpecialEquals(Annotation annotation) {
    this.annotation = annotation;
  }
}

and then use Eclipse to define equals and hashCode:  go to Menu -> Source ->
Generate hashCode() and equals()
and have it generate them based on just "annotation".  The result will not (yet)
be what you want - Eclipse should add two methods like this:

  @Override
  public int hashCode() {
    final int prime = 31;
    int result = 1;
    result = prime * result + ((annotation == null) ? 0 : annotation.hashCode());
    return result;
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj)
      return true;
    if (obj == null)
      return false;
    if (getClass() != obj.getClass())
      return false;
    MyAnnotationWithSpecialEquals other = (MyAnnotationWithSpecialEquals) obj;
    if (annotation == null) {
      if (other.annotation != null)
        return false;
    } else if (!annotation.equals(other.annotation))
      return false;
    return true;
  }

Now, to get these to be the definitions you want, which depend only on the
covered text, modify these as follows:

First, for hashCode, use only the string covered text:

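  @Override
  public int hashCode() {
    final int prime = 31;
    int result = 1;
    // hash only the covered text; tolerate a null annotation or null covered text
    String coveredText = (annotation == null) ? null : annotation.getCoveredText();
    result = prime * result + ((coveredText == null) ? 0 : coveredText.hashCode());
    return result;
  }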

and for equals: replace the test for the annotations being "equal" with a test
for annotation.getCoveredText() being "equal",
with some additional edge case testing in case of nulls:

  @Override
  public boolean equals(Object obj) {
    if (this == obj)
      return true;
    if (obj == null)
      return false;
    if (getClass() != obj.getClass())
      return false;
    MyAnnotationWithSpecialEquals other = (MyAnnotationWithSpecialEquals) obj;
    if (annotation == null || other.annotation == null)
      return annotation == other.annotation;  // equal only if both are null
    String coveredText = annotation.getCoveredText();
    if (coveredText == null)
      return other.annotation.getCoveredText() == null;  // handle special case if covered text is null
    // coveredText is not null
    return coveredText.equals(other.annotation.getCoveredText());
  }
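
As a rough illustration of how such a cover object could be used to drop
duplicates, here is a self-contained sketch.  To keep it runnable without UIMA,
it again uses a hypothetical stand-in class (Span) instead of Annotation, and a
shortened wrapper (SpanByText) that plays the role of
MyAnnotationWithSpecialEquals; both names are made up for this example, and the
wrapper assumes the wrapped object is never null.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a UIMA Annotation; NOT the real class.
class Span {
  final String documentText;
  final int begin;
  final int end;

  Span(String documentText, int begin, int end) {
    this.documentText = documentText;
    this.begin = begin;
    this.end = end;
  }

  String getCoveredText() {
    return documentText.substring(begin, end);
  }
}

// Wrapper whose equality depends only on the covered text (span assumed non-null).
class SpanByText {
  final Span span;

  SpanByText(Span span) {
    this.span = span;
  }

  @Override
  public int hashCode() {
    return Objects.hashCode(span.getCoveredText());
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) return true;
    if (!(obj instanceof SpanByText)) return false;
    return Objects.equals(span.getCoveredText(), ((SpanByText) obj).span.getCoveredText());
  }
}

class DedupDemo {
  // Keeps the first span seen for each distinct covered text, in input order.
  static List<Span> dedupe(List<Span> spans) {
    Set<SpanByText> seen = new LinkedHashSet<>();
    for (Span s : spans) {
      seen.add(new SpanByText(s));  // add() ignores a wrapper equal to one already present
    }
    List<Span> result = new ArrayList<>();
    for (SpanByText wrapper : seen) {
      result.add(wrapper.span);
    }
    return result;
  }

  public static void main(String[] args) {
    String text = "bird, cat, bush, cat";
    List<Span> spans = new ArrayList<>();
    spans.add(new Span(text, 0, 4));    // bird
    spans.add(new Span(text, 6, 9));    // cat
    spans.add(new Span(text, 11, 15));  // bush
    spans.add(new Span(text, 17, 20));  // cat (duplicate text)
    for (Span s : dedupe(spans)) {
      System.out.println(s.getCoveredText() + " @ " + s.begin);
    }
  }
}
```

With real annotations, the same pattern would apply: wrap each Annotation from
your temporary list, add the wrappers to the set, and then call addToIndexes()
only on the annotations that survive.  A LinkedHashSet is used here only so that
the first occurrence wins in a predictable order; a plain HashSet would also
remove the duplicates.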

HTH.  -Marshall


On 11/17/2014 4:49 PM, Kameron Cole wrote:
>
> Input text:
>
> ------------------------------
>
> bird, cat, bush, cat
>
> ----------------------------
>
> Create the Annotations:
>
> -------------------------------
> docText = aJCas.getDocumentText();
>
> int index = docText.indexOf("cat");
> while (index >= 0) {
>    int begin = index;
>    int end = begin + 3;
>    Animal animal = new Animal(aJCas);
>    animal.setBegin(begin);
>    animal.setEnd(end);
>    animal.addToIndexes();
>
>    index = docText.indexOf("cat", index + 1);
> }
>
> index = docText.indexOf("bird");
> while (index >= 0) {
>    int begin = index;
>    int end = begin + 4;
>    Animal animal = new Animal(aJCas);
>    animal.setBegin(begin);
>    animal.setEnd(end);
>    animal.addToIndexes();
>
>    index = docText.indexOf("bird", index + 1);
> }
>
> index = docText.indexOf("bush");
> while (index >= 0) {
>    int begin = index;
>    int end = begin + 4;
>    Vegetable vegetable = new Vegetable(aJCas);
>    vegetable.setBegin(begin);
>    vegetable.setEnd(end);
>    vegetable.addToIndexes();
>
>    index = docText.indexOf("bush", index + 1);
> }
> ------------------------------------------------------
>
> From: Marshall Schor <msa@schor.com>
> To: user@uima.apache.org
> Date: 11/17/2014 04:35 PM
> Subject: Re: can't remove duplicate Annotations with Java Set Collection
>
> --------------------------------------------------------------------------------
>
>
>
> Hi,
>
> Two Feature Structures are considered "equal" in the sense used by HashSet, if
> fs1.equals(fs2).   The definition of "equals" for feature structures is: they
> are equal if they refer to the same underlying CAS, and the same "spot" in
> the CAS Heap.
>
> How did you create the Annotations that you think are "equal" in the HashSet
> sense?
>
> Here's an example of two annotations which are "equal" in the UIMA sorted index
> sense, but unequal in the HashSet sense.
>
>    // create an instance of Annotation in myJCas, with begin = 0 and end = 4
>    Annotation fs1 = new Annotation(myJCas, 0, 4);
>    // create a second, distinct instance of Annotation in myJCas, with begin = 0 and end = 4
>    Annotation fs2 = new Annotation(myJCas, 0, 4);
>
> These will be "equal" in the UIMA sense - the same kind of annotation, in the
> same CAS, with the same feature values, but will be two distinct feature
> structures, so HashSet will consider them to be unequal.
>
> Could this be what is happening in your case?  Please respond so we can see if
> there's another straight-forward solution that does what you're looking for.
>
> -Marshall
> on 11/17/2014 2:59 PM, Kameron Cole wrote:
> > Hello,
> >
> > I am trying to get rid of duplicates in the FSIndex.  I thought a very
> > clever way to do this would be to just push them into a Set Collection in
> > Java, which does not allow duplicates. This is very (very) standard Java:
> >
> > ArrayList al = new ArrayList();
> > // add elements to al, including duplicates
> > HashSet hs = new HashSet();
> > hs.addAll(al);
> > al.clear();
> > al.addAll(hs);
> >
> > This list will contain no duplicates.
> >
> > However, I am not getting this to work in my UIMA code:
> >
> >
> > AnnotationIndex<Annotation> idx = aJCas.getAnnotationIndex();
> >
> > System.out.println("Index size is: "+idx.size());
> >
> > ArrayList<Annotation> tempList = new ArrayList<Annotation>(idx.size());
> >
> > FSIterator it  = idx.iterator();
> >
> > //load the Annotations into a temporary list.  includes duplicates
> >
> > while(it.hasNext())
> > {
> >
> > tempList.add((Annotation) it.next());
> >
> > }
> >
> > Iterator tempIt = tempList.iterator();
> >
> > // remove all Annotations from the index.  this works fine
> >
> > while(tempIt.hasNext()){
> > ((Annotation) tempIt.next()).removeFromIndexes(aJCas);
> > }
> >
> > // push tempList into HashSet
> >
> > HashSet<Annotation> hs = new HashSet<Annotation>();
> >
> > hs.addAll(tempList);
> >
> > // this should not allow duplicates
> >
> > // size should be less than the size of the FSIndex by the number of
> > // duplicates.  It is not.  This is the main problem.
> > System.out.println("HS length: "+hs.size());
> >
> > tempList.clear();
> >
> > tempList.addAll(hs);
> >
> > System.out.println("templist length: "+tempList.size());
> >
> >
> > // this should now be the clean list
> > Iterator<Annotation> it2 = tempList.iterator();
> >
> >
> > while(it2.hasNext()){
> > it2.next().addToIndexes(aJCas);
> > }
>
>

