lucene-java-user mailing list archives

From Herb Roitblat <herb.roitb...@orcatec.com>
Subject Re: Dimension mismatch exception
Date Fri, 21 Mar 2014 15:03:54 GMT
Computing the cosine between two documents requires the vectors for
the two documents to be the same length (same number of elements, same
dimensionality, not the norm).  The length of a vector is the size of
the vocabulary of the whole set it was built from.  Your two sets will
inevitably have different numbers of tokens in their vocabularies, which
is where the 44,375 != 596,263 in your exception comes from.  Also, if
the sets are indexed independently, then the word at each position in
the two vectors is going to be different.  The number of documents does
not matter, nor does the number of sentences.
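
To make the mismatch concrete, here is a minimal sketch in plain Java. It
uses no Lucene or commons-math classes; the class name VocabularyDemo, its
helper methods and the toy documents are all made up for illustration. It
builds one term-to-position map per set, as the posted code does, and then
one map over both sets together.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

class VocabularyDemo {

    // Give every distinct term a position, in sorted term order.
    static Map<String, Integer> buildVocabulary(List<String[]> docs) {
        TreeSet<String> sorted = new TreeSet<>();
        for (String[] doc : docs) {
            sorted.addAll(Arrays.asList(doc));
        }
        Map<String, Integer> vocab = new LinkedHashMap<>();
        for (String term : sorted) {
            vocab.put(term, vocab.size());
        }
        return vocab;
    }

    // Term-frequency vector whose length is the vocabulary size.
    static double[] toVector(String[] doc, Map<String, Integer> vocab) {
        double[] v = new double[vocab.size()];
        for (String term : doc) {
            Integer pos = vocab.get(term);
            if (pos != null) {
                v[pos] += 1.0;
            }
        }
        return v;
    }

    // Cosine of two vectors; refuses vectors of different dimension.
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException(a.length + " != " + b.length);
        }
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Toy data standing in for the two document sets.
        List<String[]> setA = Arrays.asList(
                new String[] {"apple", "banana"},
                new String[] {"banana", "cherry"});
        List<String[]> setB = Arrays.asList(
                new String[] {"banana", "date", "elder"},
                new String[] {"cherry", "date", "fig"});

        // Separate vocabularies: different sizes, and "banana" sits at
        // position 1 in one map and position 0 in the other.
        Map<String, Integer> vocabA = buildVocabulary(setA);
        Map<String, Integer> vocabB = buildVocabulary(setB);
        System.out.println(vocabA + " vs " + vocabB);
        try {
            cosine(toVector(setA.get(1), vocabA), toVector(setB.get(0), vocabB));
        } catch (IllegalArgumentException e) {
            System.out.println("dimension mismatch: " + e.getMessage());
        }

        // One vocabulary over both sets: same length, same positions.
        List<String[]> all = new ArrayList<>(setA);
        all.addAll(setB);
        Map<String, Integer> shared = buildVocabulary(all);
        System.out.println(cosine(toVector(setA.get(1), shared),
                                  toVector(setB.get(0), shared)));
    }
}
```

With the separate maps, the vectors differ in length and the same word
lands at different positions. With the shared map, both problems go away.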

There are several ways to address this problem.  One way is to index all 
of the documents into one index but keep track of which set is which.  
Then you can run the combinations and compute the cosines. Uwe knows 
more about how the term vectors are represented in Lucene.  You may have 
to do some extra work to get them into a form that you can use to 
compute cosines.
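
As a rough picture of the single-collection idea, here is a sketch in plain
Java rather than Lucene. The class LabeledCorpusCosine, the Doc type and the
"A"/"B" labels are hypothetical names chosen for the example. In Lucene the
label would presumably be a field on each document, which is not shown
here. Each document keeps a term-to-frequency map, so the cosine is computed
by term rather than by position.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LabeledCorpusCosine {

    // One document in the combined collection, tagged with its source set.
    static final class Doc {
        final String set;
        final Map<String, Integer> termFreqs = new HashMap<>();

        Doc(String set, String... tokens) {
            this.set = set;
            for (String t : tokens) {
                termFreqs.merge(t, 1, Integer::sum);
            }
        }
    }

    // Euclidean norm of a sparse term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    // Cosine over term-keyed maps: only terms present in both contribute.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other.doubleValue();
            }
        }
        double na = norm(a), nb = norm(b);
        return (na == 0 || nb == 0) ? 0.0 : dot / (na * nb);
    }

    // Rows follow the "A" documents, columns the "B" documents,
    // both in the order they appear in the combined collection.
    static double[][] crossSetCosines(List<Doc> corpus) {
        List<Doc> a = new ArrayList<>();
        List<Doc> b = new ArrayList<>();
        for (Doc d : corpus) {
            if ("A".equals(d.set)) {
                a.add(d);
            } else if ("B".equals(d.set)) {
                b.add(d);
            }
        }
        double[][] out = new double[a.size()][b.size()];
        for (int i = 0; i < a.size(); i++) {
            for (int j = 0; j < b.size(); j++) {
                out[i][j] = cosine(a.get(i).termFreqs, b.get(j).termFreqs);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Doc> corpus = new ArrayList<>();
        corpus.add(new Doc("A", "x", "y"));
        corpus.add(new Doc("B", "x", "y"));
        corpus.add(new Doc("B", "z"));
        corpus.add(new Doc("A", "x", "x", "z"));
        double[][] cos = crossSetCosines(corpus);
        for (int i = 0; i < cos.length; i++) {
            for (int j = 0; j < cos[i].length; j++) {
                System.out.println("cosine between " + i + " " + j + " is " + cos[i][j]);
            }
        }
    }
}
```

Keying by term is a design choice for this sketch. It trades the dense
fixed-length vector for a map lookup, so the two documents never need to
agree on positions.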

That's a lot of combinations, by the way, 10,000 x 20,000 = 200 million 
comparisons.  It's going to take a while.
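
For a sense of scale, the arithmetic can be sketched as below. The class and
method names are made up for the example. The 10,995 and 23,992 figures are
the size_A and size_B values from the posted code, and the memory figure
counts only the 8 bytes per double, ignoring array overhead.

```java
class ComparisonBudget {

    // Number of (A, B) document pairs to compare.
    static long pairCount(long docsA, long docsB) {
        return docsA * docsB;
    }

    // Lower bound on the bytes needed to hold every cosine as a double.
    static long denseMatrixBytes(long docsA, long docsB) {
        return pairCount(docsA, docsB) * Double.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(pairCount(10_000, 20_000));        // 200000000
        System.out.println(pairCount(10_995, 23_992));        // 263792040
        System.out.println(denseMatrixBytes(10_995, 23_992)); // 2110336320
    }
}
```

At roughly 2.1 GB for the full matrix, it may be more practical to write
each cosine to the output file as it is computed instead of keeping all of
them in memory.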

See this page for some suggestions on how to do it: 
http://stackoverflow.com/questions/1844194/get-cosine-similarity-between-two-documents-in-lucene

On 3/21/2014 1:50 AM, Stefy D. wrote:
> Hello Herb. Thank you very much for your reply. I want to have the cosine for each a
> and each b. I'm using code for Lucene I found online, which I will post below.
>
> Hello Uwe. Thank you very much for replying. I am using a class DocVector and then a
> class in which I try to compute the similarities from documents that were indexed in two folders.
> Here is the code for the two classes.
>
> Could you please help me? What am I doing wrong? Thank you very much!
>
> package NewApp;
>
> import extractout.*;
> import java.util.Map;
> import org.apache.commons.math3.linear.OpenMapRealVector;
> import org.apache.commons.math3.linear.RealVectorFormat;
> import org.apache.commons.math3.linear.SparseRealVector;
>
> /**
>   *
>   * @author Stefy
>   */
> class DocVector {
>      
>       public Map<String,Integer> terms;
>        public SparseRealVector vector;
>        
>        public DocVector(Map<String,Integer> terms) {
>          this.terms = terms;
>          this.vector = new OpenMapRealVector(terms.size());
>        }
>        
>        public void setEntry(String term, int freq) {
>          if (terms.containsKey(term)) {
>            int pos = terms.get(term);
>            vector.setEntry(pos, (double) freq);
>          }
>        }
>        
>        public void normalize() {
>          double sum = vector.getL1Norm();
>          vector = (SparseRealVector) vector.mapDivide(sum);
>        }
>        
>      @Override
>        public String toString() {
>          RealVectorFormat formatter = new RealVectorFormat();
>          return formatter.format(vector);
>        }
> }
>
> ---------------------------------------------------------------------------------------
> public class testCosine {
>
>      static String in_B = "/local/march_exp/in_B";
>      static String data_B = "/local/march_exp/B_split100_EN";
>      static String in_A = "/local/march_exp/in_A";
>      static String data_A = "/local/march_exp/A_split100_EN";
>      static File indexDir_B, dataDir_B, indexDir_A, dataDir_A;
>      static IndexReader reader_A, reader_B;
>      static Directory dir_B, dir_A;
>      static int size_B = 23992, size_A = 10995;
>
>      private static double getCosineSimilarity(DocVector d1, DocVector d2) {
>          return (d1.vector.dotProduct(d2.vector))
>                  / (d1.vector.getNorm() * d2.vector.getNorm());
>      }
>
>      public static void testSimilarityUsingCosine() throws Exception {
>
>          indexDir_A = new File(in_A);
>          dir_A = FSDirectory.open(indexDir_A);
>          reader_A = IndexReader.open(dir_A);
>
>          indexDir_B = new File(in_B);
>          dir_B = FSDirectory.open(indexDir_B);
>          reader_B = IndexReader.open(dir_B);
>
>          Map<String, Integer> terms_A = new HashMap<String, Integer>();
>          TermEnum termEnum_A = reader_A.terms(new Term("contents"));
>          Map<String, Integer> terms_B = new HashMap<String, Integer>();
>          TermEnum termEnum_B = reader_B.terms(new Term("contents"));
>
>          int pos = 0;
>          while (termEnum_A.next()) {
>              Term term = termEnum_A.term();
>              if (!"contents".equals(term.field())) {
>                  break;
>              }
>              terms_A.put(term.text(), pos++);
>          }
>
>          pos = 0;
>          while (termEnum_B.next()) {
>              Term term = termEnum_B.term();
>              if (!"contents".equals(term.field())) {
>                  break;
>              }
>              terms_B.put(term.text(), pos++);
>          }
>
>
>          int[] docIds_A = new int[size_A];
>          DocVector[] docs_A = new DocVector[docIds_A.length];
>          int i = 0;
>          for (int docId : docIds_A) {
>              TermFreqVector[] tfvs = reader_A.getTermFreqVectors(docId);
>              docs_A[i] = new DocVector(terms_A);
>              for (TermFreqVector tfv : tfvs) {
>                  String[] termTexts = tfv.getTerms();
>                  int[] termFreqs = tfv.getTermFrequencies();
>                  for (int j = 0; j < termTexts.length; j++) {
>                      docs_A[i].setEntry(termTexts[j], termFreqs[j]);
>                  }
>              }
>              docs_A[i].normalize();
>              i++;
>          }
>
>          int[] docIds_B = new int[size_B];
>          DocVector[] docs_B = new DocVector[docIds_B.length];
>          i = 0;
>          for (int docId : docIds_B) {
>              TermFreqVector[] tfvs = reader_B.getTermFreqVectors(docId);
>              docs_B[i] = new DocVector(terms_B);
>              for (TermFreqVector tfv : tfvs) {
>                  String[] termTexts = tfv.getTerms();
>                  int[] termFreqs = tfv.getTermFrequencies();
>                  for (int j = 0; j < termTexts.length; j++) {
>                      docs_B[i].setEntry(termTexts[j], termFreqs[j]);
>                  }
>              }
>              docs_B[i].normalize();
>              i++;
>          }
>
>          FileWriter fstream_c = new FileWriter("/local/march_exp/COS/COSINE_.txt");
>          BufferedWriter writer_c = new BufferedWriter(fstream_c);
>
>          double[][] cosimvect = new double[size_A][size_B];
>          for (i = 0; i < size_A; i++) {
>              for (int j = 0; j < size_B; j++) {
>                  cosimvect[i][j] = getCosineSimilarity(docs_A[i], docs_B[j]);
>                  System.out.println("cosine between " + i + " " + j + " is " + cosimvect[i][j]);
>              }
>          }
>          writer_c.close();
>          reader_B.close();
>          reader_A.close();
>          dir_B.close();
>          dir_A.close();
>      }
>
>      public static void main(String[] args) throws Exception {
>
>          testSimilarityUsingCosine();
>      }
> }
>
> On Friday, March 21, 2014 12:14 AM, Uwe Schindler <uwe@thetaphi.de> wrote:
>   
> Hi Stefy,
>
> the stack trace you posted has nothing to do with Apache Lucene. It looks like you are
> using some commons-math3 classes here, but no Lucene code at all. So I think your question
> might be better asked on the commons-math mailing list, unless you have some Lucene code around,
> too. If this is the case, you should give more information on how you use Lucene.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>
>> -----Original Message-----
>> From: Stefy D. [mailto:tsuki_stefy@yahoo.com]
>> Sent: Thursday, March 20, 2014 10:05 PM
>> To: java-user@lucene.apache.org
>> Subject: Dimension mismatch exception
>>
>> Dear all,
>>
>> I am trying to compute the cosine similarity between several documents. I
>> have an indexed directory A made using 10000 files and another indexed
>> directory B made using 20000 files. All the indexed documents from both
>> directories have the same length (100 sentences). I want to get the cosine
>> similarity between documents from directory A and documents from
>> directory B. I have used the code from here but on the two indexed
>> directories. So I use something like getCosineSimilarity(docs_A[i], docs_B[j]);
>>
>> I get the following error:
>> Exception in thread "main" org.apache.commons.math3.exception.DimensionMismatchException: 44,375 != 596,263
>>       at org.apache.commons.math3.linear.RealVector.checkVectorDimensions(RealVector.java:179)
>>       at org.apache.commons.math3.linear.RealVector.checkVectorDimensions(RealVector.java:165)
>>       at org.apache.commons.math3.linear.RealVector.dotProduct(RealVector.java:307)
>>       at NewApp.testCosine.getCosineSimilarity(testCosine.java:57)
>>
>> Please help me. Thank you very much!
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

