lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From trupti mulajkar <acp0...@sheffield.ac.uk>
Subject Indexing Problems
Date Fri, 07 Apr 2006 23:31:37 GMT
hello,

i have modified the IndexFiles.java to read the document numbers from within the
TREC files, which are also being read correctly, however the index fails to
create the .cfs file.
thus the search query does not return the correct document number.
any suggestions how this can be sorted?

cheers,
trupti mulajkar
MSc Advanced Computer Science


Quoting Chris Hostetter <hossman_lucene@fucit.org>:

> 
> first off: you should double check the correctness ofyour customized
> similarity class.  I'm pretty sure it's resulting in a differnet set of
> matches then the DefaultSimilarity because your tf function returns 0f
> regardless of wether there is a match.  (when i said "every function
> returns 0 or 1" i ment you acctually have to have the if (match) return 1
> else 0 logic at a minimum)
> 
> second: these times really shocked me untill i realized you were
> reporting the sum of the execution times for all 10000 iterations (whew!)
> 
> third: for queries this simple, i don't think you're going find much
> differneces in speed between the differnet models, but if you elliminate
> the Similarity as a variable (compare only #1 and #3 for each query), you
> can see that HitCollector was fairly consistently "a little faster" then
> using Hits (but i'll admit, i thought it would be "more faster")
> 
> I think if you tried bigger, more complex queries (nested booleans, with
> mandatory/required clauses, span queries, booleans with 20 or more
> optional clauses) you might see a bigger discrepency.
> 
> it's also not clear how big your test index is, as doug has pointed out
> before, it can make a big difference in benchmarking.
> 
> 
> : For each query I ran the ran 4 tests of 10,000 searches:
> : 1) using hits.length to get the counts and the standard similarity
> : 2) using hits.length to get the counts and a custom similarity
> : 3) using HitCollector to get the counts and the standard similarity
> : 4) using HitCollector to get the counts and a custom similarity
> :
> : The custom similarity returns 0 for all methods.
> : The results are kind of surprising. It doesn't look like the speed up is
> : enough to make the change to our application.
> :
> : Here are the results, the test class is also attached:
> :
> : time (mills) 14095, useHC=false, standardSimilarity=true, count=47,
> : query=abstract_recent:(genetically modified organism)
> : time (mills) 15406, useHC=false, standardSimilarity=false, count=0,
> : query=abstract_recent:(genetically modified organism)
> : time (mills) 13768, useHC=true, standardSimilarity=true, count=47,
> : query=abstract_recent:(genetically modified organism)
> : time (mills) 14404, useHC=true, standardSimilarity=false, count=47,
> : query=abstract_recent:(genetically modified organism)
> :
> :
> : time (mills) 6790, useHC=false, standardSimilarity=true, count=5776,
> : query=lname:smith
> : time (mills) 4901, useHC=false, standardSimilarity=false, count=0,
> : query=lname:smith
> : time (mills) 5209, useHC=true, standardSimilarity=true, count=5776,
> : query=lname:smith
> : time (mills) 5578, useHC=true, standardSimilarity=false, count=5776,
> : query=lname:smith
> :
> :
> : time (mills) 47, useHC=false, standardSimilarity=true, count=0,
> : query=lname:dfdsalkfjdsalkjflsa
> : time (mills) 37, useHC=false, standardSimilarity=false, count=0,
> : query=lname:dfdsalkfjdsalkjflsa
> : time (mills) 41, useHC=true, standardSimilarity=true, count=0,
> : query=lname:dfdsalkfjdsalkjflsa
> : time (mills) 198, useHC=true, standardSimilarity=false, count=0,
> : query=lname:dfdsalkfjdsalkjflsa
> :
> :
> :
> :
> : On Thursday 06 April 2006 15:19, Chris Hostetter wrote:
> : > : I need the count, and don't need the docs at this point. If I had a
> : > : simple query, (e.g. "book") I can use docFreq(), and it's lightning
> : > : fast. If I just run it as a query it's much slower. I'm just
> : > : wondering if I did a custom scorer / similarity / hitcollector, how
> : > : much faster than a query could I get it? Or is there a better way?
> : >
> : > A custom HitCollector would be the first big win, something like this
> : > would probably work...
> : >
> : >    final int[] count = new int[1]
> : >    searcher.search(query, new HitCollector() {
> : >        public void collect(int doc, float score) {
> : >           count[0]++;
> : >        }
> : >     });
> : >     return count[0]
> : >
> : > otherways you might be able to shave time would be...
> : >
> : >   * if your query can be represented as in simple set logic logic (you
> : >     don't seem to be concerned with score) then implimenting it as a
> : >     Filter may be faster becuase it won't do any score calculation, just
> a
> : >     simple match/no-match (which is what you seem to want) ... but it
> will
> : >     definitely take up more memory then a query
> : >
> : >   * if you customize your similarity so that every function returns 0 or
> 1
> : >     you might shave a little bit of time off by skipping some of the
> math
> : >     equations ... but i really doubt it.
> : >
> : >
> : >
> : >
> : > -Hoss
> : >
> : >
> : > ---------------------------------------------------------------------
> : > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> : > For additional commands, e-mail: java-user-help@lucene.apache.org
> :
> 
> 
> 
> -Hoss
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message