lucene-java-user mailing list archives

From Huntsman84 <tpgarci...@gmail.com>
Subject Re: RegexQuery Incomplete Results
Date Mon, 11 May 2009 16:06:12 GMT

That's it!!!

The problem was with the regular expression, the one I need is ".*IN"!!

Thank you so much, I was turning mad... =)
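
For anyone reading this thread later, here is a minimal sketch of why ".*IN" finds phrases that ".IN" misses. It uses only java.util.regex, not Lucene. Matcher.lookingAt() is the call the thread says the default implementation uses. Matcher.find() is my assumed stand-in for the unanchored Jakarta RE.match() behaviour described further down. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: compares start-anchored matching with match-anywhere
// matching on the same four phrases used in the quoted test program.
class RegexAnchoringDemo {

    // Counts phrases where the pattern matches starting at the first character.
    // lookingAt() does not need to consume the whole phrase, only a prefix.
    static int countLookingAt(String regex, String[] phrases) {
        Pattern p = Pattern.compile(regex);
        int hits = 0;
        for (String phrase : phrases) {
            if (p.matcher(phrase).lookingAt()) {
                hits++;
            }
        }
        return hits;
    }

    // Counts phrases where the pattern matches anywhere in the phrase.
    // Assumption: this approximates the unanchored behaviour of Jakarta Regexp.
    static int countFind(String regex, String[] phrases) {
        Pattern p = Pattern.compile(regex);
        int hits = 0;
        for (String phrase : phrases) {
            if (p.matcher(phrase).find()) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] phrases = { "Knowing yourself", "Old clinic", "INSIDE", "Not INSIDE" };
        String[] regexes = { ".in", ".*in", ".IN", ".*IN" };
        for (String regex : regexes) {
            System.out.printf("%-5s lookingAt=%d find=%d%n",
                              regex,
                              countLookingAt(regex, phrases),
                              countFind(regex, phrases));
        }
    }
}
```

With start-anchored matching, "." can only stand for the first character of the phrase, so ".IN" needs "IN" at positions two and three. A leading ".*" lets the match skip over whatever comes before "IN", which is why it behaves like a "contains" search even when the match is anchored at the start.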


Ian Lea wrote:
> 
> The little self-contained program below runs regex queries for a few
> regexps against a few phrases for both the java.util and jakarta
> regexp packages.
> 
> Output when run with lucene 2.4.1 and jakarta-regexp 1.5 is
> 
> Added Knowing yourself
> Added Old clinic
> Added INSIDE
> Added Not INSIDE
> 
> Default RegexCapabilities=org.apache.lucene.search.regex.JavaUtilRegexCapabilities@0
> 
> org.apache.lucene.search.regex.JavaUtilRegexCapabilities@0
> 0 hits for text:.in
> 2 hits for text:.*in
> 0 hits for text:.IN
> 2 hits for text:.*IN
> org.apache.lucene.search.regex.JakartaRegexpCapabilities@0
> 2 hits for text:.in
> 2 hits for text:.*in
> 1 hits for text:.IN
> 2 hits for text:.*IN
> 
> Hope that helps.
> 
> --
> Ian.
> 
> 
> import org.apache.lucene.index.*;
> import org.apache.lucene.store.*;
> import org.apache.lucene.document.*;
> import org.apache.lucene.analysis.*;
> import org.apache.lucene.analysis.standard.*;
> import org.apache.lucene.search.*;
> import org.apache.lucene.search.regex.*;
> 
> public class luctest {
> 
>     public static void main(String[] _args) throws Exception {
> 	RAMDirectory rdir = new RAMDirectory();
> 	IndexWriter writer = new IndexWriter(rdir, new StandardAnalyzer(), true);
> 	String[] docterms = { "Knowing yourself",
> 			      "Old clinic",
> 			      "INSIDE",
> 			      "Not INSIDE" };
> 
> 	for (String s : docterms) {
> 	    Document d = new Document();
> 	    d.add(new Field("text",
> 			    s,
> 			    Field.Store.YES,
> 			    Field.Index.NOT_ANALYZED));
> 	    writer.addDocument(d);
> 	    System.out.printf("Added %s\n", s);
> 	}
> 	writer.close();
> 
> 	IndexSearcher searcher = new IndexSearcher(rdir);
> 	String[] queries = { ".in", ".*in", ".IN", ".*IN" };
> 	RegexCapabilities[] rcaps = { new JavaUtilRegexCapabilities(),
> 				      new JakartaRegexpCapabilities() };
> 	RegexQuery qx = new RegexQuery(new Term("x", "x"));
> 	System.out.printf("\nDefault RegexCapabilities=%s\n\n",
> 			  qx.getRegexImplementation());
> 	for (RegexCapabilities rcap : rcaps) {
> 	    System.out.println(rcap);
> 	    for (String s : queries) {
> 		Term t = new Term("text", s);
> 		RegexQuery q = new RegexQuery(t);
> 		q.setRegexImplementation(rcap);
> 		Hits h = searcher.search(q);
> 		System.out.printf("%s hits for %s\n",
> 				  h.length(),
> 				  q.toString());
> 	    }
> 	}
>     }
> }
> 
> 
> On Mon, May 11, 2009 at 1:39 PM, Huntsman84 <tpgarcia84@gmail.com> wrote:
>>
>> The RegexQuery class uses that package, and for that reason the expression
>> matches.
>>
>> If my records contained only one word each, this code would work, but I need
>> to apply that regular expression to a phrase...
>>
>>
>> Ian Lea wrote:
>>>
>>> The default regex package is java.util.regex and I can't see anywhere
>>> that you tell it to use the Jakarta regexp package.  So I don't think
>>> that ".in" will match.  Also, you are storing your contents field as
>>> NOT_ANALYZED so you will need to be wary of case sensitivity.  Maybe
>>> this is what you want, but maybe not.
>>>
>>>
>>> --
>>> Ian.
>>>
>>>
>>> On Mon, May 11, 2009 at 9:00 AM, Huntsman84 <tpgarcia84@gmail.com> wrote:
>>>>
>>>> This is the code for searching:
>>>>
>>>> String index = "index";
>>>> String field = "contents";
>>>> IndexReader reader = IndexReader.open(index);
>>>> Searcher searcher = new IndexSearcher(reader);
>>>>
>>>> System.out.println("Enter query: ");
>>>> String line = ".IN.";//in jakarta regexp this is like * IN *
>>>> RegexQuery rxquery = new RegexQuery(new Term(field,line));
>>>> Hits hits = searcher.search(rxquery);
>>>>
>>>> if(hits!=null){
>>>>    for(int k = 0; k<100 && k<hits.length(); k++){
>>>>        if(hits.doc(k)!=null)
>>>>            System.out.println(hits.doc(k).getField("contents").stringValue());
>>>>    }
>>>> }
>>>>
>>>>
>>>>
>>>> And this is the part of creating the index:
>>>>
>>>>
>>>> File directory = new File("index");
>>>> IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true,
>>>>                            IndexWriter.MaxFieldLength.LIMITED);
>>>> // getRecords() returns a list of record values from the database; all of them are phrases
>>>> List<String> records = getRecords();
>>>> Iterator<String> i = records.iterator();
>>>> while(i.hasNext()){
>>>>     Document doc = new Document();
>>>>     doc.add(new Field(field, i.next(), Field.Store.YES,
>>>>                       Field.Index.NOT_ANALYZED));
>>>>     writer.addDocument(doc);
>>>> }
>>>> writer.optimize();
>>>> writer.close();
>>>>
>>>>
>>>>
>>>> This code works as I want, but it only matches the first word of the
>>>> phrase. I think the problem is in how the index is built, but I don't know
>>>> how to fix it...
>>>>
>>>> Any ideas?
>>>>
>>>> Thank you so much!!
>>>>
>>>>
>>>>
>>>> Steven A Rowe wrote:
>>>>>
>>>>> On 5/8/2009 at 9:13 AM, Ian Lea wrote:
>>>>>> I'm surprised that it matches either - don't you need ".*in" where .*
>>>>>> means match any character zero or more times?  See the javadoc for
>>>>>> java.util.regex.Pattern, or for Jakarta Regexp if you are using that
>>>>>> package.
>>>>>>
>>>>>> Unless you're an expert in regexps it is probably worth playing with
>>>>>> them outside your lucene code to start with e.g. with simple
>>>>>> String.matches(regexp) calls.  They can take some getting used to.
>>>>>> And try to avoid anything with backslashes if you can!
>>>>>
>>>>> The java.util.regex.Pattern implementation (the default RegexQuery
>>>>> implementation) actually uses Matcher.lookingAt(), which is equivalent
>>>>> to prepending a "^" anchor to the beginning of the pattern, so if
>>>>> Huntsman84 is using the default implementation, then I agree with Ian:
>>>>> I'm surprised it matches either.
>>>>>
>>>>> However, the Jakarta Regexp implementation uses RE.match(), which does
>>>>> *not* require a beginning-of-string match.
>>>>>
>>>>> Huntsman84, are you using the Jakarta Regexp implementation?  If so,
>>>>> then like you, I'm surprised it's not matching both :).
>>>>>
>>>>> It would be useful to see some real code, including how you index your
>>>>> records.
>>>>>
>>>>> Steve
>>>>>
>>>>>> On Fri, May 8, 2009 at 1:42 PM, Huntsman84 <tpgarcia84@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi,
>>>>>> >
>>>>>> > I am using RegexQuery for searching in a set of records which are
>>>>>> > phrases of several words each. My aim is to find any phrase that
>>>>>> > contains the given group of letters (e.g. "in"). For that case,
>>>>>> > I am building the query with the regular expression ".in.", so it
>>>>>> > should return all phrases that contain "in", but the search only
>>>>>> > matches with the first word of the phrase.
>>>>>> >
>>>>>> > For example, if my records are "Knowing yourself" and "Old
>>>>>> > clinic", the correct search would return 2 matches, but it only
>>>>>> > matches with "Knowing yourself".
>>>>>> >
>>>>>> > How could I fix this?
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://www.nabble.com/RegexQuery-Incomplete-Results-tp23445235p23478720.html
>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>
>>>>
>>>>
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/RegexQuery-Incomplete-Results-tp23445235p23482532.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>>
> 
> 

-- 
View this message in context: http://www.nabble.com/RegexQuery-Incomplete-Results-tp23445235p23486350.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

