lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Why does this search fail?
Date Wed, 27 Aug 2014 15:07:57 GMT
It's not documented, but Google does seem to support a trailing wildcard,
though only if the prefix has at least six characters. For shorter prefixes,
it seems to just drop the wildcard.

Google also uses "*" in quoted phrases to mean a placeholder for any single 
term. That's documented.

See:
https://support.google.com/websearch/answer/136861?hl=en

It also seems to support "**" in a quoted phrase to mean one or more 
arbitrary terms. This isn't documented, but seems to work.

-- Jack Krupansky

-----Original Message----- 
From: Milind
Sent: Wednesday, August 27, 2014 10:51 AM
To: java-user@lucene.apache.org
Subject: Re: Why does this search fail?

Yes.  If you search Google for alphare and then for alphare*, you get two
different results.  Sorry for the contrived example.  I just tried searching
for alpharetta and went backwards, deleting characters.


On Wed, Aug 27, 2014 at 10:01 AM, Benson Margulies <benson@basistech.com>
wrote:

> Does google actually support "*"?
>
>
>
> On Wed, Aug 27, 2014 at 9:54 AM, Milind <milindr@gmail.com> wrote:
>
> > I see.  This is going to be extremely difficult to explain to end users.
> > It doesn't work as they would expect.  Some of the tokenizing rules are
> > already somewhat confusing.  Their expectation is that it should work
> > the way their searches work in Google.
> >
> > It's difficult enough to recognize that the period is treated as a token
> > separator because it sits between a digit and a letter (as opposed to
> > two digits or two letters).  So I'd have expected that C0001.DevNm00*
> > would effectively become a search for C0001 OR DevNm00*.  But now,
> > because of the presence of the wildcard, the query is treated as one
> > term and the period is not a separator.  That's actually good, but the
> > fact that the field value is still indexed as two terms makes wildcard
> > searches very unintuitive.  I don't suppose that I can do anything about
> > making a wildcard search use multiple terms when they are joined by a
> > separator character.  But is there any way that I can force the query to
> > go through an analyzer prior to doing the search?
> >
> >
> >
> >
> > On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky <jack@basetechnology.com>
> > wrote:
> >
> > > Sorry, but you can only use a wildcard on a single term.
> > > "C0001.DevNm001" gets indexed as two terms, "c0001" and "devnm001", so
> > > your wildcard won't match any term (at least in this case.)
> > >
> > > Also, if your query term includes a wildcard, it will not be fully
> > > analyzed. Some filters such as lower case are defined as "multi-term",
> > > so they will be performed, but the standard tokenizer is not being
> > > called, so the dot remains and this whole term is treated as one term,
> > > unlike the index analysis.
> > >
> > > -- Jack Krupansky
> > >
> > > -----Original Message----- From: Milind
> > > Sent: Tuesday, August 26, 2014 12:24 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Why does this search fail?
> > >
> > >
> > > I have a field with the value C0001.DevNm001.  If I search for
> > >
> > >    C0001.DevNm001 --> Get Hit
> > >    DevNm00*       --> Get Hit
> > >    C0001.DevNm00*  --> Get No Hit
> > >
> > > The field gets tokenized on the period since it's surrounded by a
> > > letter and a number.  The query gets evaluated as a prefix query.  I'd
> > > have thought that this should have found the document.  Any clues on
> > > why this doesn't work?
> > >
> > > The full code is below.
> > >
> > >        Directory theDirectory = new RAMDirectory();
> > >        Version theVersion = Version.LUCENE_47;
> > >        Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
> > >        IndexWriterConfig theConfig =
> > >            new IndexWriterConfig(theVersion, theAnalyzer);
> > >        IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);
> > >
> > >        String theFieldName = "Name";
> > >        String theFieldValue = "C0001.DevNm001";
> > >        Document theDocument = new Document();
> > >        theDocument.add(
> > >            new TextField(theFieldName, theFieldValue, Field.Store.YES));
> > >        theWriter.addDocument(theDocument);
> > >        theWriter.close();
> > >
> > >        String theQueryStr = theFieldName + ":C0001.DevNm00*";
> > >        Query theQuery =
> > >            new QueryParser(theVersion, theFieldName, theAnalyzer)
> > >                .parse(theQueryStr);
> > >        System.out.println(theQuery.getClass() + ", " + theQuery);
> > >        IndexReader theIndexReader = DirectoryReader.open(theDirectory);
> > >        IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
> > >        TopScoreDocCollector collector =
> > >            TopScoreDocCollector.create(10, true);
> > >        theSearcher.search(theQuery, collector);
> > >        ScoreDoc[] theHits = collector.topDocs().scoreDocs;
> > >        System.out.println("Hits found: " + theHits.length);
> > >
> > > Output:
> > >
> > > class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
> > > Hits found: 0
> > >
> > >
> > > --
> > > Regards
> > > Milind
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > > For additional commands, e-mail: java-user-help@lucene.apache.org
> > >
> > >
> >
> >
> > --
> > Regards
> > Milind
> >
>



-- 
Regards
Milind 
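
To make the mismatch discussed in this thread concrete, here is a minimal
sketch in plain Java with no Lucene dependency. Everything in it is an
assumption made for illustration: the class PrefixSketch and the helpers
toyAnalyze, prefixMatches and splitPrefixMatches are hypothetical names, and
toyAnalyze only lowercases and splits on '.', which is far cruder than what
StandardAnalyzer actually does. It models the index as a plain list of terms
and shows why a prefix that still contains the dot matches nothing, plus one
possible workaround: split the wildcard query the same way the indexed value
was split, then require every piece to match.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;

public class PrefixSketch {

    // Toy stand-in for index-time analysis (an assumption, not Lucene's rules):
    // lowercase the text, then split it on '.'.
    static List<String> toyAnalyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("\\.")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    // A prefix query matches only if some single indexed term starts with the prefix.
    static boolean prefixMatches(Collection<String> indexedTerms, String prefix) {
        for (String term : indexedTerms) {
            if (term.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical workaround: drop the trailing '*', analyze the rest like the
    // indexed value, require exact matches for all pieces but the last, and a
    // prefix match for the last piece. Term order and adjacency are ignored.
    static boolean splitPrefixMatches(Collection<String> indexedTerms, String wildcardQuery) {
        String body = wildcardQuery.endsWith("*")
                ? wildcardQuery.substring(0, wildcardQuery.length() - 1)
                : wildcardQuery;
        List<String> pieces = toyAnalyze(body);
        if (pieces.isEmpty()) {
            return false;
        }
        for (int i = 0; i < pieces.size() - 1; i++) {
            if (!indexedTerms.contains(pieces.get(i))) {
                return false;
            }
        }
        return prefixMatches(indexedTerms, pieces.get(pieces.size() - 1));
    }

    public static void main(String[] args) {
        List<String> terms = toyAnalyze("C0001.DevNm001");
        System.out.println(terms);                                       // [c0001, devnm001]
        System.out.println(prefixMatches(terms, "c0001.devnm00"));       // false
        System.out.println(prefixMatches(terms, "devnm00"));             // true
        System.out.println(splitPrefixMatches(terms, "C0001.DevNm00*")); // true
    }
}
```

In Lucene terms, the workaround corresponds roughly to building a
BooleanQuery from a TermQuery for "c0001" and a PrefixQuery for "devnm00"
instead of handing the whole string to the query parser. Requiring every
piece is only one possible choice; an OR of the pieces, as suggested earlier
in the thread, would be another.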



