lucene-java-user mailing list archives

From Milind <mili...@gmail.com>
Subject Re: Why does this search fail?
Date Wed, 27 Aug 2014 17:15:02 GMT
Thanks for the Google link.  I wasn't aware of it.  Most of it is very
intuitive and, most importantly, consistent.


On Wed, Aug 27, 2014 at 11:07 AM, Jack Krupansky <jack@basetechnology.com>
wrote:

> It's not documented, but Google does seem to support a trailing wildcard,
> but only if the prefix has at least six characters. For shorter prefixes,
> it seems to just drop the wildcard.
>
> Google also uses "*" in quoted phrases to mean a placeholder for any
> single term. That's documented.
>
> See:
> https://support.google.com/websearch/answer/136861?hl=en
>
> It also seems to support "**" in a quoted phrase to mean one or more
> arbitrary terms. This isn't documented, but seems to work.
>
>
> -- Jack Krupansky
>
> -----Original Message----- From: Milind
> Sent: Wednesday, August 27, 2014 10:51 AM
> To: java-user@lucene.apache.org
> Subject: Re: Why does this search fail?
>
>
> Yes.  If you search Google for alphare and then for alphare*, you get two
> different result sets.  Sorry for the contrived example.  I just tried
> searching for alpharetta and went backwards, deleting characters.
>
>
> On Wed, Aug 27, 2014 at 10:01 AM, Benson Margulies <benson@basistech.com>
> wrote:
>
>> Does Google actually support "*"?
>>
>>
>>
>> On Wed, Aug 27, 2014 at 9:54 AM, Milind <milindr@gmail.com> wrote:
>>
>> > I see.  This is going to be extremely difficult to explain to end users.
>> > It doesn't work as they would expect.  Some of the tokenizing rules are
>> > already somewhat confusing.  Their expectation is that it should work the
>> > way their searches work in Google.
>> >
>> > It's difficult enough to recognize that because the period is surrounded
>> > by a digit and a letter (as opposed to two digits or two letters), it gets
>> > tokenized.  So I'd have expected that C0001.DevNm00* would effectively
>> > become a search for C0001 OR DevNm00*.  But now, because of the presence
>> > of the wildcard, it's considered as one term and the period does not
>> > split it.  That's actually good, but the fact that the indexed value is
>> > still treated as two terms makes wildcard searches very unintuitive.  I
>> > don't suppose that I can do anything to make a wildcard search use
>> > multiple terms when they are joined by a separator character.  But is
>> > there any way that I can force the query to go through an analyzer prior
>> > to doing the search?
>> >
>> >
>> >
>> >
>> > On Tue, Aug 26, 2014 at 4:21 PM, Jack Krupansky <jack@basetechnology.com>
>> > wrote:
>> >
>> > > Sorry, but you can only use a wildcard on a single term. "C0001.DevNm001"
>> > > gets indexed as two terms, "c0001" and "devnm001", so your wildcard won't
>> > > match any term (at least in this case).
>> > >
>> > > Also, if your query term includes a wildcard, it will not be fully
>> > > analyzed. Some filters, such as lower case, are defined as "multi-term",
>> > > so they will be performed, but the standard tokenizer is not called, so
>> > > the dot remains and the whole query string is treated as one term, unlike
>> > > in index analysis.
>> > >
>> > > -- Jack Krupansky
>> > >
>> > > -----Original Message----- From: Milind
>> > > Sent: Tuesday, August 26, 2014 12:24 PM
>> > > To: java-user@lucene.apache.org
>> > > Subject: Why does this search fail?
>> > >
>> > >
>> > > I have a field with the value C0001.DevNm001.  If I search for
>> > >
>> > >    C0001.DevNm001 --> Get Hit
>> > >    DevNm00*       --> Get Hit
>> > >    C0001.DevNm00*  --> Get No Hit
>> > >
>> > > The field gets tokenized on the period since it's surrounded by a letter
>> > > and a number.  The query gets evaluated as a prefix query.  I'd have
>> > > thought that this should have found the document.  Any clues on why this
>> > > doesn't work?
>> > >
>> > > The full code is below.
>> > >
>> > >        // Index a single document in memory with StandardAnalyzer.
>> > >        Directory theDirectory = new RAMDirectory();
>> > >        Version theVersion = Version.LUCENE_47;
>> > >        Analyzer theAnalyzer = new StandardAnalyzer(theVersion);
>> > >        IndexWriterConfig theConfig =
>> > >            new IndexWriterConfig(theVersion, theAnalyzer);
>> > >        IndexWriter theWriter = new IndexWriter(theDirectory, theConfig);
>> > >
>> > >        String theFieldName = "Name";
>> > >        String theFieldValue = "C0001.DevNm001";
>> > >        Document theDocument = new Document();
>> > >        theDocument.add(
>> > >            new TextField(theFieldName, theFieldValue, Field.Store.YES));
>> > >        theWriter.addDocument(theDocument);
>> > >        theWriter.close();
>> > >
>> > >        // Parse the wildcard query with the same analyzer and run it.
>> > >        String theQueryStr = theFieldName + ":C0001.DevNm00*";
>> > >        Query theQuery =
>> > >            new QueryParser(theVersion, theFieldName, theAnalyzer)
>> > >                .parse(theQueryStr);
>> > >        System.out.println(theQuery.getClass() + ", " + theQuery);
>> > >        IndexReader theIndexReader = DirectoryReader.open(theDirectory);
>> > >        IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
>> > >        TopScoreDocCollector collector =
>> > >            TopScoreDocCollector.create(10, true);
>> > >        theSearcher.search(theQuery, collector);
>> > >        ScoreDoc[] theHits = collector.topDocs().scoreDocs;
>> > >        System.out.println("Hits found: " + theHits.length);
>> > >
>> > > Output:
>> > >
>> > > class org.apache.lucene.search.PrefixQuery, Name:c0001.devnm00*
>> > > Hits found: 0
>> > >
>> > >
>> > > --
>> > > Regards
>> > > Milind
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> > > For additional commands, e-mail: java-user-help@lucene.apache.org
>> > >
>> > >
>> >
>> >
>> > --
>> > Regards
>> > Milind
>> >
>>
>>
>
>
> --
> Regards
> Milind
>
>
>
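
The mismatch discussed in the quoted thread can be illustrated without
Lucene at all. The sketch below is only a simplified model, not Lucene's
actual analysis chain: the class WildcardSketch and its methods indexTerms,
prefixMatches and splitPrefixMatches are hypothetical names, and the
"tokenizer" simply lower-cases and splits on the period, which is enough to
reproduce the two terms reported above ("c0001" and "devnm001"). It shows
why a prefix containing the period matches no single term, and one possible
workaround: split the user's input the same way the field was split, require
the leading pieces as exact terms, and apply the prefix only to the last
piece.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class WildcardSketch {

    // Simplified stand-in for index-time analysis (an assumption, not the
    // real StandardAnalyzer rules): lower-case, then split on the period.
    static List<String> indexTerms(String fieldValue) {
        return Arrays.asList(fieldValue.toLowerCase(Locale.ROOT).split("\\."));
    }

    // A prefix query is compared against each indexed term on its own,
    // so the prefix must fit inside one term.
    static boolean prefixMatches(List<String> terms, String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            if (term.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical workaround: split the input like the index did, require
    // every leading piece as an exact term and the last piece as a prefix.
    // Term positions are ignored, so this is looser than a phrase match.
    static boolean splitPrefixMatches(List<String> terms, String input) {
        String stripped = input.endsWith("*")
                ? input.substring(0, input.length() - 1)
                : input;
        String[] pieces = stripped.toLowerCase(Locale.ROOT).split("\\.");
        for (int i = 0; i < pieces.length - 1; i++) {
            if (!terms.contains(pieces[i])) {
                return false;
            }
        }
        return prefixMatches(terms, pieces[pieces.length - 1]);
    }

    public static void main(String[] args) {
        List<String> terms = indexTerms("C0001.DevNm001");
        System.out.println(terms);                                       // [c0001, devnm001]
        System.out.println(prefixMatches(terms, "DevNm00"));             // true
        System.out.println(prefixMatches(terms, "C0001.DevNm00"));       // false
        System.out.println(splitPrefixMatches(terms, "C0001.DevNm00*")); // true
    }
}
```

In Lucene itself, the same idea would roughly correspond to building a
BooleanQuery that combines a TermQuery for "c0001" and a PrefixQuery for
"devnm00" as required clauses. That mapping is a suggestion, not something
confirmed in the thread.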


-- 
Regards
Milind
