lucene-solr-user mailing list archives

From Jack Krupansky <jack.krupan...@gmail.com>
Subject Re: Text search NGram
Date Mon, 07 Mar 2016 15:45:39 GMT
Absolutely, but so what? Nothing in any Solr query is going to be based on
character position.

Also, adding and removing characters in a char filter is a really bad idea
if you might want to do highlighting since the token character position
would not line up with the original source text.
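
To make that concrete, here is a minimal sketch of the field type from the original post with the char filter simply dropped. It assumes nothing else in your setup depends on that char filter; everything else is copied unchanged from your config, so treat it as a starting point rather than a tested schema.

```xml
<!-- Sketch only: same chain as the original post, minus the charFilter. -->
<fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0">
  <analyzer type="index">
    <!-- The whitespace tokenizer already skips runs of white space. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
  </analyzer>
</fieldType>
```

With this chain the whitespace tokenizer still emits the same two tokens, at positions 1 and 2, for "Microsoft office" whether one space or three separate the words. Only the start and end offsets of "office" differ, and those offsets keep pointing at the right place in the source text.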

-- Jack Krupansky

On Mon, Mar 7, 2016 at 10:33 AM, G, Rajesh <rg@cebglobal.com> wrote:

> Hi Jack,
>
>
>
> Please correct me if I am wrong. I added the char filter for the following
> reason.
>
> In the analyzer of the Solr UI, I entered "Microsoft office" as the Field
> Value (Index). WhitespaceTokenizerFactory produces the first result below,
> where "office" starts at 10. If I add extra spaces, say 2 more, "office"
> starts at 12. Should it not start at 10?
>
>
>
> text       raw_bytes                     start  end  positionLength  type  position
> microsoft  [6d 69 63 72 6f 73 6f 66 74]  0      9    1               word  1
> office     [6f 66 66 69 63 65]           10     16   1               word  2
>
> text       raw_bytes                     start  end  positionLength  type  position
> microsoft  [6d 69 63 72 6f 73 6f 66 74]  0      9    1               word  1
> office     [6f 66 66 69 63 65]           12     18   1               word  2
>
> Thanks
>
> Rajesh
>
>
>
>
>
> Corporate Executive Board India Private Limited. Registration No:
> U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building
> No.10 DLF Cyber City, Gurgaon, Haryana-122002, India..
>
>
>
> This e-mail and/or its attachments are intended only for the use of the
> addressee(s) and may contain confidential and legally privileged
> information belonging to CEB and/or its subsidiaries, including CEB
> subsidiaries that offer SHL Talent Measurement products and services. If
> you have received this e-mail in error, please notify the sender and
> immediately, destroy all copies of this email and its attachments. The
> publication, copying, in whole or in part, or use or dissemination in any
> other way of this e-mail and attachments by anyone other than the intended
> person(s) is prohibited.
>
>
>
> -----Original Message-----
> From: Jack Krupansky [mailto:jack.krupansky@gmail.com]
> Sent: Monday, March 7, 2016 8:24 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Text search NGram
>
>
>
> The charFilter isn't doing anything useful - the white space tokenizer
> will ignore extra white space anyway.
>
>
>
> -- Jack Krupansky
>
>
>
> On Mon, Mar 7, 2016 at 5:44 AM, G, Rajesh <rg@cebglobal.com> wrote:
>
>
>
> > Hi Team,
> >
> > We have the field type below, and we have indexed the values "title":
> > "Microsoft Visual Studio 2006" and "title": "Microsoft Visual Studio
> > 8.0.61205.56 (2005)".
> >
> > When I search for title:(Microsoft Visual AND Studio AND 2005), I get
> > Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> > Microsoft Visual Studio 2006 as the first record. I want Microsoft Visual
> > Studio 8.0.61205.56 (2005) listed first, since the user searched for
> > Microsoft Visual Studio 2005. Can you please help?
> >
> > We are using NGram so that it takes care of misspelled or jumbled words
> > (this works as expected), e.g.
> > searching Micrs Visual Studio returns Microsoft Visual Studio
> > searching Visual Microsoft Studio returns Microsoft Visual Studio
> >
>
> > <fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0">
> >   <analyzer type="index">
> >     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
> >   </analyzer>
> > </fieldType>
>
