From: "Moray McConnachie" <mmcconna@oxford-analytica.com>
To: lucene-net-user@lucene.apache.org
Subject: RE: Disparity between API usage and Luke
Date: Wed, 27 Jun 2012 17:39:02 +0100
I don't have time to write self-contained examples, but here are our keyword-analyzer-related classes.

Caveat: we are programming against an older version of Lucene.NET, and I haven't been keeping up with API changes, so this may not work in newer versions. However, the principles should be the same, although there may now be better ways of achieving this - a number of things we "rolled our own" for with earlier Lucene versions have since ended up with better approaches using fewer custom classes.

	/// <summary>
	/// Trivial case-sensitive string analyzer for simple fields.
	/// </summary>
	public class lucSingleStringAnalyzer : Lucene.Net.Analysis.Analyzer
	{
		/// <summary>
		/// Instantiate.
		/// </summary>
		public lucSingleStringAnalyzer() : base()
		{
		}

		/// <summary>
		/// The worker - simply applies lucFullTermTokenizer to the text reader.
		/// </summary>
		/// <param name="fieldName">Name of the field</param>
		/// <param name="reader">TextReader</param>
		/// <returns>Standard Lucene TokenStream</returns>
		public override Lucene.Net.Analysis.TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
		{
			return new lucFullTermTokenizer(reader);
		}
	}

	/// <summary>
	/// Analyses a field by reading it all as a single string and lower-casing it.
	/// </summary>
	public class lucLowerCaseSingleStringAnalyzer : Lucene.Net.Analysis.Analyzer
	{
		/// <summary>
		/// Instantiate.
		/// </summary>
		public lucLowerCaseSingleStringAnalyzer() : base()
		{
		}

		/// <summary>
		/// Returns a LowerCaseFilter on our custom tokenizer, i.e. the whole field
		/// is returned as a single lower-case string, just one token.
		/// </summary>
		/// <param name="fieldName">Field name of the stream</param>
		/// <param name="reader">TextReader</param>
		/// <returns>Standard Lucene.NET TokenStream</returns>
		public override Lucene.Net.Analysis.TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
		{
			return new Lucene.Net.Analysis.LowerCaseFilter(new lucFullTermTokenizer(reader));
		}
	}

	/// <summary>
	/// A class to read a full string and return all of it as a single Lucene Token.
	/// Simple fields where the whole keyword is the relevant search term, not parts
	/// of it (e.g. "United States" should only be indexed as "United States", not
	/// under "States" and "United"), can be tokenized with this tokenizer.
	/// </summary>
	public class lucFullTermTokenizer : Lucene.Net.Analysis.Tokenizer
	{
		/// <summary>
		/// Whether I have already read everything there is to read.
		/// </summary>
		private bool blRead;

		/// <summary>
		/// Instantiate.
		/// </summary>
		public lucFullTermTokenizer() : base()
		{
			blRead = false;
		}

		/// <summary>
		/// Instantiate with a text reader.
		/// </summary>
		/// <param name="input">The TextReader passed on by Lucene</param>
		public lucFullTermTokenizer(System.IO.TextReader input) : base(input)
		{
			blRead = false;
		}

		/// <summary>
		/// Returns the next Token. This class returns a single Token per call, so
		/// Next returns the string value of the field the first time, and null thereafter.
		/// </summary>
		/// <returns>A new Lucene.Net Token, or null if there is nothing to read</returns>
		public override Lucene.Net.Analysis.Token Next()
		{
			if (!blRead)
			{
				string str = base.input.ReadToEnd();
				blRead = true;
				// The end offset is exclusive, so it is str.Length, not str.Length - 1.
				return new Lucene.Net.Analysis.Token(str, 0, str.Length);
			}
			else
			{
				return null;
			}
		}
	}

// AND HERE'S THE EXAMPLE OF THE PERFIELDANALYZERWRAPPER USING THE ABOVE

	/// <summary>
	/// Module containing generic helping hands for Lucene-related stuff.
	/// </summary>
	public static class lucUtils
	{
		public static Lucene.Net.Analysis.Analyzer lucSpecialAnalyzer
		{
			get
			{
				// Default analyser is standard - in fact we use our own
				// customised Porter-stem analyzer here.
				Lucene.Net.Analysis.PerFieldAnalyzerWrapper lucAnalyzer =
					new Lucene.Net.Analysis.PerFieldAnalyzerWrapper(new StandardAnalyzer());
				Lucene.Net.Analysis.Analyzer lcKeywordAnalyzer = new lucLowerCaseSingleStringAnalyzer();
				Lucene.Net.Analysis.Analyzer KeywordAnalyzer = new lucSingleStringAnalyzer();
				lucAnalyzer.AddAnalyzer("id", lcKeywordAnalyzer);
				lucAnalyzer.AddAnalyzer("product", KeywordAnalyzer);
				lucAnalyzer.AddAnalyzer("country", lcKeywordAnalyzer);
				return lucAnalyzer;
			}
		}
	}

Then we can use the query parser with the same analyser.

M.

-----Original Message-----
From: Rob Cecil [mailto:rob.cecil@gmail.com]
Sent: 27 June 2012 17:07
To: lucene-net-user@lucene.apache.org
Subject: Re: Disparity between API usage and Luke

Moray,

Thanks, I did catch that and have been thinking about it. I finally have the LIA book, so some of this stuff is starting to make more sense. Would you be willing to show your Keyword Analyzer class?

thanks

On Wed, Jun 27, 2012 at 1:57 AM, Moray McConnachie <mmcconna@oxford-analytica.com> wrote:

> Rob, just in case you missed it in the dialogue earlier, let me
> recommend to your attention the PerFieldAnalyserWrapper mentioned by someone else.
> This allows you to specify different analysers for different fields,
> but presents as a single analyser. So during both indexing and searching
> you get the benefit of the analyser and query parser, and can index and
> search all fields with the same analyser - there is therefore no problem
> with having fields which are not analysed.
>
> For fields like Id we use our own version of a keyword analyser, which
> converts to lower case both on index and search but otherwise
> preserves the term entirely.
>
> The only slight problem is that it makes it harder to use tools like Luke,
> which use the standard analyser by default.
>
> Moray
> -------------------------------------
> Moray McConnachie
> Director of IT      +44 1865 261 600
> Oxford Analytica    http://www.oxan.com
>
> ----- Original Message -----
> From: Rob Cecil [mailto:rob.cecil@gmail.com]
> Sent: Tuesday, June 26, 2012 06:50 PM
> To: lucene-net-user@lucene.apache.org
> Subject: Disparity between API usage and Luke
>
> If I run a query against my index using QueryParser to query a field:
>
> var query = _parser.Parse("Id:BAUER*");
> var topDocs = searcher.Search(query, 10);
> Assert.AreEqual(count, topDocs.TotalHits);
>
> I get 0 for my TotalHits, yet in Luke the same query phrase yields 15
> results. What am I doing wrong? I use the StandardAnalyzer both to
> create the index and to query.
>
> The field is defined as:
>
> new Field("Id", myObject.Id, Field.Store.YES, Field.Index.NOT_ANALYZED)
>
> and is a string field. The result set back from Luke looks like this
> (screencap):
>
> http://screencast.com/t/NooMK2Rf
>
> Thanks!
>
> ---------------------------------------------------------
> Disclaimer
>
> This message and any attachments are confidential and/or privileged.
> If this has been sent to you in error, please do not use, retain or
> disclose them, and contact the sender as soon as possible.
>
> Oxford Analytica Ltd
> Registered in England: No. 1196703
> 5 Alfred Street, Oxford
> United Kingdom, OX1 4EH
> ---------------------------------------------------------
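The mismatch Rob describes can be sketched outside Lucene. Below is a minimal plain-Python illustration (hypothetical helper names, not the Lucene.NET API) of the idea in the thread: a NOT_ANALYZED field stores its value verbatim, while a QueryParser backed by something like StandardAnalyzer typically lower-cases the wildcard term, so the prefix no longer matches; lower-casing the whole field as a single token at both index and query time (as the keyword analyser above does) restores the match.

```python
def full_term_tokenize(text):
    """Like lucFullTermTokenizer: the whole field value is one token."""
    return [text]

def lowercase_keyword_analyze(text):
    """Like lucLowerCaseSingleStringAnalyzer: one token, lower-cased."""
    return [tok.lower() for tok in full_term_tokenize(text)]

def standard_analyze(text):
    """Rough stand-in for StandardAnalyzer: split on whitespace, lower-case."""
    return [w.lower() for w in text.split()]

# Index side: Field.Index.NOT_ANALYZED stores the value verbatim.
indexed_term = "BAUER GMBH"

# Query side: the parser lower-cases the wildcard term by default,
# so "Id:BAUER*" becomes a prefix query on "bauer".
query_prefix = "bauer"

print(indexed_term.startswith(query_prefix))   # False: case differs, 0 hits

# With the single-token lower-casing analyzer at BOTH index and query time:
indexed_lc = lowercase_keyword_analyze(indexed_term)[0]  # "bauer gmbh"
print(indexed_lc.startswith(query_prefix))     # True: prefix now matches
```

This is only a sketch of the term-matching logic, not of Lucene's query execution; the point is that index-time and query-time analysis must agree.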