lucenenet-user mailing list archives

From "Moray McConnachie" <mmcco...@oxford-analytica.com>
Subject RE: Disparity between API usage and Luke
Date Wed, 27 Jun 2012 16:39:02 GMT
I don't have time to write self-contained examples, but here are our
keyword-analyzer-related classes.

Caveat: we are programming against an older version of Lucene.NET, and I
haven't been keeping up with API changes, so this may not work in newer
versions. The principles should be the same, although there may now be
better ways of achieving this - a number of things we "rolled our own"
for with earlier Lucene versions now have better approaches that need
fewer custom classes.
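
For comparison, here is a minimal, untested sketch of what the same thing
might look like with the stock classes that later Lucene.NET versions ship;
treat it as illustrative only and check it against your version:

	// Untested sketch against a newer Lucene.NET API. The stock
	// Lucene.Net.Analysis.KeywordAnalyzer already returns the whole
	// field value as a single case-sensitive token, so it can stand in
	// for lucSingleStringAnalyzer below. For the lower-casing variant,
	// wrap the stock KeywordTokenizer in a LowerCaseFilter instead of
	// writing a custom tokenizer:
	public class LowerCaseKeywordAnalyzer : Lucene.Net.Analysis.Analyzer
	{
		public override Lucene.Net.Analysis.TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
		{
			return new Lucene.Net.Analysis.LowerCaseFilter(new Lucene.Net.Analysis.KeywordTokenizer(reader));
		}
	}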

	/// <summary>
	/// Trivial case-sensitive string analyzer for simple fields.
	/// </summary>
	public class lucSingleStringAnalyzer : Lucene.Net.Analysis.Analyzer
	{
		/// <summary>
		/// instantiate
		/// </summary>
		public lucSingleStringAnalyzer() : base()
		{
		}

		/// <summary>
		/// The worker - simply applies lucFullTermTokenizer to the TextReader
		/// </summary>
		/// <param name="fieldName">Name of the field</param>
		/// <param name="reader">TextReader</param>
		/// <returns>Standard Lucene TokenStream</returns>
		public override Lucene.Net.Analysis.TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
		{
			return new lucFullTermTokenizer(reader);
		}
	}


	/// <summary>
	/// Analyses a field by reading it all as a single string and lower-casing it.
	/// </summary>
	public class lucLowerCaseSingleStringAnalyzer : Lucene.Net.Analysis.Analyzer
	{
		/// <summary>
		/// instantiate
		/// </summary>
		public lucLowerCaseSingleStringAnalyzer() : base()
		{
		}

		/// <summary>
		/// Return a lowercase filter on our custom <see cref="lucFullTermTokenizer">tokenizer</see>, i.e. the whole field is returned as a single lower-case string, just one token.
		/// </summary>
		/// <param name="fieldName">field name of stream</param>
		/// <param name="reader">TextReader</param>
		/// <returns>standard Lucene.NET TokenStream</returns>
		public override Lucene.Net.Analysis.TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
		{
			return new Lucene.Net.Analysis.LowerCaseFilter(new lucFullTermTokenizer(reader));
		}
	}

	/// <summary>
	/// A class to read a full string and return all of it as a single Lucene Token.
	/// </summary>
	/// <remarks>
	/// Simple fields where the whole keyword is the relevant search term, not parts of it (e.g. "United States" should only be indexed as "United States", not under "States" and "United"), can be tokenized with this tokenizer.
	/// </remarks>
	public class lucFullTermTokenizer : Lucene.Net.Analysis.Tokenizer
	{
		/// <summary>
		/// Records whether I have already read everything there is to read
		/// </summary>
		private bool blRead;

		/// <summary>
		/// instantiate
		/// </summary>
		public lucFullTermTokenizer() : base()
		{
			blRead = false;
		}

		/// <summary>
		/// instantiate with text reader
		/// </summary>
		/// <param name="input">The TextReader passed on by Lucene</param>
		public lucFullTermTokenizer(System.IO.TextReader input) : base(input)
		{
			blRead = false;
		}

		/// <summary>
		/// Returns the next Token. This class returns a single Token per field, so Next returns the string value of the field the first time, and null thereafter.
		/// </summary>
		/// <returns>A new Lucene.Net Token, or null if there is nothing left to read</returns>
		public override Lucene.Net.Analysis.Token Next()
		{
			if (!blRead)
			{
				string str = base.input.ReadToEnd();
				blRead = true;
				// Offsets run from the start of the field value to its end;
				// end offsets are exclusive in Lucene, hence str.Length.
				return new Lucene.Net.Analysis.Token(str, 0, str.Length);
			}
			else
			{
				return null;
			}
		}
	}

// AND HERE'S THE EXAMPLE OF THE PERFIELDANALYZERWRAPPER USING THE ABOVE

/// <summary>
/// Module containing generic helping hands for Lucene-related stuff.
/// </summary>
public static class lucUtils
{
	public static Lucene.Net.Analysis.Analyzer lucSpecialAnalyzer
	{
		get
		{
			// Default analyser is standard - in fact we use our own customised
			// Porter stem analyzer here. (Parameterless constructor in our older
			// version; newer StandardAnalyzer constructors take a Version argument.)
			Lucene.Net.Analysis.PerFieldAnalyzerWrapper lucAnalyzer =
				new Lucene.Net.Analysis.PerFieldAnalyzerWrapper(new Lucene.Net.Analysis.Standard.StandardAnalyzer());
			Lucene.Net.Analysis.Analyzer lcKeywordAnalyzer = new lucLowerCaseSingleStringAnalyzer();
			Lucene.Net.Analysis.Analyzer KeywordAnalyzer = new lucSingleStringAnalyzer();
			lucAnalyzer.AddAnalyzer("id", lcKeywordAnalyzer);
			lucAnalyzer.AddAnalyzer("product", KeywordAnalyzer);
			lucAnalyzer.AddAnalyzer("country", lcKeywordAnalyzer);
			return lucAnalyzer;
		}
	}
}


Then we can use the query parser with the same analyser.
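
For example (an illustrative, untested sketch - constructor signatures vary
between Lucene.NET versions, and indexDirectory and the query text here are
made-up placeholders):

	// Untested sketch: share one analyzer between indexing and searching.
	Lucene.Net.Analysis.Analyzer analyzer = lucUtils.lucSpecialAnalyzer;

	// Index time: the wrapper routes each field to its own analyzer.
	var writer = new Lucene.Net.Index.IndexWriter(indexDirectory, analyzer, true);

	// Search time: the parser applies the same per-field analysis, so a
	// term queried against "id" is lower-cased just as it was at index
	// time. (Newer versions of QueryParser also take a Version argument.)
	var parser = new Lucene.Net.QueryParsers.QueryParser("id", analyzer);
	var query = parser.Parse("id:bauer123");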

M.
-----Original Message-----
From: Rob Cecil [mailto:rob.cecil@gmail.com] 
Sent: 27 June 2012 17:07
To: lucene-net-user@lucene.apache.org
Subject: Re: Disparity between API usage and Luke

Moray, thanks - I did catch that and have been thinking about it. I finally
have the LIA book, so some of this stuff is starting to make more sense.
Would you be willing to show your Keyword Analyzer class?

thanks

On Wed, Jun 27, 2012 at 1:57 AM, Moray McConnachie <
mmcconna@oxford-analytica.com> wrote:

> Rob, just in case you missed it in the dialogue earlier, let me
> recommend to your attention the PerFieldAnalyserWrapper mentioned by someone else.
> This allows you to specify different analysers for different fields,
> but presents as a single analyser. So during indexing and searching you
> benefit from the analyser and query parser, and can index and search all
> fields with the analyser - no problem therefore in having fields which
> are not analysed.
>
> For fields like Id we use our own version of keyword analyser which 
> converts to lower case both on index and search but otherwise 
> preserves the term entirely.
>
> The only slight problem is it makes it harder to use tools like Luke 
> which use the standard analyser by default.
>
> Moray
> -------------------------------------
> Moray McConnachie
> Director of IT    +44 1865 261 600
> Oxford Analytica  http://www.oxan.com
>
>
> ----- Original Message -----
> From: Rob Cecil [mailto:rob.cecil@gmail.com]
> Sent: Tuesday, June 26, 2012 06:50 PM
> To: lucene-net-user@lucene.apache.org 
> <lucene-net-user@lucene.apache.org>
> Subject: Disparity between API usage and Luke
>
> If I run a query against my index using QueryParser to query a field:
>
>                var query = _parser.Parse("Id:BAUER*");
>                var topDocs = searcher.Search(query, 10);
>                Assert.AreEqual(count, topDocs.TotalHits);
>
> I get 0 for my TotalHits, yet in Luke the same query phrase yields 15
> results. What am I doing wrong? I use the StandardAnalyzer both to
> create the index and to query.
>
> The field is defined as:
>
> new Field("Id", myObject.Id, Field.Store.YES, 
> Field.Index.NOT_ANALYZED)
>
> and is a string field. The result set back from Luke looks like
> (screencap):
>
> http://screencast.com/t/NooMK2Rf
>
> Thanks!
>


