lucenenet-user mailing list archives

From "Granroth, Neal V." <neal.granr...@thermofisher.com>
Subject RE: [Lucene.Net] Indexing Oddity
Date Mon, 26 Sep 2011 16:01:45 GMT
Yes Tim,

The Standard Analyzer is primarily intended for text documents containing English words and
phrases and certain other common pieces of information such as acronyms, phone numbers, and
email addresses.

The Standard Analyzer may not be the best choice for the formatted, coded information that you
are indexing, as it may or may not split the text at a comma depending on the text that follows.

If you have not seen it before, here is the source for a small command-line example that displays
the tokens produced by a selected analyzer and a given input string. For example:

Standard Analyzer

C:\>ADemo 1 "DNE,APLU,GB11/0290"
[1]: "dne"
[2]: "aplu,gb11/0290"


Whitespace Analyzer

C:\>ADemo 2 "DNE,APLU,GB11/0290"
[1]: "DNE,APLU,GB11/0290"


------------------------------------------------------------------

static void Main(string[] args)
{
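   // Expect two arguments: an analyzer selector (1-4) and the text to analyze.
   if (args.Length < 2)
      return;

   int selector = 0;

   if ( ! Int32.TryParse( args[0], out selector ) )
       return;

   Lucene.Net.Analysis.Analyzer analyzer = null;

   // Reuse the value parsed above rather than parsing args[0] a second time.
   switch( selector )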
   {
      case 4:
         analyzer = new Lucene.Net.Analysis.SimpleAnalyzer();
         break;

      case 3:
         analyzer = new Lucene.Net.Analysis.StopAnalyzer();
         break;

      case 2:
         analyzer = new Lucene.Net.Analysis.WhitespaceAnalyzer();
         break;

      case 1:
      default:
         analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer();
         break;

   }
			
   Lucene.Net.Analysis.TokenStream tks = analyzer.TokenStream(
       new System.IO.StringReader(args[1]) );
				
   int tkNum = 1;
   Lucene.Net.Analysis.Token curToken = tks.Next();
   while(curToken != null)
   {
      System.Console.WriteLine("[{0}]: \"{1}\"",
         tkNum++, curToken.TermText() );

      curToken = tks.Next();
   }
}

------------------------------------------------------------------
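
Tim's observation in the original message (adding a space after each comma makes the
terms index individually) suggests one possible workaround: normalise the commas in the
text before handing it to the analyzer. The following is only a minimal sketch under that
assumption. The class and method names (CommaFieldText, SeparateCommaFields) are
hypothetical and not part of Lucene.Net, and the result should be checked with the token
demo above before relying on it.

```csharp
using System;
using System.Text.RegularExpressions;

static class CommaFieldText
{
    // Hypothetical helper (not a Lucene.Net API): makes sure every comma is
    // followed by exactly one space, so comma-joined fields become separate words.
    public static string SeparateCommaFields(string text)
    {
        if (text == null) throw new ArgumentNullException("text");

        // Replace a comma, plus any spaces or tabs that follow it, with ", ".
        return Regex.Replace(text, @",[ \t]*", ", ");
    }
}

static class Program
{
    static void Main()
    {
        // Prints: DNE, APLU, GB11/0290
        Console.WriteLine(CommaFieldText.SeparateCommaFields("DNE,APLU,GB11/0290"));
    }
}
```

The normalised string would be passed as the field text in place of the raw input. A
custom analyzer that treats the comma as a separator is the other option, but that is
not shown here.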

- Neal

-----Original Message-----
From: Tim Haughton [mailto:timhaughton@gmail.com] 
Sent: Monday, September 26, 2011 7:41 AM
To: lucene-net-user@lucene.apache.org
Subject: Re: [Lucene.Net] Indexing Oddity

        internal static void AddToContentIndex(EDMDocument document, string fullText)
        {
            lock (contentMutex)
            {
                IndexWriter writer = null;
                try
                {
                    EnsureContentIndexIsUnlocked();

                    // Add content
                    var contentIndexFolder = new FileInfo(App.ContentIndexFolder);
                    writer = new IndexWriter(contentIndexFolder, new StandardAnalyzer(), false);
                    writer.SetUseCompoundFile(true);

                    var contentDoc = new Document();
                    contentDoc.Add(new Field("content", fullText,
                                             Field.Store.NO, Field.Index.TOKENIZED));
                    contentDoc.Add(new Field("documentID", document.DocumentID,
                                             Field.Store.YES, Field.Index.UN_TOKENIZED));

                    writer.AddDocument(contentDoc);
                    writer.Optimize();
                }
                catch (Exception exception)
                {
                    log.Error("Problem adding document to content index.", exception);
                }
                finally
                {
                    if (writer != null)
                    {
                        writer.Close();
                    }
                }
            }
        }

Cheers,

Tim


On 26 September 2011 13:37, Itamar Syn-Hershko <itamar@code972.com> wrote:

> No, you are probably using KeywordAnalyzer
>
> What is your indexing code?
>
> On Mon, Sep 26, 2011 at 3:28 PM, Tim Haughton <timhaughton@gmail.com> wrote:
>
> > Hi, I'm trying to index a text file containing the following text:
> >
> > DNE,APLU,GB11/0290
> > DNE,CMDU,11-1431
> > DNE,EGLV,NO CONTRACT
> > DNE,HJSC,ANE112376
> > DNE,HLCU,NO CONTRACT
> > DNE,MAEU,547712
> > DNE,MOLU,NO CONTRACT
> > DNE,OOLU,AE115029
> >
> > It appears that each "line" is being indexed as one complete string, rather
> > than at least 3 terms. So if I search for "547712" I get no results. But if
> > I search for "DNE,MAEU,547712" I find the document. If I add a space after
> > each comma it indexes them as individual terms.
> >
> > Is this expected behaviour using the StandardAnalyzer?
> >
> > Cheers,
> >
> > Tim
> >
>
