lucene-java-user mailing list archives

From Milind <mili...@gmail.com>
Subject Re: Can't get case insensitive keyword analyzer to work
Date Tue, 12 Aug 2014 15:36:08 GMT
Thanks Christoph,

So it seems that "tokenized" has been conflated with "analyzed".  I just looked
at the Javadocs and that's what they mention.  I had read it earlier, but it
hadn't registered.  I wonder why it's not called setAnalyzed.  Thanks again.
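
To make the behaviour concrete, here is a toy model in plain Java. It is not
Lucene code: ToyIndex, add, search and countHits are hypothetical names made up
for illustration, and the only thing it assumes is what Christoph describes
below, namely that an untokenized field bypasses the analyzer at index time
while the query text is still analyzed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Toy model only: ToyIndex is a hypothetical stand-in, not a Lucene class.
final class ToyIndex {
    private final UnaryOperator<String> analyzer;
    private final List<String> indexedTerms = new ArrayList<>();

    ToyIndex(UnaryOperator<String> analyzer) {
        this.analyzer = analyzer;
    }

    // Tokenized fields pass through the analyzer; untokenized fields are
    // indexed verbatim as a single term.
    void add(String value, boolean tokenized) {
        indexedTerms.add(tokenized ? analyzer.apply(value) : value);
    }

    // The query text always goes through the analyzer, as the query parser
    // in the thread does.
    int search(String queryText) {
        String term = analyzer.apply(queryText);
        int hits = 0;
        for (String indexed : indexedTerms) {
            if (indexed.equals(term)) {
                hits++;
            }
        }
        return hits;
    }

    // Indexes the two serial numbers from the thread and counts the hits
    // for the lower-case query.
    static int countHits(boolean tokenized) {
        ToyIndex index = new ToyIndex(s -> s.toLowerCase(Locale.ROOT));
        index.add("SN345-B21", tokenized);
        index.add("SN445-B21", tokenized);
        return index.search("sn345-b21");
    }

    public static void main(String[] args) {
        System.out.println("untokenized: " + countHits(false)); // prints "untokenized: 0"
        System.out.println("tokenized:   " + countHits(true));  // prints "tokenized:   1"
    }
}
```

In this sketch the untokenized case indexes "SN345-B21" unchanged, so the
lower-cased query term never matches it, which mirrors the zero hits I was
seeing.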


On Tue, Aug 12, 2014 at 3:07 AM, Christoph Kaser <christoph.kaser@iconparc.de> wrote:

> Hello Milind,
>
> if you don't set the field to be tokenized, no analyzer will be used and
> the field's contents will be stored "as-is", i.e. case sensitive.
> It's the analyzer's job to tokenize the input, so if you use an analyzer
> that does not separate the input into several tokens (like the
> KeywordAnalyzer), your input will remain "untokenized".
>
> Regards
> Christoph
>
> On 12.08.2014 at 03:38, Milind wrote:
>
>> I found the problem.  But it makes no sense to me.
>>
>> If I set the field type to be tokenized, it works.  But if I set it to not
>> be tokenized the search fails.  i.e. I have to pass in true to the method.
>>      theFieldType.setTokenized(storeTokenized);
>>
>> I want the field to be stored as un-tokenized.  But it seems that I don't
>> need to do that.  The LowerCaseKeywordAnalyzer works if the field is
>> tokenized, but not if it's un-tokenized!
>>
>> How can that be?
>>
>>
>> On Mon, Aug 11, 2014 at 1:49 PM, Milind <milindr@gmail.com> wrote:
>>
>>> It does look like the lowercasing is working.
>>>
>>> The following code
>>>
>>>          Document theDoc = theIndexReader.document(0);
>>>          System.out.println(theDoc.get("sn"));
>>>          IndexableField theField = theDoc.getField("sn");
>>>          TokenStream theTokenStream = theField.tokenStream(theAnalyzer);
>>>          System.out.println(theTokenStream);
>>>
>>> produces the following output
>>>      SN345-B21
>>>      LowerCaseFilter@5f70bea5 term=sn345-b21,bytes=[73 6e 33 34 35 2d 62 32 31],startOffset=0,endOffset=9
>>>
>>> But the search does not work.  Anything obvious popping out for anyone?
>>>
>>>
>>> On Sat, Aug 9, 2014 at 4:39 PM, Milind <milindr@gmail.com> wrote:
>>>
>>>> I looked at a couple of examples on how to get a keyword analyzer to be
>>>> case insensitive, but I think I missed something since it's not working
>>>> for me.
>>>>
>>>> In the code below, I'm indexing text in upper case and searching in
>>>> lower case.  But I get back no hits.  Do I need to do something more
>>>> while indexing?
>>>>
>>>>      private static class LowerCaseKeywordAnalyzer extends Analyzer
>>>>      {
>>>>          @Override
>>>>          protected TokenStreamComponents createComponents(String theFieldName, Reader theReader)
>>>>          {
>>>>              KeywordTokenizer theTokenizer = new KeywordTokenizer(theReader);
>>>>              TokenStreamComponents theTokenStreamComponents =
>>>>                  new TokenStreamComponents(
>>>>                          theTokenizer,
>>>>                          new LowerCaseFilter(Version.LUCENE_46, theTokenizer));
>>>>              return theTokenStreamComponents;
>>>>          }
>>>>      }
>>>>
>>>>      private static void addDocment(IndexWriter theWriter,
>>>>                                        String theFieldName,
>>>>                                        String theValue,
>>>>                                        boolean storeTokenized)
>>>>          throws Exception
>>>>      {
>>>>            Document theDocument = new Document();
>>>>            FieldType theFieldType = new FieldType();
>>>>            theFieldType.setStored(true);
>>>>            theFieldType.setIndexed(true);
>>>>            theFieldType.setTokenized(storeTokenized);
>>>>            theDocument.add(new Field(theFieldName, theValue, theFieldType));
>>>>            theWriter.addDocument(theDocument);
>>>>      }
>>>>
>>>>
>>>>      static void testLowerCaseKeywordAnalyzer()
>>>>          throws Exception
>>>>      {
>>>>          Version theVersion = Version.LUCENE_46;
>>>>          Directory theIndex = new RAMDirectory();
>>>>
>>>>          Analyzer theAnalyzer = new LowerCaseKeywordAnalyzer();
>>>>
>>>>          IndexWriterConfig theConfig = new IndexWriterConfig(theVersion, theAnalyzer);
>>>>          IndexWriter theWriter = new IndexWriter(theIndex, theConfig);
>>>>          addDocment(theWriter, "sn", "SN345-B21", false);
>>>>          addDocment(theWriter, "sn", "SN445-B21", false);
>>>>          theWriter.close();
>>>>
>>>>          QueryParser theParser = new QueryParser(theVersion, "sn", theAnalyzer);
>>>>          Query theQuery = theParser.parse("sn:sn345-b21");
>>>>          IndexReader theIndexReader = DirectoryReader.open(theIndex);
>>>>          IndexSearcher theSearcher = new IndexSearcher(theIndexReader);
>>>>          TopScoreDocCollector theCollector = TopScoreDocCollector.create(10, true);
>>>>          theSearcher.search(theQuery, theCollector);
>>>>          ScoreDoc[] theHits = theCollector.topDocs().scoreDocs;
>>>>          System.out.println("Number of results found: " + theHits.length);
>>>>      }
>>>>
>>>> --
>>>> Regards
>>>> Milind
>>>>
>>>>  --
>>> Regards
>>> Milind
>>>
>>>
>>
>
> --
> ------------------------------------------------------------------------
>
> Because individuality is the best standard
>
> Dipl.-Inf. Christoph Kaser
>
> IconParc GmbH
> Sophienstraße 1
> 80333 München
>
> iconparc.de
>
> Tel: +49 - 89- 15 90 06 - 21
> Fax: +49 - 89- 15 90 06 - 19
>
> Management: Dipl.-Ing. Roland Brückner, Dipl.-Inf. Sven Angerer. HRB
> 121830, Munich District Court
>
>


-- 
Regards
Milind
