lucene-java-user mailing list archives

From "Dirk Rothe" <d.ro...@semantics.de>
Subject migrating custom analyzer/tokenizer (3.6-> 6.x)
Date Thu, 08 Sep 2016 20:02:23 GMT
Hi,

I'm trying to migrate some Analyzers from the 3.6 API to 6.2, and I'm not sure
whether I have the right approach. I'm using PyLucene, so let's assume this is
pseudo-code.

In 3.x (and up to 4), I had access to the StringReader containing the
data in the overridden tokenStream(fieldName, reader):

class TokenStream3(PythonTokenStream):
     def __init__(self, reader):
         self.data = DATA_FROM_READER(reader)
         self.i = 0
         # prepare termAtt/offsetAtt/posIncrAtt and other helpers

     def incrementToken(self):
         if self.i == len(self.data):
             return False
         # stuff from self.data into termAtt/offsetAtt/posIncrAtt
         self.i += 1
         return True

class Analyzer3(PythonAnalyzer):
     def tokenStream(self, fieldName, reader):
         return TokenStream3(reader)
-----

In 5.x/6.x, the only approach I've found relies on some ugly
indirection: capture the active reader in Analyzer.initReader() and
access it via a callback in the Tokenizer.

class Tokenizer6(PythonTokenizer):
     def __init__(self, getReader):
         # callable for retrieving current reader
         self.getReader = getReader
         self.i = 0
         self.data = None

     def incrementToken(self):
         if self.i == 0:
             self.data = DATA_FROM_READER(self.getReader())
         if self.i == len(self.data):
             # we are reused - reset
             self.i = 0
             return False
         # stuff from self.data into termAtt/offsetAtt/posIncrAtt
         self.i += 1
         return True

class Analyzer6(PythonAnalyzer):
     def createComponents(self, fieldName):
         return Analyzer.TokenStreamComponents(Tokenizer6(lambda: self._reader))

     def initReader(self, fieldName, reader):
         # capture reader
         self._reader = reader
         return reader
-----
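
The reuse behaviour that the 6.x pseudo-code above depends on can be modelled without Lucene. What follows is a minimal plain-Python sketch, not PyLucene code: FakeTokenizer, FakeAnalyzer, token_stream() and tokens() are hypothetical stand-ins, whitespace splitting stands in for DATA_FROM_READER, and the per-field cache is only an assumption about how components get reused.

```python
import io


class FakeTokenizer:
    """Hypothetical stand-in for Tokenizer6: fetches the reader via a callback."""

    def __init__(self, get_reader):
        self.get_reader = get_reader  # callable returning the current reader
        self.i = 0
        self.data = None
        self.term = None  # stands in for termAtt

    def increment_token(self):
        if self.i == 0:
            # First call for this document: consume the captured reader.
            # Whitespace splitting is a placeholder for DATA_FROM_READER.
            self.data = self.get_reader().read().split()
        if self.i == len(self.data):
            # The instance is reused, so rewind for the next document.
            self.i = 0
            return False
        self.term = self.data[self.i]
        self.i += 1
        return True


class FakeAnalyzer:
    """Hypothetical stand-in for Analyzer6: builds components once per field."""

    def __init__(self):
        self._reader = None
        self._components = {}  # field name -> cached tokenizer (assumed reuse)

    def create_components(self, field_name):
        return FakeTokenizer(lambda: self._reader)

    def init_reader(self, field_name, reader):
        self._reader = reader  # capture the reader for the callback
        return reader

    def token_stream(self, field_name, reader):
        # Reuse the cached tokenizer; only the captured reader changes.
        if field_name not in self._components:
            self._components[field_name] = self.create_components(field_name)
        self.init_reader(field_name, reader)
        return self._components[field_name]


def tokens(analyzer, field_name, text):
    """Drain the stream for one document and return its terms."""
    stream = analyzer.token_stream(field_name, io.StringIO(text))
    out = []
    while stream.increment_token():
        out.append(stream.term)
    return out


analyzer = FakeAnalyzer()
first = tokens(analyzer, "body", "alpha beta")
second = tokens(analyzer, "body", "gamma")
```

In this model the second call goes through the same FakeTokenizer instance and still sees only the second document, because the index is rewound when the stream is exhausted. A consumer that stops before exhaustion would leave self.i non-zero, so the next document would be read from stale data; that is the fragile part of the pattern.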

Is this sane?

--dirk
