lucene-java-user mailing list archives

From "Gordin, Ira" <>
Subject How to search code files for words that contain a given substring?
Date Tue, 26 Jun 2018 07:57:42 GMT
Hi all,
I have started working on a project that currently searches code files for words that
contain a given substring.
It currently uses WhitespaceTokenizer and a regex query that wraps the searched substring
in '.*'.
For example, if one searches for 'a', the query will be '/.*a.*/'. In this way, in the text
'Mama loves banana', it will find the tokens 'Mama' and 'banana'.
I now need the start and end positions of the matched tokens within their line, along with
the line number.
With TokenStream I can get the start and end positions of 'Mama' and 'banana' in the full
text. But I need the positions of 'a'.
I see 2 options.
Option 1: perform an additional search within each returned token.
Option 2: use NGramTokenizer or NGramTokenFilter (I am not sure which), in the hope that
the TokenStream will then give me the positions of 'a'.
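
Option 1 can be sketched in plain Java, without any Lucene calls. The sketch
assumes the token text and its start offset in the full text have already been
read from the TokenStream. The class name SubstringOffsets and the method
occurrences are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SubstringOffsets {

    /**
     * Finds every occurrence of needle inside one token and returns
     * {start, end} pairs (end exclusive) as offsets in the full text.
     * tokenStart is the token's start offset in the full text (assumed
     * to come from the token stream).
     */
    static List<int[]> occurrences(String token, int tokenStart, String needle) {
        List<int[]> spans = new ArrayList<>();
        if (needle.isEmpty()) {
            return spans; // an empty needle has no meaningful position
        }
        int from = 0;
        while (true) {
            int hit = token.indexOf(needle, from);
            if (hit < 0) {
                break;
            }
            spans.add(new int[] {tokenStart + hit, tokenStart + hit + needle.length()});
            from = hit + 1; // advance by one so overlapping matches are reported too
        }
        return spans;
    }

    public static void main(String[] args) {
        // In "Mama loves banana" the token "banana" starts at offset 11.
        for (int[] span : occurrences("banana", 11, "a")) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

For the token 'banana' at offset 11 this reports the three occurrences of 'a'
at 12-13, 14-15 and 16-17. The search is case-sensitive; whether that matches
the behaviour of the regex query depends on how the index is analyzed.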
An additional question: how can I get the line numbers and the positions inside the line?
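
One possible way to get a line number and an in-line position is to convert an
absolute character offset after the search, again in plain Java. The sketch
below assumes the full file text is available as a String and that lines are
separated by '\n'. LineIndex and locate are hypothetical names, and both
returned values are 0-based.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LineIndex {

    // Offset of the first character of each line, in ascending order.
    private final int[] lineStarts;

    LineIndex(String text) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0); // the first line always starts at offset 0
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                starts.add(i + 1); // the next line starts right after the newline
            }
        }
        lineStarts = starts.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Returns {line, column}, both 0-based, for an absolute character offset. */
    int[] locate(int offset) {
        int idx = Arrays.binarySearch(lineStarts, offset);
        // When the offset is not itself a line start, binarySearch returns
        // -(insertionPoint) - 1; the containing line is insertionPoint - 1.
        int line = idx >= 0 ? idx : -idx - 2;
        return new int[] {line, offset - lineStarts[line]};
    }

    public static void main(String[] args) {
        LineIndex index = new LineIndex("Mama loves\nbanana");
        int[] pos = index.locate(12); // the first 'a' of "banana"
        System.out.println("line " + pos[0] + ", column " + pos[1]);
    }
}
```

Building the index once per file and using a binary search keeps each lookup
cheap when a file has many matches. With '\r\n' line endings the columns stay
correct, because the '\r' sits at the end of the preceding line.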
Many thanks in advance for your help,
