lucene-java-user mailing list archives

From Chris Tomlinson <>
Subject analyzer context during search
Date Wed, 11 Apr 2018 13:33:56 GMT

I’m working on a project where it would be most helpful for getWrappedAnalyzer() in an extension
to DelegatingAnalyzerWrapper to have access to more than just the fieldName.

The scenario is that we are working with several languages (Tibetan, Sanskrit and Chinese),
each of which has several encodings, e.g., Simplified Chinese (zh-hans), Traditional Chinese
(zh-hant), Pinyin with diacritics (zh-pinyin) and Pinyin without diacritics (zh-pinyin-ndia).
Our data comes from many sources, each of which uses a variety of encodings, and we wish to
preserve the original encodings used in the data.

For Chinese, for example, we have an analyzer that creates a TokenStream of Pinyin with diacritics
for any of the input encodings. Thus it is possible in some situations to retrieve documents
regardless of whether they were originally input as zh-hans, zh-hant, and so on.

The same applies to the other languages.

One objective is to allow the user to input a query in zh-pinyin, for example, and to retrieve
documents that were originally indexed in any of the variant encodings.

The current scheme, in Apache Jena + Lucene, is to create a fieldName that includes the original
name plus a language tag, e.g., label_zh-hans, so that getWrappedAnalyzer() can retrieve
a registered analyzer for zh-hans, which then indexes using Pinyin tokens as mentioned above.
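
To make the current scheme concrete, here is a minimal sketch of the field-name lookup. It is
plain Java with no Lucene dependency: the class name, the registry contents and the analyzer
names are all hypothetical, and strings stand in for the Analyzer instances that our real
getWrappedAnalyzer(String fieldName) override would return.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper illustrating the "label_<lang-tag>" convention described above.
final class FieldLanguageTags {

    // Assumed registry: language tag -> name of the analyzer registered for it.
    // The analyzer names are placeholders, not real Lucene or Jena classes.
    private static final Map<String, String> REGISTRY = new LinkedHashMap<>();
    static {
        REGISTRY.put("zh-hans", "hans-to-pinyin");
        REGISTRY.put("zh-hant", "hant-to-pinyin");
        REGISTRY.put("zh-pinyin", "pinyin-passthrough");
        REGISTRY.put("zh-pinyin-ndia", "pinyin-ndia-to-pinyin");
    }

    // Extracts the language tag that follows the last underscore, if there is one.
    static Optional<String> languageTag(String fieldName) {
        int idx = fieldName.lastIndexOf('_');
        if (idx < 0 || idx == fieldName.length() - 1) {
            return Optional.empty();
        }
        return Optional.of(fieldName.substring(idx + 1));
    }

    // Mirrors what the wrapper does: the field name is the only input available.
    static String analyzerFor(String fieldName, String defaultAnalyzer) {
        return languageTag(fieldName).map(REGISTRY::get).orElse(defaultAnalyzer);
    }
}
```

Note that analyzerFor() receives nothing but the field name, which is exactly the limitation
of getWrappedAnalyzer() that this message is about.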

For Chinese, we end up with documents that have four different fields: label_zh-hans, label_zh-hant,
label_zh-pinyin, and label_zh-pinyin-ndia. This way, at indexing time we know which input encoding
was used and can choose an appropriate analyzer configuration, since the analyzer has
to be aware of the incoming encoding.

At search time we could try a search like:

    (label_zh-hans:a-query-in-pinyin OR label_zh-hant:a-query-in-pinyin
     OR label_zh-pinyin:a-query-in-pinyin OR label_zh-pinyin-ndia:a-query-in-pinyin)

But this cannot work, because the information that the query is in zh-pinyin is not available
to getWrappedAnalyzer(). Only the original encoding is available, as part of the field
name, so there is no way to know that the query string is in zh-pinyin and to tokenize it
correctly when querying the other fields.
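
For reference, the expanded query shown earlier could be generated mechanically. The following
is a minimal sketch in plain Java; the class and method names are hypothetical, and no escaping
of special query-parser characters is attempted.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that expands one query string over every per-encoding field.
final class MultiFieldQuery {

    // Builds "(base_tag1:text OR base_tag2:text ...)" for the given language tags.
    static String expand(String baseField, List<String> tags, String queryText) {
        return tags.stream()
                .map(tag -> baseField + "_" + tag + ":" + queryText)
                .collect(Collectors.joining(" OR ", "(", ")"));
    }

    public static void main(String[] args) {
        List<String> tags = List.of("zh-hans", "zh-hant", "zh-pinyin", "zh-pinyin-ndia");
        System.out.println(expand("label", tags, "a-query-in-pinyin"));
    }
}
```

The encoding of the query text itself (zh-pinyin here) appears nowhere in the generated string,
so each clause reaches the analyzer wrapper carrying only its field name.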

I’m probably over-thinking things, but it seems to me that what I need is a way of accessing
additional context when choosing an analyzer, so that the information that the query string
is in pinyin is available along with the various field names as usual.

I don’t see how a custom query analyzer would help here. We would know that the call
to the analyzer wrapper was for querying rather than indexing, but we still would not know
the encoding of the query as distinct from the encoding implied by the field name.

I imagine this sort of scenario has been solved by others numerous times, but I’m stumped
as to how to implement it.

Thanks in advance for any help,
