lucene-java-user mailing list archives

From "Rupinder Singh Mazara" <>
Subject RE: Null or no analyzer
Date Wed, 20 Oct 2004 15:57:44 GMT

the basic problem here is that there are data sources which contain
a) id, b) text, c) title, d) authors and e) subject heading

text, title and authors need to be tokenized

the subject heading can be one or more words, and
anyone searching such a data source is expected to know the subject headings.
if the user is trying to find all articles that contain the phrases
"John Kerry" and "George Bush" and that are also classified as "Election
it is possible that there are other documents that are classified as "Nation
Service Records"
or "Tax Returns" etc...

so the objective is to find documents that contain the above-mentioned
phrases as well as one of the subject classifiers, so as to pull out the
most meaningful ones
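
a minimal sketch of that requirement in plain Java (no Lucene calls), assuming the subject heading is stored as one untokenized value and compared exactly. the Doc class, the search method and the sample documents are made up for illustration; "Elections 2004" is borrowed from the replies quoted below.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class SubjectFilterSketch {

    /** Hypothetical stand-in for an indexed document; not a Lucene class. */
    static final class Doc {
        final String id;
        final String text;
        final String subject; // kept as one untokenized value

        Doc(String id, String text, String subject) {
            this.id = id;
            this.text = text;
            this.subject = subject;
        }
    }

    /**
     * Returns the ids of documents whose text contains every phrase
     * (case-insensitive) and whose subject equals the heading exactly.
     */
    static List<String> search(List<Doc> docs, List<String> phrases, String subject) {
        List<String> hits = new ArrayList<>();
        for (Doc d : docs) {
            String text = d.text.toLowerCase(Locale.ROOT);
            boolean allPhrases = true;
            for (String p : phrases) {
                if (!text.contains(p.toLowerCase(Locale.ROOT))) {
                    allPhrases = false;
                    break;
                }
            }
            // The subject is matched as a whole string, never word by word.
            if (allPhrases && d.subject.equals(subject)) {
                hits.add(d.id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
            new Doc("1", "John Kerry debated George Bush last night", "Elections 2004"),
            new Doc("2", "John Kerry and George Bush released filings", "Tax Returns"),
            new Doc("3", "George Bush spoke alone", "Elections 2004"));
        // Only document 1 has both phrases and the requested heading.
        System.out.println(search(docs, Arrays.asList("John Kerry", "George Bush"), "Elections 2004"));
    }
}
```

document 2 contains both phrases but is filed under a different heading, so it is left out; that is the narrowing effect the subject field is meant to give.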

the subject classifiers pertain to domain knowledge, and it is possible that
two or more subject classification headings are composed of the same set of
words, but the sequence in which they appear can drastically alter the
meaning; hence tokenizing the subject field is not exactly a healthy solution.
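
to illustrate the concern, here is a small plain-Java sketch (not Lucene code) comparing two treatments of a heading: splitting it on whitespace into an unordered set of terms, versus keeping the whole heading as a single term. the two headings are hypothetical, and the sketch assumes the tokenized field is matched term by term rather than with a phrase query.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class KeywordVsTokenizedSketch {

    /** Whitespace tokenization into an unordered term set: word order is lost. */
    static Set<String> tokenSet(String heading) {
        return new HashSet<>(Arrays.asList(heading.trim().split("\\s+")));
    }

    /** "Identity" treatment: the whole heading stays a single term. */
    static String keyword(String heading) {
        return heading;
    }

    public static void main(String[] args) {
        String a = "Service Records";   // hypothetical heading
        String b = "Records Service";   // hypothetical heading, same words reordered

        // Same words, so the two term sets cannot be told apart.
        System.out.println(tokenSet(a).equals(tokenSet(b)));
        // As whole values the two headings remain distinct.
        System.out.println(keyword(a).equals(keyword(b)));
    }
}
```

with term-by-term matching the two headings collapse into the same thing, while the untokenized form keeps them apart.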

also, such search tools are meant for people who know and understand this
classification system.
the taxonomy of animals can be taken as one such example.

hope this helps define the problem

>I still don't understand what is wrong with the idea of indexing the
>title in a separate field and searching with a phrase query
>+title:"Elections 2004" ?
>I think that the real problem is that the title is not tokenized and the
>title contains more than "Elections 2004".
>I think it is worth giving this solution a try.
>Or maybe I don't understand the problem correctly ...
>All the best,
> Sergiu
>>> Aviran
>>> -----Original Message-----
>>> From: Morus Walter []
>>> Sent: Wednesday, October 20, 2004 2:25 AM
>>> To: Lucene Users List
>>> Subject: RE: Null or no analyzer
>>> Aviran writes:
>>>> You can use WhitespaceAnalyzer
>>> Can he? If "Elections 2004" is one token in the subject field (keyword),
>>> this will fail, since WhitespaceAnalyzer will tokenize that to `Elections'
>>> and `2004'.
>>> So I guess he has to write an identity analyzer himself unless there is one
>>> provided (which doesn't seem to be the case). The only alternatives are not
>>> using the query parser or extending the query parser for a keyword syntax,
>>> as far as I can see.

To unsubscribe, e-mail:
For additional commands, e-mail:
