lucene-java-user mailing list archives

From Sergiu Gordea <gser...@ifit.uni-klu.ac.at>
Subject Re: Null or no analyzer
Date Wed, 20 Oct 2004 19:15:14 GMT
Rupinder Singh Mazara wrote:

>hi
>
>the basic problem here is that there are data sources which contain
>a) id, b) text, c) title, d) authors and e) subject heading
>  
>
>text, title and authors need to be tokenized
>
>the subject heading can be one or more words,
>  
>
The subject must also be tokenized; otherwise you cannot get any 
results that don't match the Term exactly.

So, for example, let's assume you have the following titles:
"George Trash Elections"
"George Trash"

If you search for "George Trash" and your title is not tokenized, you 
will get just the second document (I hope I'm not mistaken here; 
anyway, it can easily be tested).
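
To make the difference concrete, here is a minimal sketch in plain Java. It does not use Lucene at all; it only imitates the two behaviours. The class and method names are hypothetical, and the whitespace-plus-lowercase tokenizer is an assumption standing in for whatever analyzer you actually use.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class KeywordVsTokenizedSketch {

    // Untokenized field: the whole value is a single term,
    // so only a query identical to the full value matches.
    static boolean keywordMatch(String fieldValue, String query) {
        return fieldValue.equals(query);
    }

    // Assumed analyzer: split on whitespace and lowercase each token.
    static Set<String> tokens(String text) {
        return new HashSet<>(
                Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+")));
    }

    // Tokenized field (simplified): match if every query token occurs in the field.
    static boolean tokenizedMatch(String fieldValue, String query) {
        return tokens(fieldValue).containsAll(tokens(query));
    }

    public static void main(String[] args) {
        String query = "George Trash";
        for (String title : new String[] {"George Trash Elections", "George Trash"}) {
            System.out.println(title
                    + " -> keyword: " + keywordMatch(title, query)
                    + ", tokenized: " + tokenizedMatch(title, query));
        }
    }
}
```

With the untokenized comparison only the title that equals the query is found; with the tokenized comparison both titles are found.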

>anyone searching such a data source is expected to know the subject headings,
>if the user is trying to find all articles that have the phrases
>"John Kerry" and "George Bush" as well as that are classified as "Election
>2004"
>it is possible that there are other documents that are classified as "Nation
>Service Records"
>or "Tax Returns" etc...
>  
>
How is this represented in the GUI: as a select box or as an input field?
If it is a select box and each subject is a unique domain concept, 
you can use a non-tokenized string, or even a numerical
representation, but I don't think that is your case.
In the case of input fields, again I suggest that you tokenize the string.

>so the object is to find documents that have the above mentioned phrases as
>well as one
>of the subject classifiers, so as to pull out the most meaningful
>documents
>
>  
>
No problem. Once again, use
+subject:"my searched subject"
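
As a rough sketch of how such a query string could be put together for the case described above (two phrases in the text plus one subject heading), here is a small helper in plain Java. The class and method names are hypothetical, the field names are taken from the field list in the original message, and the backslash escaping of quotes is an assumption about the query parser syntax that you should check against your Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

public class RequiredPhraseQuerySketch {

    // Build one required phrase clause, e.g. +subject:"Election 2004".
    // Assumption: backslashes and double quotes inside the phrase are backslash-escaped.
    static String requiredPhrase(String field, String phrase) {
        String escaped = phrase.replace("\\", "\\\\").replace("\"", "\\\"");
        return "+" + field + ":\"" + escaped + "\"";
    }

    // Join several {field, phrase} pairs into one query string; every clause is required.
    static String allRequired(String[][] fieldPhrasePairs) {
        List<String> clauses = new ArrayList<>();
        for (String[] pair : fieldPhrasePairs) {
            clauses.add(requiredPhrase(pair[0], pair[1]));
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        String query = allRequired(new String[][] {
            {"text", "John Kerry"},
            {"text", "George Bush"},
            {"subject", "Election 2004"},
        });
        System.out.println(query);
    }
}
```

The resulting string would then be handed to the query parser, which analyzes each phrase with the analyzer configured for the search.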

>the subject classifiers pertain to domain knowledge, and it is possible that
>2 or more
>subject classification headings are composed of the same set of words, but
>the sequence
>in which they appear can drastically alter the meaning hence tokenizing the
>subject field
>is not exactly a healthy solution.
>  
>
Tokenization doesn't change the word order; if you use a 
PhraseQuery you will get the correct results.

+title:"George Bush"
doesn't return documents with the title
"Bush George"

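The reason word order survives is that each token keeps its position, and a phrase match requires the query tokens at consecutive positions in the same order. Here is a minimal plain-Java sketch of that idea. It is not Lucene's PhraseQuery implementation; the names are hypothetical and the whitespace-plus-lowercase tokenizer is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PhraseOrderSketch {

    // Assumed analyzer: whitespace split plus lowercasing.
    // The list index of each token serves as its position.
    static List<String> tokenize(String text) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        for (String token : trimmed.split("\\s+")) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // True if the phrase tokens occur at consecutive positions, in order, in the field.
    static boolean phraseMatch(String fieldValue, String phrase) {
        List<String> doc = tokenize(fieldValue);
        List<String> query = tokenize(phrase);
        if (query.isEmpty()) {
            return false;
        }
        for (int start = 0; start + query.size() <= doc.size(); start++) {
            if (doc.subList(start, start + query.size()).equals(query)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(phraseMatch("George Bush", "George Bush"));
        System.out.println(phraseMatch("Bush George", "George Bush"));
    }
}
```

Two subject headings made of the same words in a different order therefore stay distinguishable under a phrase search, even though the field is tokenized.
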
>also such search tools are meant for people who know / understand  this
>classification system
>  
>
:)) It is a general truth that the results are better when people 
know what they are searching for :)

>Taxonomy of animals can be taken as one such example,
>
>hope this helps define the problem
>
>
>  
>
I cannot see anything special in your problem.
Before starting to implement a complex solution, it is probably better 
to give the simple one a chance.
I assure you that you won't lose anything, and even if you decide to 
implement a complex solution you will have
a lot of reusable code.

 so ... Have fun,

  Sergiu

PS: If you have an example that produces a false positive, please 
send us the case.


>
>
>  
>
>>I still don't understand what is wrong with the Idea of indexing the
>>title in a separate field and searching with a Phrase query
>>+title:"Elections 2004" ?
>>I think that the real problem is that the title is not tokenized and the
>>title contains more than "Elections 2004"
>>
>>I think it is worth giving this solution a try.
>>
>>Or maybe I don't understand the problem correctly ...
>>
>>All the best,
>>
>>Sergiu
>>
>>
>>
>>
>>
>>    
>>
>>>      
>>>
>>>>Aviran
>>>>http://aviran.mordos.com
>>>>
>>>>-----Original Message-----
>>>>From: Morus Walter [mailto:morus.walter@tanto.de]
>>>>Sent: Wednesday, October 20, 2004 2:25 AM
>>>>To: Lucene Users List
>>>>Subject: RE: Null or no analyzer
>>>>
>>>>
>>>>Aviran writes:
>>>>
>>>>        
>>>>
>>>>>You can use WhitespaceAnalyzer
>>>>>
>>>>>          
>>>>>
>>>>Can he? If "Elections 2004" is one token in the subject field (keyword),
>>>>this will fail, since WhitespaceAnalyzer will tokenize that to
>>>>`Elections'
>>>>and `2004'.
>>>>So I guess he has to write an identity analyzer himself unless there
>>>>is one
>>>>provided (which doesn't seem to be the case). The only alternatives
>>>>are not
>>>>using query parser or extending query parser for a key word syntax,
>>>>as far
>>>>as I can see.
>>>>
>>>>
>>>>
>>>>
>>>>---------------------------------------------------------------------
>>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org


