lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Zhao, Xin" <xzh...@jhmi.edu>
Subject Re: controlled vocabulary
Date Fri, 25 Aug 2006 15:49:26 GMT
Hi, Rupinder,
Our algorithm is a little different from what PubMed does. We have scoring 
for each mesh term, which will affect the search result.
What do you think the difference would be for these two:
document.addField(Field.Keyword("mesh", "xxxx"));
and
document.addField( new Field( "mesh", "xxxx", Field.Store.YES , 
Field.Index.TOKENIZED );

Thank you,
Xin



----- Original Message ----- 
From: "Rupinder Singh Mazara" <rmazara@masterfile.com>
To: <java-user@lucene.apache.org>
Sent: Friday, August 25, 2006 11:27 AM
Subject: Re: controlled vocabulary


> Hi Xin
>
>   then perhaps you can change it to Field.Index.TOKENIZED, but i was not 
> aware that pubmed boosts mesh terms, they broadly classify terms as major 
> and minor, if you plan to use this simple system of classification 
> consider adding the major terms twice to the document ?
>
> Zhao, Xin wrote:
>> Hi, Rupinder,
>> My understanding is Field.Index.NO_NORMS disables  index-time boosting 
>> and field length normalization at the same time. But I do need index-time 
>> boosting to store the scoring of each mesh term. Have I missed anything?
>> Thank you very much for your help,
>> Xin
>>
>> ----- Original Message ----- From: "Rupinder Singh Mazara" 
>> <rmazara@masterfile.com>
>> To: <java-user@lucene.apache.org>
>> Sent: Friday, August 25, 2006 10:49 AM
>> Subject: Re: controlled vocabulary
>>
>>
>>> hi Xin
>>>
>>>  this is take a look at this you can add multiple fields with the name 
>>> mesh
>>> for ( i=0; i< meshList.size() ; i++ ){
>>>    meshTerm = meshList.get(i)
>>>  document.addField( new Field( "mesh", meshTerm.semanticWebConceptId, 
>>> Field.Store.YES , Field.Index.NO_NORMS  );
>>> }
>>>
>>>  when querying this index, create a analyzer that infers the text string 
>>> and generates id's that correspond to the mesh term in the semantic web
>>>
>>>
>>>
>>> Zhao, Xin wrote:
>>>> Hi,
>>>> Thank you for your reply. I had thought about the first two solutions 
>>>> before. If we apply one doc for each MeSH term, it would be 26 docs for 
>>>> each item digested(we actually need the top 25 MeSH terms generated), 
>>>> would it be any problem if there are too many documents? If we apply 
>>>> field name like "mesh_1", "mesh_2"..., when it comes to search, we will 
>>>> have to generate a loop for each single one of the query terms( there 
>>>> will be more than 20-30 terms on average, since we are using sematic 
>>>> web to implement concept search), do you think it would affect the 
>>>> performance in a very bad way?
>>>> Regards,
>>>> Xin
>>>>
>>>>
>>>> ----- Original Message ----- From: "Dedian Guo" <gdedian@gmail.com>
>>>> To: <java-user@lucene.apache.org>; "Zhao, Xin" <xzhao@jhu.edu>
>>>> Sent: Thursday, August 24, 2006 4:22 PM
>>>> Subject: Re: controlled library
>>>>
>>>>
>>>>> in my solution, you can apply one doc for each mesh term, or apply 
>>>>> different
>>>>> keyword such as "mesh_1"...."mesh_10" for your top 10 terms...or u can

>>>>> group
>>>>> your mesh terms as one string then add into a field, which requires a

>>>>> simple
>>>>> string parser for the group string when you wanna read the terms...
>>>>>
>>>>> not sure if that works or answers your question...
>>>>>
>>>>> On 8/24/06, Zhao, Xin <xzhao9@jhmi.edu> wrote:
>>>>>>
>>>>>> Hi,
>>>>>> I have a design question. Here is what we try to do for indexing:
>>>>>> We designed an indexing tool to generate standard MeSH terms from

>>>>>> medical
>>>>>> citations, and then use Lucene to save the terms and citations for

>>>>>> future
>>>>>> search. The information we need to save are:
>>>>>> a) the exact mesh terms (top 10)
>>>>>> b) the score for each term
>>>>>> so the codings are like
>>>>>> -----------------------------------
>>>>>> for the top 10 MeSH Terms
>>>>>> myField=Field.Keyword("mesh", mesh.toLowerCase());
>>>>>> myField.setBoost(score);
>>>>>> doc.add(myFiled);
>>>>>> end for
>>>>>> ------------------------------------
>>>>>> as you could see we generate all the terms under named field "mesh".

>>>>>> If I
>>>>>> understand correctly, all the fields under the same name would
>>>>>> eventually  save into one field, with all the scores be normalized

>>>>>> into
>>>>>> filed boost. In this case, we wouldn't be able to save separate 
>>>>>> score, so
>>>>>> the information is lost. Am I correct? Is there anyway we could 
>>>>>> change it? I
>>>>>> understand Lucene is for keyword search, and what we try to do is

>>>>>> Controlled
>>>>>> Vocabulary search, Any other tool we could use?
>>>>>>
>>>>>> Thank you,
>>>>>> Xin
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message