lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Dedian Guo" <gded...@gmail.com>
Subject Re: controlled vocabulary
Date Fri, 25 Aug 2006 21:37:58 GMT
Hi, Xin, in my understanding , the document in Lucene is a term of
collection of fields, while a field is pair of keyword and value, tough it
can be indexed or stored or both. That is plain structure. if you wanna
index a deep tree structure such as complex objects and keep those
relationship inside, i guess we need do some tricky of that. so in my
mentioned solution, i will do something on the keyword of a document(here, a
document represent a object...) . the score problem you mentioned in your
question is similar, i mean, score is actually an attribute of mesh object,
so u wanna index the information which has a tree-like structure (i met the
similar problem when i indexing xml-based pages. esp. for those have lots of
deep element nodes, deep index needed for deep searching).

correct me if i was wrong or there are some better solutions...

On 8/25/06, Zhao, Xin <xzhao9@jhmi.edu> wrote:
>
> now. i have a second thought about one meah term per document. the scoring
> formula(hits too) is based on document, right? does it mean that we
> shouldn't have more than one document for each object indexed?
> for example, i try to index a publication,  for some of the information,
> like title, abstract i would like to store and index them using default
> similarity, while the other information i would like to use customized
> similarity. i probably should use a different indexing directory and
> writer
> instead of two documents in the same index, right?
> thank you for helping me. you could see that i am in the early learning
> stage now.
> xin
>
>
>
> ----- Original Message -----
> From: "Zhao, Xin" <xzhao9@jhmi.edu>
> To: <java-user@lucene.apache.org>
> Sent: Friday, August 25, 2006 10:21 AM
> Subject: Re: controlled vocabulary
>
>
> > Hi,
> > Thank you for your reply. I had thought about the first two solutions
> > before. If we apply one doc for each MeSH term, it would be 26 docs for
> > each item digested(we actually need the top 25 MeSH terms generated),
> > would it be any problem if there are too many documents? If we apply
> field
> > name like "mesh_1", "mesh_2"..., when it comes to search, we will have
> to
> > generate a loop for each single one of the query terms( there will be
> more
> > than 20-30 terms on average, since we are using sematic web to implement
> > concept search), do you think it would affect the performance in a very
> > bad way?
> > Regards,
> > Xin
> >
> >
> > ----- Original Message -----
> > From: "Dedian Guo" <gdedian@gmail.com>
> > To: <java-user@lucene.apache.org>; "Zhao, Xin" <xzhao@jhu.edu>
> > Sent: Thursday, August 24, 2006 4:22 PM
> > Subject: Re: controlled library
> >
> >
> >> in my solution, you can apply one doc for each mesh term, or apply
> >> different
> >> keyword such as "mesh_1"...."mesh_10" for your top 10 terms...or u can
> >> group
> >> your mesh terms as one string then add into a field, which requires a
> >> simple
> >> string parser for the group string when you wanna read the terms...
> >>
> >> not sure if that works or answers your question...
> >>
> >> On 8/24/06, Zhao, Xin <xzhao9@jhmi.edu> wrote:
> >>>
> >>> Hi,
> >>> I have a design question. Here is what we try to do for indexing:
> >>> We designed an indexing tool to generate standard MeSH terms from
> >>> medical
> >>> citations, and then use Lucene to save the terms and citations for
> >>> future
> >>> search. The information we need to save are:
> >>> a) the exact mesh terms (top 10)
> >>> b) the score for each term
> >>> so the codings are like
> >>> -----------------------------------
> >>> for the top 10 MeSH Terms
> >>> myField=Field.Keyword("mesh", mesh.toLowerCase());
> >>> myField.setBoost(score);
> >>> doc.add(myFiled);
> >>> end for
> >>> ------------------------------------
> >>> as you could see we generate all the terms under named field "mesh".
> If
> >>> I
> >>> understand correctly, all the fields under the same name would
> >>> eventually  save into one field, with all the scores be normalized
> into
> >>> filed boost. In this case, we wouldn't be able to save separate score,
> >>> so
> >>> the information is lost. Am I correct? Is there anyway we could change
> >>> it? I
> >>> understand Lucene is for keyword search, and what we try to do is
> >>> Controlled
> >>> Vocabulary search, Any other tool we could use?
> >>>
> >>> Thank you,
> >>> Xin
> >>>
> >>>
> >>>
> >>
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message