lucene-java-user mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: Term Frequency for phrases
Date Fri, 08 Jan 2010 18:08:30 GMT
Which Analyzers are you using for your Gene and Token fields?
If they are NOT something akin to KeywordAnalyzer, you
have a problem. Specifically, most of the "regular" tokenizers will
break this stream up into three separate terms,
"brain", "natriuretic", and "peptide". If that's the case, there is no
single term "brain natriuretic peptide" in your index.
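
To make the difference concrete, here is a minimal sketch in plain Java.
It does not use the Lucene API: tokenize() and keyword() are hypothetical
stand-ins for a "regular" whitespace-splitting analyzer and a
keyword-style analyzer, and the class name is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class AnalyzerSketch {
    // Hypothetical stand-in for a "regular" analyzer:
    // lowercase the text and split it on whitespace.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Hypothetical stand-in for a keyword-style analyzer:
    // the whole field value becomes exactly one term, unchanged.
    static List<String> keyword(String text) {
        List<String> terms = new ArrayList<>();
        terms.add(text);
        return terms;
    }
}
```

With the tokenizing variant, the term list for the Gene value holds
"brain", "natriuretic" and "peptide" but never the whole phrase, so an
exact lookup of the phrase as one term finds nothing. With the keyword
variant the lookup only succeeds if the query string matches the stored
value exactly, including case and any trailing space.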

I'm assuming that your high-level task is to answer "how many times
does the phrase 'brain natriuretic peptide' appear in the index (or maybe
in a single doc)?", right?
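
If that is the task, the count can be derived from the individual terms
and their positions rather than from a single phrase term. The following
is a rough sketch in plain Java, not Lucene code: tokens() and
phraseFreq() are hypothetical helpers, and the whitespace tokenization is
an assumption standing in for whatever analyzer you actually use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class PhraseFrequencySketch {
    // Assumed tokenization: lowercase and split on whitespace.
    static List<String> tokens(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    // Counts how often the phrase's terms occur at consecutive
    // positions in the text. An empty phrase counts as zero matches.
    static int phraseFreq(String text, String phrase) {
        List<String> haystack = tokens(text);
        List<String> needle = tokens(phrase);
        if (needle.isEmpty()) {
            return 0;
        }
        int count = 0;
        for (int start = 0; start + needle.size() <= haystack.size(); start++) {
            if (haystack.subList(start, start + needle.size()).equals(needle)) {
                count++;
            }
        }
        return count;
    }
}
```

Calling phraseFreq(abstractText, "brain natriuretic peptide") on one
document's text gives a per-document phrase count. The terms must appear
adjacent and in order, so "brain peptide natriuretic" does not match.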

I really recommend that you get a copy of Luke and examine what's
actually in your index; it's invaluable.

See Jason's e-mail for another approach.

HTH
Erick


On Fri, Jan 8, 2010 at 12:37 PM, hrishim <smarthrish@yahoo.co.in> wrote:

>
> @All : Elaborating the problem
>
> The phrase is being indexed as a single token ...
> I have a Gene tag in the XML document, which looks like
> <Gene>brain natriuretic peptide </Gene>
> This phrase is present in the abstract text for the given document.
>
> Code is as :
>
> doc.add(new Field("Gene", geneName, Field.Store.YES,
> Field.Index.ANALYZED,Field.TermVector.YES));
>
> doc.add(new Field("Token", abstractText.toString().toLowerCase(),
> Field.Store.YES, Field.Index.ANALYZED,Field.TermVector.YES));
>
> When I retrieve all tokens as well as genes for a given doc and calculate
> the tf for each of these,
> a null exception is thrown. Term = brain natriuretic peptide
>
> TermDocs termDocs = indexReader.termDocs(term);
> termDocs.next();
> double tf = termDocs.freq();
>
> Regards,
> Hrishi
>
>
> Grant Ingersoll-6 wrote:
> >
> > When do you detect that they are phrases?  During indexing or during
> > search?
> >
> > On Jan 8, 2010, at 5:16 AM, hrishim wrote:
> >
> >>
> >> Hi .
> >> I have phrases like brain natriuretic peptide indexed as a single token
> >> using Lucene.
> >> When I calculate the term frequency for the same, the count is 0 since the
> >> tokens from the text are indexed separately, i.e. brain, natriuretic,
> >> peptide.
> >> Is there a way to solve this problem and get the term frequency for the
> >> entire phrase ?
> >>
> >> Regards,
> >> Hrishi
> >> --
> >> View this message in context:
> >> http://old.nabble.com/Term-Frequency-for-phrases-tp27073866p27073866.html
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem using Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
>
> --
> View this message in context:
> http://old.nabble.com/Term-Frequency-for-phrases-tp27073866p27079648.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
