lucene-solr-user mailing list archives

From Tomás Fernández Löbbe <tomasflo...@gmail.com>
Subject Re: Top 10 Terms in Index (by date)
Date Tue, 02 Apr 2013 13:16:49 GMT
Oh, I see. Essentially, you want the sum of the term frequencies for
every term in a subset of documents (instead of the document frequency that
the FacetComponent would give you). I don't know of an easy, out-of-the-box
solution for this. I know the TermVectorComponent will give you the tf for
every term in a document, but I'm not sure whether you can filter or sort on it.
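
A rough sketch of what a TermVectorComponent request restricted by date might
look like. The parameters tv, tv.tf and tv.fl are TermVectorComponent
parameters; the host, the core name "collection1", the "/tvrh" handler path
(as registered in the example solrconfig.xml), the uniqueKey name "id" and the
date range are assumptions for illustration only. It returns per-document tf
values, so any summing across documents would still have to happen on the
client.

```python
from urllib.parse import urlencode


def build_tv_url(base_url, start, end, rows=100):
    """Build a term-vector request URL limited to a dateCreated range."""
    params = {
        "q": "*:*",
        # Restrict the document subset by date (field name from the schema below).
        "fq": f"dateCreated:[{start} TO {end}]",
        "fl": "id",          # assumption: the uniqueKey field is called "id"
        "rows": rows,
        "tv": "true",        # enable the TermVectorComponent
        "tv.tf": "true",     # return the term frequency per document
        "tv.fl": "content",  # field that was indexed with termVectors="true"
        "wt": "json",
    }
    # "/tvrh" is an assumption: use whatever handler has the component attached.
    return f"{base_url}/tvrh?{urlencode(params)}"


url = build_tv_url(
    "http://localhost:8983/solr/collection1",  # hypothetical host and core
    "2013-03-01T00:00:00Z",
    "2013-04-01T00:00:00Z",
)
print(url)
```

The filter query only narrows which documents come back; the component itself
does not aggregate or sort terms across them.
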
Maybe you can do something like:
https://issues.apache.org/jira/browse/LUCENE-2393
or what's suggested here:
http://search-lucene.com/m/of5Fn1PUOHU/
but I have never used something like that.
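
To make the difference between the two counts concrete, here is a minimal
client-side sketch using the three example comments from the quoted message
below. The tokenizer is a naive stand-in (lowercase, split on non-letters) and
not what text_general actually does, and the function names are made up for
this example.

```python
import re
from collections import Counter

# The three example comments from the quoted message.
docs = [
    "I think, therefore I am like you",
    "You think too much",
    "I think that I think much as you",
]


def tokenize(text):
    # Naive stand-in for an analyzer: lowercase, keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())


def total_term_freq(documents):
    """Sum of term frequencies: every occurrence in every document counts."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def doc_freq(documents):
    """Document frequency: each term counts at most once per document."""
    counts = Counter()
    for doc in documents:
        counts.update(set(tokenize(doc)))
    return counts


ttf = total_term_freq(docs)
df = doc_freq(docs)

# Print the top terms by total occurrences, with document frequency alongside.
for term, total in ttf.most_common(4):
    print(term, total, df[term])
```

For "think" the total is 4 while the document frequency is 3, which is the gap
between what faceting reports and what the report needs.
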

Tomás



On Mon, Apr 1, 2013 at 9:58 PM, Andy Pickler <andy.pickler@gmail.com> wrote:

> I need "total number of occurrences" across all documents for each term.
> Imagine this...
>
> Post #1: "I think, therefore I am like you"
> Reply #1: "You think too much"
> Reply #2: "I think that I think much as you"
>
> Each of those "documents" is put into 'content'.  Pretending I don't have
> stop words, the top term query (not considering dateCreated in this
> example) would result in something like...
>
> "think": 4
> "I": 4
> "you": 3
> "much": 2
> ...
>
> Thus, just a "number of documents" approach doesn't work, because if a word
> occurs more than once in a document it needs to be counted that many
> times.  That seemed to rule out faceting like you mentioned, as well as the
> TermsComponent (which, as I understand it, also only counts "documents").
>
> Thanks,
> Andy Pickler
>
> On Mon, Apr 1, 2013 at 4:31 PM, Tomás Fernández Löbbe <tomasflobbe@gmail.com> wrote:
>
> > So you have one document per user comment? Why not use faceting plus
> > filtering on the "dateCreated" field? That would count "number of
> > documents" for each term (so, in your case, if a term is used twice in one
> > comment it would only count once). Is that what you are looking for?
> >
> > Tomás
> >
> >
> > On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler <andy.pickler@gmail.com> wrote:
> >
> > > Our company has an application that is "Facebook-like" for usage by
> > > enterprise customers.  We'd like to do a report of "top 10 terms entered
> > > by users over (some time period)".  With that in mind I'm using the
> > > DataImportHandler to put all the relevant data from our database into a
> > > Solr 'content' field:
> > >
> > > <field name="content" type="text_general" indexed="true" stored="false"
> > > multiValued="false" required="true" termVectors="true"/>
> > >
> > > Along with the content is the 'dateCreated' for that content:
> > >
> > > <field name="dateCreated" type="tdate" indexed="true" stored="false"
> > > multiValued="false" required="true"/>
> > >
> > > I'm struggling with the TermVectorComponent documentation to understand
> > > how I can put together a query that answers the 'report' mentioned above.
> > > For each document I need each term counted however many times it is entered
> > > (content of "I think what I think" would report 'think' as used twice).
> > > Does anyone have any insight as to whether I'm headed in the right
> > > direction and then what my query would be?
> > >
> > > Thanks,
> > > Andy Pickler
> > >
> >
>
