lucene-java-user mailing list archives

From Jake Mannix <>
Subject Re: Creating tag clouds with lucene
Date Fri, 06 Nov 2009 08:39:43 GMT
On Fri, Nov 6, 2009 at 12:25 AM, Mathias Bank <> wrote:

> Well, it could be a facet search if there were tags available, but
> if you just want a "tag cloud" generated from full text, I don't
> see how a facet search could help to generate this cloud.
> Unfortunately, I don't have tags in my data. What I need is the
> information about which terms (or multi-word terms) are the most used
> in this data. At first I thought of using Carrot2, which uses a
> specialized clustering algorithm. But I wondered whether it is
> possible to get the most-used terms out of Lucene directly.

It is a facet search: take the field you want the cloud for
(I called it the tags field, but it can be any field - a full-text "body"
field, for example),
and set up a multi-valued facet on that field. This will return the
number of documents matching your given query which contain each
of the given terms (one integer count per term).  Sorting by count
and picking the top N is what you normally do in a facet search, and then
you use the counts themselves to decide how big to make each term.
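The count-and-sort step above can be sketched in plain Java (this is an illustration over hypothetical in-memory result documents, not the Lucene facet API itself):

```java
import java.util.*;

public class TagCloudSketch {
    // Hypothetical result set: the distinct analyzed terms of each
    // document that matched the query.
    static List<Set<String>> resultDocs = Arrays.asList(
        new HashSet<>(Arrays.asList("lucene", "search", "index")),
        new HashSet<>(Arrays.asList("lucene", "facet", "index")),
        new HashSet<>(Arrays.asList("lucene", "cloud")));

    // Count, per term, how many matching documents contain it, then
    // keep the top N by count - those counts drive the cloud sizes.
    static List<Map.Entry<String, Integer>> topTerms(int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> doc : resultDocs)
            for (String term : doc)
                counts.merge(term, 1, Integer::sum);
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> b.getValue() - a.getValue());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    public static void main(String[] args) {
        System.out.println(topTerms(2)); // most frequent terms first
    }
}
```

In a real index the inner loop would of course be driven by the facet counter over the term dictionary, not by materialized per-document term sets.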

For a 10-million-document single index, if your field has a lot of unique
terms and you do nothing to prune them down, this kind of query could be
expensive, yes.   But you'll want to prune down full text anyway, or else your
cloud will have whatever words are just uncommon enough not to be stop words (if
you're using a stoplist), or of course the common stop words themselves (if you
aren't).  That won't be very informative - you want the terms which are
most descriptive *of that query*, which is why I suggested doing a modified
facet query, where you normalize by the docFreq of the term as you
count, which effectively measures the over/under-representation of each
term in the documents matching your query filter.
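The docFreq normalization might look like this minimal sketch, with made-up counts standing in for the real index statistics:

```java
import java.util.*;

public class OverRepresentation {
    // Hypothetical corpus-wide document frequencies: how many docs in
    // the *whole* index contain each term (Lucene's docFreq).
    static Map<String, Integer> docFreq = new HashMap<>();
    static {
        docFreq.put("lucene", 9000); // near stop-word in this index
        docFreq.put("facet", 40);
        docFreq.put("cloud", 25);
    }

    // Normalize the in-results count by the corpus docFreq, so a rare
    // term appearing often in the results outranks a ubiquitous one.
    static double score(String term, int countInResults) {
        return countInResults / (double) docFreq.get(term);
    }

    public static void main(String[] args) {
        // "lucene" in 80 of the result docs vs "facet" in 30:
        System.out.println(score("lucene", 80) < score("facet", 30)); // prints true: facet wins
    }
}
```

Simple ratio normalization is just one choice; a log-odds or tf-idf style weight would serve the same over-representation idea.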


> Glen has mentioned that he is doing this for full-text data. He
> mentioned that he is using the IndexReader.termDocs(Term term) method.
> So I think he iterates over all terms and looks at how many documents
> each term occurs in. But what I don't see is: how does this method work
> with a filter? Do you first look for all documents which are valid for
> the used filter and then iterate over all terms, counting only documents
> in this filtered set? I can't imagine that this is performant, because I
> have more than 10 million documents (fast growing).
> Mathias
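The filtered counting Mathias asks about above can be sketched with hypothetical posting lists and a java.util.BitSet standing in for the cached filter; the real code would iterate IndexReader.termDocs, but the intersection logic is the same:

```java
import java.util.*;

public class FilteredTermCount {
    // Hypothetical posting lists: term -> ids of the docs containing it
    // (what iterating IndexReader.termDocs(term) would produce).
    static Map<String, int[]> postings = new HashMap<>();
    static {
        postings.put("lucene", new int[]{0, 1, 2, 5});
        postings.put("facet",  new int[]{1, 3});
    }

    // Count only the docs that pass the filter: the filter's BitSet is
    // built once per query, then intersected with each term's postings.
    static int countFiltered(String term, BitSet filter) {
        int count = 0;
        for (int doc : postings.get(term))
            if (filter.get(doc)) count++; // skip docs outside the filter
        return count;
    }

    public static void main(String[] args) {
        BitSet filter = new BitSet();
        filter.set(1); filter.set(2); filter.set(3); // docs matching the query
        System.out.println(countFiltered("lucene", filter)); // prints 2
    }
}
```

The point is that the filter is computed once, so each per-term pass is just a cheap bit test per posting rather than a fresh search.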
> 2009/11/6 Chris Lu <>:
> > Isn't the tag cloud just another facet search? The only difference is
> > that the tag is multi-valued.
> >
> > Basically just go through the search results and find all unique tag
> values.
> >
> > --
> > Chris Lu
> > -------------------------
> >
> > Mathias Bank wrote:
> >>
> >> Hi,
> >>
> >> I want to calculate a tag cloud for search results. I have seen that
> >> it is possible to extract the top 20 words out of the Lucene index. Is
> >> there also a way to extract the top 20 words out of search
> >> results (or filter results) in Lucene?
> >>
> >> Mathias
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail:
> >> For additional commands, e-mail:
> >>
> >>
> >
> >
