lenya-dev mailing list archives

From Andreas Hartmann <andr...@apache.org>
Subject Re: Tag cloud
Date Fri, 03 Jul 2009 21:28:38 GMT
Andreas Hartmann wrote:
> I volunteered to take the lead in the implementation of the "tag cloud" 
> feature (see [1]).

I added a first version to the contributions area. The tag cloud is 
visible in the defaultfiredocs publication.

> Some initial ideas:
> IMO it makes sense to use the Dublin Core element "subject" to assign 
> tags to a document [2].

I hard-coded this element for the moment.

> Definition: "The topic of the resource."
> Comment: "Typically, the subject will be represented using keywords, key 
> phrases, or classification codes. Recommended best practice is to use a 
> controlled vocabulary. To describe the spatial or temporal topic of the 
> resource, use the Coverage element."
> I guess this can be made configurable, we could just use the DC subject 
> as the default. Since tags can contain spaces, we should use multiple 
> meta data values to store multiple tags. A nice GUI for this has to be 
> implemented. Would it be sufficient to extend the standard meta data GUI 
> to allow entering multiple values, or do we need a dedicated tag 
> management GUI? I'd suggest to start with the existing meta data GUI.

I haven't handled multiple values yet. For now, the tags are simply 
the terms that Lucene indexes for this field. TODO: Define the meta 
element values as keyword index fields instead of text fields to 
support phrases (multi-word terms).
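
To illustrate the TODO above, here is a minimal sketch of how the two 
kinds of index fields treat a multi-word tag. It does not use the Lucene 
API; the class and method names are hypothetical, and a real analyzer 
does more than split on whitespace and lowercase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this does not call Lucene.
class TagFieldSketch {

    // Text-style field: the value is split into separate terms,
    // so a multi-word tag is no longer a single unit.
    static List<String> asTextTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // Keyword-style field: the whole (trimmed) value is one term.
    static List<String> asKeywordTerm(String value) {
        List<String> terms = new ArrayList<>();
        terms.add(value.trim());
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(asTextTerms("content management"));
        System.out.println(asKeywordTerm("content management"));
    }
}
```

With a text-style field the cloud would show "content" and "management" 
as two tags; with a keyword-style field the phrase stays one tag.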

> Finding all documents with a certain tag is rather simple since all meta 
> data are indexed.

I used the standard search for this purpose. I had to extend the lucene 
module sitemap with a "raw" query type. The query looks like this:


It's rather ugly that this query appears in the search box. Maybe we 
have to add a dedicated meta data search or use another concept for 
special search terms.

> The real challenge is to generate a list of all 
> existing tags.
> Maybe there is a performant way to generate the cloud using the index, 
> e.g. via a wildcard query. But this still needs some postprocessing, so 
> we'll probably have to cache the tag cloud.

Lucene allows enumerating all terms for a particular field. To filter 
by language, I had to add a loop which searches the index once for each 
term. I guess this takes quite a lot of time. Maybe someone knows a 
better solution? Or maybe a newer version of Lucene has a more flexible 
API for term enumeration?

If you omit the language parameter of the IndexTermsGenerator, the 
language filtering is skipped and the listing of the terms is probably 
pretty fast.
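
The two code paths described above can be sketched roughly as follows. 
This is an in-memory model and not the IndexTermsGenerator code: the 
Doc class, the method names and the list standing in for the index are 
all assumptions made for illustration, and no Lucene calls are used.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory model of the index; not the Lenya implementation.
class IndexTermsSketch {

    static final class Doc {
        final String language;
        final List<String> tags;

        Doc(String language, List<String> tags) {
            this.language = language;
            this.tags = tags;
        }
    }

    // Enumerate every distinct term of the tag field.
    static Set<String> allTerms(List<Doc> index) {
        Set<String> terms = new TreeSet<>();
        for (Doc doc : index) {
            terms.addAll(doc.tags);
        }
        return terms;
    }

    // Count documents per term. A null language skips the filtering.
    static Map<String, Integer> termCounts(List<Doc> index, String language) {
        Map<String, Integer> counts = new TreeMap<>();
        if (language == null) {
            // Fast path: a single pass, no per-term search.
            for (Doc doc : index) {
                for (String tag : new TreeSet<>(doc.tags)) {
                    counts.merge(tag, 1, Integer::sum);
                }
            }
            return counts;
        }
        // Slow path: one "search" over the index for each term.
        for (String term : allTerms(index)) {
            int hits = 0;
            for (Doc doc : index) {
                if (language.equals(doc.language) && doc.tags.contains(term)) {
                    hits++;
                }
            }
            if (hits > 0) {
                counts.put(term, hits);
            }
        }
        return counts;
    }
}
```

The slow path does work proportional to the number of terms times the 
cost of one search, which is why caching the result, or storing the 
language together with the tag, might be worth considering.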

> If Lucene doesn't help, we have another nifty feature for this purpose: 
> the RepositoryListener interface. By registering a listener with the 
> repository, we can extract the tags of a document when it is saved, and 
> update the tag cloud accordingly. The cloud also has to be updated when 
> a document is removed. The details are a bit tricky (concurrency, 
> queuing), but I think there's nothing that can't be solved. In this case 
> we have to store the tag cloud. My first idea would be to use a 
> dedicated document for this purpose.
> I'd prefer the dynamic generation using Lucene, though, because 
> otherwise we store redundant information in the repository which always 
> carries a certain risk.

I think we can use Lucene. There is no need for a repository listener.

> Another issue is supporting the user when she enters the tags. The 
> system should present a list of existing tags, possibly with some kind 
> of autocomplete functionality. But I guess when we manage to generate 
> the cloud, this feature can easily be added.

I haven't tackled this issue yet.

Any comments and improvements are greatly appreciated!

-- Andreas

Andreas Hartmann, CTO
BeCompany GmbH
Tel.: +41 (0) 43 818 57 01
