Andreas Hartmann wrote:
> I volunteered to take the lead in the implementation of the "tag cloud"
> feature (see [1]).
I added a first version to the contributions area. The tag cloud is
visible in the defaultfiredocs publication.
> Some initial ideas:
>
> IMO it makes sense to use the Dublin Core element "subject" to assign
> tags to a document [2].
I hard-coded this element for the moment.
> Definition: "The topic of the resource."
> Comment: "Typically, the subject will be represented using keywords, key
> phrases, or classification codes. Recommended best practice is to use a
> controlled vocabulary. To describe the spatial or temporal topic of the
> resource, use the Coverage element."
>
> I guess this can be made configurable, we could just use the DC subject
> as the default. Since tags can contain spaces, we should use multiple
> meta data values to store multiple tags. A nice GUI for this has to be
> implemented. Would it be sufficient to extend the standard meta data GUI
> to allow entering multiple values, or do we need a dedicated tag
> management GUI? I'd suggest to start with the existing meta data GUI.
I haven't taken care of the multi-value handling yet. For now, the tags
are simply the terms that Lucene indexes. TODO: Define the meta element
values as keyword index fields instead of text fields to support phrases
(multi-word terms).
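
To illustrate why the field type matters for multi-word tags, here is a
minimal sketch in plain Java. It does not use the Lucene API; the class
and method names are made up for this example, and the "text" variant
only roughly mimics what an analyzer does (splitting on whitespace and
lower-casing).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration, not Lucene code: it mimics how one meta data
// value ends up as index terms under the two field types.
class TagFieldSketch {

    // "Text" field: the value is split into lower-cased words,
    // so a multi-word tag is broken into several terms.
    static List<String> asTextTerms(String value) {
        List<String> terms = new ArrayList<>();
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    // "Keyword" field: the whole value is kept as one single term.
    static List<String> asKeywordTerms(String value) {
        return Collections.singletonList(value);
    }

    public static void main(String[] args) {
        System.out.println(asTextTerms("content management"));    // two terms
        System.out.println(asKeywordTerms("content management")); // one term
    }
}
```

With the text variant, the tag "content management" would show up in
the cloud as two separate tags, which is what the TODO is meant to avoid.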
> Finding all documents with a certain tag is rather simple since all meta
> data are indexed.
I used the standard search for this purpose. I had to extend the lucene
module sitemap with a "raw" query type. The query looks like this:
\{http\://purl.org/dc/elements/1.1/\}subject:foobar
It's rather ugly that this appears in the search box. Maybe we need to
add a dedicated meta data search or use another concept for special
search terms.
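
The backslashes in the query above escape characters that are special in
the Lucene query syntax. A minimal sketch of how such a raw query string
could be assembled is shown below. The class and its methods are
hypothetical and not part of Lenya or Lucene; Lucene's QueryParser has
an escape method that should do the same job, and the hand-written
version is only there to keep the example free of dependencies.

```java
// Hypothetical helper, not part of Lenya or Lucene.
class RawTagQuery {

    // Characters treated as special by the classic Lucene query syntax.
    private static final String SPECIAL = "\\+-!(){}[]^\"~*?:&|";

    // Prefix every special character with a backslash.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Build "<escaped field name>:<escaped tag>", where the field name
    // has the form {namespaceUri}element.
    static String forTag(String namespaceUri, String element, String tag) {
        String field = "{" + namespaceUri + "}" + element;
        return escape(field) + ":" + escape(tag);
    }

    public static void main(String[] args) {
        System.out.println(
            forTag("http://purl.org/dc/elements/1.1/", "subject", "foobar"));
    }
}
```

Note that this sketch does not handle whitespace, so a multi-word tag
would additionally need to be quoted.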
> The real challenge is to generate a list of all
> existing tags.
>
> Maybe there is a performant way to generate the cloud using the index,
> e.g. via a wildcard query. But this still needs some postprocessing, so
> we'll probably have to cache the tag cloud.
Lucene allows enumerating all terms for a particular field. To filter
by language, I had to add a loop which searches the index for each
term. I guess this takes quite a lot of time. Maybe someone knows a
better solution? Or maybe a new version of Lucene has a more flexible
API for term enumeration?
If you omit the language parameter of the IndexTermsGenerator, the
language filtering is skipped and the listing of the terms is probably
pretty fast.
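
The cost difference between the two modes can be sketched with a small
in-memory model. This is only an illustration under assumptions: the
maps below stand in for the Lucene index, and none of the names are
Lucene or Lenya API (in particular, this is not the actual
IndexTermsGenerator code).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for the index.
class TermCountSketch {

    // term -> ids of the documents carrying that term in the subject field
    final Map<String, Set<Integer>> postings = new TreeMap<>();
    // document id -> language
    final Map<Integer, String> languages = new HashMap<>();

    void add(int docId, String language, String... tags) {
        languages.put(docId, language);
        for (String tag : tags) {
            Set<Integer> docs = postings.get(tag);
            if (docs == null) {
                docs = new TreeSet<>();
                postings.put(tag, docs);
            }
            docs.add(docId);
        }
    }

    // language == null: every document of a term is counted without
    // looking at its language. Otherwise each term needs an extra pass
    // over its documents, which corresponds to the per-term search.
    Map<String, Integer> counts(String language) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            int n = 0;
            for (int docId : e.getValue()) {
                if (language == null || language.equals(languages.get(docId))) {
                    n++;
                }
            }
            if (n > 0) {
                result.put(e.getKey(), n);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TermCountSketch index = new TermCountSketch();
        index.add(1, "en", "cms", "xml");
        index.add(2, "de", "cms");
        index.add(3, "en", "cms");
        System.out.println(index.counts("en"));  // {cms=2, xml=1}
        System.out.println(index.counts(null));  // {cms=3, xml=1}
    }
}
```

The resulting counts per term are what the cloud would use to weight
the tags.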
> If Lucene doesn't help, we have another nifty feature for this purpose:
> the RepositoryListener interface. By registering a listener with the
> repository, we can extract the tags of a document when it is saved, and
> update the tag cloud accordingly. The cloud also has to be updated when
> a document is removed. The details are a bit tricky (concurrency,
> queuing), but I think there's nothing that can't be solved. In this case
> we have to store the tag cloud. My first idea would be to use a
> dedicated document for this purpose.
>
> I'd prefer the dynamic generation using Lucene, though, because
> otherwise we store redundant information in the repository which always
> carries a certain risk.
I think we can use Lucene. No need for the repository listening.
> Another issue is supporting the user when she enters the tags. The
> system should present a list of existing tags, possibly with some kind
> of autocomplete functionality. But I guess when we manage to generate
> the cloud, this feature can easily be added.
I haven't tackled this issue yet.
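
Once the list of existing tags is available, a simple prefix lookup
might be enough as a starting point for the autocomplete. The following
is a minimal sketch under that assumption; the class name is
hypothetical and the sorted set stands in for the terms enumerated from
the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical helper, not part of Lenya.
class TagSuggestSketch {

    // In a sorted set, the strings starting with the prefix form the
    // range from the prefix itself up to the prefix followed by the
    // largest char value (exclusive).
    static List<String> suggest(SortedSet<String> tags, String prefix) {
        return new ArrayList<>(
            tags.subSet(prefix, prefix + Character.MAX_VALUE));
    }

    public static void main(String[] args) {
        SortedSet<String> tags =
            new TreeSet<>(Arrays.asList("cms", "cocoon", "lucene", "xml"));
        System.out.println(suggest(tags, "c"));  // [cms, cocoon]
    }
}
```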
Any comments and improvements are greatly appreciated!
-- Andreas
--
Andreas Hartmann, CTO
BeCompany GmbH
http://www.becompany.ch
Tel.: +41 (0) 43 818 57 01