forrest-dev mailing list archives

From Stefano Mazzocchi
Subject Re: about lucent and exist
Date Sat, 13 Sep 2003 08:54:07 GMT

On Friday, Sep 12, 2003, at 02:36 Europe/Rome, Juan Jose Pablos wrote:

> Ramon Prades wrote:
>>> Which makes me realize that Lucene is a *text search engine*.
>> That's the main advantage of Lucene: it's language independent. In fact,
>> Forrest isn't concerned at all about the input documents: you have to
>> write an indexer for each format you want to use, i.e. if you want to
>> search in Microsoft Word documents, you have to write a class to open
>> and process them.
> I am not worried about fixing just one issue. Being XML aware means
> that you can do (after using forms to create this XPath query):
> //faqs/part/id['general']/faq/question[contains(.,'xsl')]
> So you would search for "xsl" within a collection of FAQ XML documents
> that have a faq part called 'general'.
> I am not sure how difficult it is to get there with Lucene, but eXist
> seems to do it already.

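Lucene is built around an inverted index that maps terms to flat 
documents and fields. It keeps no model of the element hierarchy, so it 
can't evaluate a path query like the one above.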

For that, you need what is called an "XML database", which could be, in 
the simplest case, a collection of files in a file system and a very 
slow incremental collector that opens all files, scans them, 
collects the matching elements and returns the results as a new 
document. In the best case, it's a semi-structured database with 
multidimensional indexing features (eXist and Xindice are much closer 
to that).
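
To make the "simplest case" concrete, here is a rough sketch of such a 
collector in Python. The function name `collect`, the `results`/`hit` 
wrapper elements and the sample files are all made up for illustration. 
Note that the standard library's ElementTree only understands a small 
XPath subset (no contains()), so the text test is done in plain Python, 
and the path is written with an `id` attribute, which is an assumption 
about the FAQ markup.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def collect(root_dir, path, needle):
    """Open every XML file under root_dir, scan it, and gather matches.

    No index is involved: every query re-reads the whole collection,
    which is exactly why this approach is slow.
    """
    results = ET.Element("results")  # the "new document" holding the hits
    for xml_file in sorted(Path(root_dir).rglob("*.xml")):
        root = ET.parse(xml_file).getroot()
        for element in root.iterfind(path):
            # ElementTree has no contains(), so filter the text here.
            if needle in "".join(element.itertext()):
                hit = ET.SubElement(results, "hit", {"source": xml_file.name})
                hit.append(element)
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Two hypothetical FAQ files; only the first has a 'general' part.
        Path(tmp, "a.xml").write_text(
            "<faqs><part id='general'><faq>"
            "<question>How do I apply an xsl stylesheet?</question>"
            "</faq></part></faqs>"
        )
        Path(tmp, "b.xml").write_text(
            "<faqs><part id='build'><faq>"
            "<question>Where is the xsl directory?</question>"
            "</faq></part></faqs>"
        )
        found = collect(tmp, ".//part[@id='general']/faq/question", "xsl")
        print(ET.tostring(found, encoding="unicode"))
```

Every query walks every file, so this only scales to small collections. 
An indexed store avoids that by answering the structural part of the 
query without opening the documents.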

Take a look at JSR 170, the Content Repository for Java Technology API, 
for another possibility: it includes a SQL-like query language for 
hierarchies of nodes.

>> You can do the same with Lucene, it's all down to the Indexer. In
>> mine, I index Forrest documents by mixing all the text. This is because
>> I don't think queries like "p:lucene" (read: search all docs with the
>> word "lucene" inside a "p" tag) are a good idea (especially for
>> non-programmers).
> I do not think that users should deal with that; for them that
> language is hidden.

You are trying to create "virtual documents" out of XML-aware queries 
over a repository of hierarchical content (not necessarily XML, but 

Forget Lucene, it's not the right tool and not the right direction.

>> Having said that, I think certain tags with a very strong meaning can
>> be used. For example "authors" and "title" (both working in my code):
>> this can be useful, especially if we have radio buttons for "search in
>> authors only" and "search in title only".
> Semantic searching (I thought about something similar before I knew
> the name) is about using tags to limit the search and get better
> results.

Eh, if only it were that easy. You are implying that:

  1) a tag is used to indicate the semantics of the nodes contained 
therein. Although this is generally the case (and there is the ability 
to have RDF/XML perform this way) this is not generalizable.

  2) without namespaces, there is a tremendous semantic collision. With 
namespaces, you are assuming that the namespace refers to the 'meaning' 
of the tag, again not generalizable.
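
A small Python sketch of the collision in point 2, using only the 
standard library. The document, the two namespace URIs and the helper 
`local_name` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: three "title" elements with unrelated meanings.
DOC = """<catalog xmlns:book="http://example.org/book"
         xmlns:job="http://example.org/job">
  <title>unqualified</title>
  <book:title>Moby Dick</book:title>
  <job:title>Editor</job:title>
</catalog>"""

root = ET.fromstring(DOC)


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]


# Matching on the bare tag name lumps all three together.
by_local_name = [el.text for el in root.iter() if local_name(el.tag) == "title"]

# Matching on the qualified name separates them, but the URI is only a
# unique label; nothing in it states what a "title" means.
by_qualified = [el.text for el in root.iter("{http://example.org/book}title")]

print(by_local_name)
print(by_qualified)
```

The qualified lookup avoids the name clash, but it still only tells you 
which vocabulary the tag came from, not what the tag means.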

This said, I agree that having the ability to run XQuery queries over a 
content repository that exposes XML views would be a tremendous help. 
Just don't call it "semantic searching", because that's not even close 
(but very few are able to explain the difference and the reason why we 
need the entire RDF stack in the first place, so don't worry).

