forrest-dev mailing list archives

From "Ramon Prades" <rpra...@porcelanosa.com>
Subject RE: about lucent and exist
Date Thu, 11 Sep 2003 15:41:34 GMT
Hi Cheche

Please look at my comments below.

Regards.

Ramon

> -----Original Message-----
> From: Juan Jose Pablos [mailto:cheche@che-che.com] 
> Sent: Thursday, 11 September 2003 16:50
> To: forrest-dev@xml.apache.org
> Subject: about lucent and exist
> 
> 
> Hi,
> 
> I started looking at Ramon Prades' bug. On the to-do list I can see:
> 
>      - Improve ForrestIndexer: It should work with accented characters
>       ("a" and "á" should be the same) and should reduce indexes to
>       their roots (i.e. jump, jumper, jumping should all be the same
>       index).

This is just improving the existing indexing algorithm. Should be very easy.
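
A minimal sketch of the two improvements in that to-do item, written in plain Python rather than against Lucene's API. The function names (fold_accents, crude_stem, index_term) are made up for illustration, and the suffix stripping is deliberately naive; a real indexer would more likely plug in an established stemming algorithm.

```python
import unicodedata


def fold_accents(text):
    # Decompose each character (NFD) and drop the combining marks,
    # so that "á" is indexed as "a".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def crude_stem(word):
    # Hypothetical, very naive root reduction: strip one common English
    # suffix while keeping at least three characters of the word.
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_term(word):
    # The term that would actually be stored in (or looked up from) the index.
    return crude_stem(fold_accents(word.lower()))
```

The same index_term function would have to be applied both when indexing and when parsing the query, otherwise "jumping" in a query would never match the stored root.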

> 
> Which makes me realize that Lucene is a *text search engine*.

That's the main advantage of Lucene: it's language independent. In fact,
Forrest isn't concerned at all about the input documents: you have to write
an indexer for each format you want to use, e.g. if you want to search in
Microsoft Word documents, you have to write a class to open and process
them.
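
The "one indexer per format" idea can be sketched as a small registry of text extractors keyed by file extension. This is plain Python for illustration only, not Lucene or Forrest code; EXTRACTORS, extract_plain, extract_xml and extract_text are hypothetical names.

```python
import os
import xml.etree.ElementTree as ET


def extract_plain(raw):
    # Plain text: decode the bytes and index them as they are.
    return raw.decode("utf-8", errors="replace")


def extract_xml(raw):
    # XML: ignore the markup and keep only the text nodes.
    root = ET.fromstring(raw)
    return " ".join(t.strip() for t in root.itertext() if t.strip())


# Hypothetical registry: supporting a new format means adding one entry.
EXTRACTORS = {".txt": extract_plain, ".xml": extract_xml}


def extract_text(filename, raw):
    # Pick the extractor from the file extension; fail loudly if none exists.
    ext = os.path.splitext(filename)[1].lower()
    if ext not in EXTRACTORS:
        raise ValueError("no indexer registered for %r" % ext)
    return EXTRACTORS[ext](raw)
```

Everything after extract_text only ever sees plain text, which is why the search side does not need to know about the input format.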

> 
> We can fix issues related to the fact that Lucene is not XML aware, 
> and help them with the testing, but I do not feel that it is an ideal 
> situation. Does anyone know if Lucene is moving towards greater XML 
> awareness?

No, Lucene is about searching all sorts of files (even binaries if you have
the indexer).

> 
> Should we look at eXist instead? I saw their demo [1] and it is very 
> much what "semantic searching" is about, isn't it?

You can do the same with Lucene; it's all down to the indexer. In mine, I
index Forrest documents by mixing all the text. This is because I don't
think queries like "p:lucene" (read: search all docs with the word "lucene"
inside a "p" tag) are a good idea (especially for non-programmers).

Having said that, I think certain tags with a very strong meaning can be
used, for example "authors" and "title" (both working in my code). This can
be useful, especially if we have radio buttons for "search in authors only"
and "search in title only".
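
To illustrate the scheme described above (one catch-all field that mixes all the text, plus separate "authors" and "title" fields), here is a toy inverted index in plain Python. It is only a sketch under those assumptions and is not the actual indexer code; TinyIndex and the field name "all" are invented for the example.

```python
from collections import defaultdict


class TinyIndex:
    def __init__(self):
        # field name -> term -> set of document ids
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, title, authors, body):
        fields = {"title": title, "authors": authors}
        # "all" mixes every piece of text, so a plain query needs no prefix.
        fields["all"] = " ".join([title, authors, body])
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[field][term].add(doc_id)

    def search(self, term, field="all"):
        # The radio buttons on the search page would only change `field`.
        return sorted(self.postings[field].get(term.lower(), set()))


index = TinyIndex()
index.add("d1", "Lucene in Forrest", "Ramon Prades", "Indexing notes")
index.add("d2", "Site search", "Cheche", "We use lucene here")
print(index.search("lucene"))                 # searches the mixed text
print(index.search("lucene", field="title"))  # "search in title only"
```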

To finish my first version and have Lucene up and running in Forrest I
suggest doing the following:

- Index documents by asking Cocoon for the XML views. This will bring
files like "todo" or "changes" into the search scope.

- Improve the indexer to store a "normalized" version of the content
(replacing accented characters).

- Improve the search page by including radio buttons (to search in authors
and title).

- Add searching to static sites.

With all this, Forrest will have a very good search engine: it's fast
and it's simple (and it's Apache).

I wanted to do all this a few weeks ago, but I've been awfully busy (who
isn't?). I plan to start again in 2 or 3 weeks.


> 
> Cheers,
> Cheche
> 
> [1] http://130.83.186.203/exist/simple/xquery.xsp
> 
> 
> 


