lucene-dev mailing list archives

From Joaquin Delgado <joaquin.delg...@oracle.com>
Subject Re: "Advanced" query language
Date Tue, 20 Dec 2005 00:44:15 GMT
Comments in-line

Wolfgang Hoschek wrote:

> Yes, there are interesting impls out there. I have myself implemented
> XQuery fulltext search via extension functions built on Lucene. See
> http://dsd.lbl.gov/nux/index.html#Google-like%20realtime%20fulltext%20search%20via%20Apache%20Lucene%20engine
>
> However, rather than targeting fulltext search of infrequent queries
> over huge persistent data archives (historic search), Nux targets
> streaming fulltext search of huge numbers of queries over
> comparatively small transient realtime data (prospective search),
> e.g. 100000 queries/sec ballpark. Think XML router. That's probably
> distinctly different from what many (most?) other folks would like to
> do, and requires a different, somewhat non-standard, architecture.
>
> [The underlying lucene code lives in lucene SVN in the
> lucene/contrib/memory module, the remainder is in Nux.]
>
> Implementing XQuery in full compliance with the spec is a rather  
> gigantic undertaking. Separating the XQuery language and the fulltext  
> language greatly simplified the system design, and made it more  
> flexible and extensible.

[JOAQUIN] One arguable advantage of this new XQuery FT draft is
that the semantics (http://www.w3.org/TR/xquery-full-text/#tq-semantics)
are defined using XQuery functions, so it is relatively easy to build
a "dumb" XQuery-FT compliant engine from these definitions :-)  Here is
a Java-based XQuery engine developed at Cornell that satisfies most of
the working draft's requirements:
http://www.cs.cornell.edu/database/Quark/quark_main.html

> Further, consider that fulltext search capabilities are typically
> quite open ended and context/application specific. Seems to me that  
> that's one of the reasons why lucene is more a set of interfaces and  
> diverse building blocks than a complete end user system. I find it  
> difficult to believe that making the fulltext language an *integral  
> part of XQuery* will enable sufficient "extension points" to prove  
> meaningful to end users and implementors. Standards evolve at a  
> glacial pace; it effectively means that most or all flexibility is  
> lost. I tend to think that the W3C is jumping the gun and attempting  
> to standardize what is more an R&D concept than a well understood set  
> of capabilities across a wide range of actual real world use cases,  
> and it does so in a non-modular manner.

Full-text search remains open ended and context/app specific, so it
makes sense to leave Lucene as it is and still have, for example, Nutch.
However, the moment you promote INTEROPERABILITY with other
search/retrieval systems by XMLizing the query input and the result
output, as Mark is doing, it makes sense to adhere to standards, and the
standard for querying XML is XQuery. Because of the nature of the data
(XML), full-text becomes a *must* requirement of the standard. If Mark
comes up with yet another query language with some custom tags, it would
deny the fact that search systems need to communicate with one another,
and would thus re-invent the wheel. Besides, almost 80% of all full-text
operators (Boolean, wildcards, proximity, etc.) differ only in syntax
from one search engine to another. Just look at another "Common Query
Language" now being used by the Library of Congress
(http://www.loc.gov/standards/sru/cql/) for federated search.
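
To make the syntax-only point concrete, here is a minimal sketch (Python,
purely illustrative): one neutral query tree rendered both in Lucene
query-parser style and in CQL style. The Term and BoolOp classes and the
to_lucene/to_cql helpers are hypothetical names made up for this example;
they are not part of Lucene or of any CQL toolkit. Escaping of special
characters is ignored, and the CQL output assumes a server that accepts bare
index names such as "title".

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Term:
    """A single field/value condition (hypothetical helper type)."""
    field: str
    text: str


@dataclass(frozen=True)
class BoolOp:
    """A binary Boolean node; op is "AND" or "OR" (hypothetical helper type)."""
    op: str
    left: "Node"
    right: "Node"


Node = Union[Term, BoolOp]


def to_lucene(node: Node) -> str:
    """Render the tree in Lucene query-parser style: field:"text"."""
    if isinstance(node, Term):
        return f'{node.field}:"{node.text}"'
    return f"({to_lucene(node.left)} {node.op} {to_lucene(node.right)})"


def to_cql(node: Node) -> str:
    """Render the same tree in CQL style: index = "text"."""
    if isinstance(node, Term):
        return f'{node.field} = "{node.text}"'
    return f"({to_cql(node.left)} {node.op.lower()} {to_cql(node.right)})"


# The same logical query, two surface syntaxes.
query = BoolOp(
    "AND",
    Term("title", "lucene"),
    BoolOp("OR", Term("author", "hoschek"), Term("author", "elschot")),
)

print(to_lucene(query))
print(to_cql(query))
```

The tree is the same in both cases; only the rendering functions differ.
Operators whose semantics really do diverge between engines (proximity,
wildcards) would need more than a change of rendering.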

Maybe I'm being too ambitious here, but if we had an implementation of
an XQuery-FT compliant XQuery engine on top of Lucene indices, or at the
minimum if _Lucene could interpret XPath queries_ where element node labels
are equivalent to Lucene fields, we could begin thinking of exposing Lucene
sources to more sophisticated and distributed XQuery engines, thus
providing full XML support on any Lucene-based system. Unfortunately
Lucene does not support nested fields, but that is OK for now.
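
As a rough illustration of the "element label equals Lucene field" idea,
here is a minimal sketch (Python) that rewrites one very restricted XPath
form, //label[contains(., 'text')], into a Lucene query-parser string. The
function name xpath_to_lucene and the supported pattern are assumptions
made for this example only; nothing like it exists in Lucene. Multi-step
paths are rejected because there are no nested fields to map them to. Note
that the semantics are only approximate: XPath contains() is a substring
test, while the Lucene phrase matches analyzed tokens.

```python
import re

# Accepts only: //label[contains(., 'text')]  (a single step, no nesting).
_STEP = re.compile(
    r"""^//(?P<label>[A-Za-z_][\w.-]*)\[contains\(\.,\s*'(?P<text>[^']*)'\)\]$"""
)


def xpath_to_lucene(xpath: str) -> str:
    """Map a single-step XPath 'contains' test onto a Lucene field query.

    Hypothetical helper: the element label is used directly as the
    Lucene field name, and the search text becomes a quoted phrase.
    """
    match = _STEP.match(xpath.strip())
    if match is None:
        raise ValueError(f"unsupported XPath for this sketch: {xpath!r}")
    # Escape backslashes and double quotes so the phrase stays well-formed.
    text = match.group("text").replace("\\", "\\\\").replace('"', '\\"')
    return f'{match.group("label")}:"{text}"'


print(xpath_to_lucene("//title[contains(., 'lucene')]"))
```

A path such as //book/title[contains(., 'lucene')] raises ValueError here,
which is exactly the nested-field limitation mentioned above.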

-- Joaquin

>
> On Dec 17, 2005, at 5:43 PM, JOAQUIN.DELGADO@ORACLE.COM wrote:
>
>> Paul and Wolfgang,
>>
>> Thank you very much for your input. I think there are two distinct
>> problems that have emerged from this thread:
>> 1) The ability to create efficient structures to index and query XML
>> documents (elements, attributes and corresponding values) with a
>> full-text query language and operators. After all, XML is text. As
>> Paul pointed out, people have already tried this with Lucene.
>> 2) The need for a standard query language like XQuery, aiming at
>> system interoperability in the now XMLized world, that has the same
>> effect that SQL had in the relational world.
>>
>> While I can see how in the SQL case extension functions can be used
>> to implement full-text capabilities, in the XML case full-text is
>> required to query and retrieve XML (sub-document) elements and
>> attributes based on the free text (natural language) values AND
>> also to query the strings that represent the structure itself. For
>> example, in simple SQL queries the names of the tables and columns
>> need to be known to project the corresponding values and are not part
>> of the search conditions (in WHERE clauses only values corresponding
>> to tables/columns are evaluated).
>>
>> In XQuery both the structure and the content are searchable, thus
>> requiring full-text operators. That is why XQuery Full-Text requires
>> the unification and standardization of both the XQuery and Full-Text
>> "languages". Needless to say, the implementation will differ
>> from system to system.
>>
>> I do agree though that the abstraction of full-text capabilities
>> through functional extensions is a great first step. Check out
>> Oracle's XML Query Service
>> (http://www.oracle.com/technology/tech/xml/xds/index.html and
>> http://www.oracle.com/technology/oramag/oracle/05-mar/o25xml.html),
>> a Java-based XQuery engine that has abstracted "data sources" such as
>> Web Services, RDBMS, etc. as functions that, while returning XML, can
>> receive parameters and supply full-text capabilities. If Mark's
>> implementation of Lucene query and output in XML comes to fruition, a
>> Lucene data source will become yet another stream of XML that can be
>> queried, processed and rendered by the mid-tier XQuery engine.
>>
>> -- Joaquin
>>
>>
>>
>> While maintaining my bookmarks I ran into this:
>> "Case Study: Enabling Low-Cost XML-Aware Searching
>> Capable of Complex Querying":
>> http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html
>>
>> Some loose thoughts:
>>
>> In the system described there, a Lucene document is used for each
>> low-level XML construct, even when it contains very few characters
>> of text.
>> The resulting Lucene indexes are at least 2.5 times the size of the
>> original document, which is not a surprise given this document
>> structure.
>> Normal index size is about one third of the indexed text.
>>
>> I don't know about the XQuery standard, but I was wondering
>> whether this unusual document structure and the non-straightforward
>> fit between Lucene queries and XQuery queries are related.
>>
>> As for the joins and iterations over items from the stream of XML
>> results: iteration over matching XML constructs should be no problem
>> in Lucene. Joins in Lucene are normally done via boolean filters,
>> so I was wondering how XQuery joins fit these.
>> The case study above has a note at the end of par. 5.3:
>> "The Search Result list that comes back could then be organized
>> by document id to group together all the results for a single XML
>> document. This is not provided by default, but has been done with
>> extension to this code."
>>
>> Regards,
>> Paul Elschot
>>
>> On Friday 16 December 2005 03:45, Wolfgang Hoschek wrote:
>>
>>> I think implementing an XQuery Full-Text engine is far beyond the
>>> scope of Lucene.
>>>
>>> Implementing a building block for the fulltext aspect of it would be
>>> more manageable. Unfortunately, the W3C fulltext drafts
>>> indiscriminately mix and mingle two completely different languages
>>> into a single language, without clear boundaries. That's why most
>>> practical folks implement XQuery fulltext search via extension
>>> functions rather than within XQuery itself. This also allows for much
>>> more detailed tokenization, configuration and extensibility than what
>>> would be possible with the W3C draft.
>>>
>>> Wolfgang.
>>>
>>> On Dec 15, 2005, at 4:20 PM, JOAQUIN.DELGADO@ORACLE.COM wrote:
>>>
>>>
>>>> Mark,
>>>>
>>>> This is very cool. When I was at TripleHop we did something very
>>>> similar where both query and results conformed to an XML Schema and
>>>> we used XML over HTTP as our main vehicle to do remote/federated
>>>> searches with quick rendering with stylesheets.
>>>>
>>>> That however is the first piece of the puzzle. If you really want
>>>> to go beyond search (in the traditional sense) and be able to
>>>> perform more complex operations such as joins and iterations over
>>>> items from the stream of XML results you are getting, you should
>>>> consider implementing an XQuery Full-Text engine with Lucene,
>>>> adopting the now standard XQuery language.
>>>>
>>>> Here is the pointer to the W3C working draft
>>>> on XQuery 1.0 and XPath 2.0 Full-Text:
>>>> http://www.w3.org/TR/xquery-full-text/
>>>>
>>>> Now I'm part of the task force editing this draft, so your comments
>>>> are very much welcome.
>>>>
>>>> -- J.D.
>>>>
>>>>
>>>> http://www.inperspective.com/lucene/LXQueryV0_1.zip
>>>>
>>>> I've implemented just a few queries (Boolean, Term, FilteredQuery,
>>>> BoostingQuery ...) but other queries are fairly trivial to add.
>>>> At this stage I am more interested in feedback on parser
>>>> design/approach rather than trying to achieve complete coverage of
>>>> all the Lucene Query types or debating the choice of tag names.
>>>>
>>>> Please see the readme.txt in the package for more details.
>>>>
>>>> Cheers
>>>> Mark
>>>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-dev-help@lucene.apache.org
>>
>>
