lucene-solr-dev mailing list archives

From "J. Delgado" <jdelg...@lendingclub.com>
Subject Re: Progressive Query Relaxation
Date Fri, 11 May 2007 18:14:54 GMT
Hoss,

I never got to acknowledge your analysis. Well done. I do want to hear your
opinion about the following posting I sent to the list, which aims at
looking at the analogy between search engines and relational/XML databases
as they evolve toward a single type of retrieval system:

The ever-growing presence of mingled structured and unstructured data is a
fact of life in the modern systems we have to deal with. Clearly, the tendency
is that full-text indexing is moving towards DB functionality, i.e.
<attribute,value> fields for projection/filtering, sorting, faceted queries,
transactional CRUD operations etc. Though set manipulation is not Lucene's
or Solr's forte, the document-object model maps very well to rows of
relational sets or tables, even more so since CLOBs and TEXT fields were
introduced.

On the other hand, relational databases with XML and OO extensions and
native XML repositories still have to deal with the problem of RANKING
unstructured text and combinations of text fragments and structured
conditions, thus dealing no longer just with a set/relational model that
yields binary answers but extending their query languages to handle the
concepts of fuzziness, relevance, etc. (e.g. SQL/MM, XQuery-FullText).

I would like once again to open this can of worms, and perhaps think out of
the box, without classifying DB and Full-Text as simply different, as we
analyze concepts to further understand the real path for evolution of
Lucene/Solr.

Here is a very interesting attempt to create a special type of "index"
called Domain Index to query unstructured data within Oracle by Marcelo
Ochoa:
https://issues.apache.org/jira/browse/LUCENE-724

Other interesting articles:

XQuery 1.0 - Full-Text:
http://www.w3.org/TR/xquery-full-text/
SQL/MM Full-Text
http://www.wiscorp.com/2CD1R1-02-fulltext-2001-12.pdf

Discussions on *XML data model vs. relational model*
http://www.xml.com/cs/user/view/cs_msg/2645

http://www.w3.org/TR/xpath-datamodel/
http://en.wikipedia.org/wiki/Relational_model


-- J.D.
2007/4/10, Chris Hostetter <hossman_lucene@fucit.org>:
>
>
> : Agreed, but best match is not ONLY about keywords. Here is where the
> : system developer can provide extra intelligence by doing query
> : re-writing.
>
> I finally got a chance to read through the URL (disclaimer: i do not have
> "a basic working knowledge of Oracle Text, such as the operators used in
> query expressions.")
>
> At its core what is being described here can easily be done with a custom
> request handler that takes in a multivalue "q" param, and executes them in
> order until it finds some matches ... careful math when dealing with start/rows
> and the number of results from each query makes it easy to ensure that you
> can seamlessly return results from any/all queries in the order described
> (although you'd have to do something funky with the raw score values if
> you actually wanted to return them to the client)
>
> In general though, I agree with Walter ... this seems like a very naive
> approach.  At a very low conceptual level, the DisMaxRequestHandler does
> what the early counter example in the link talks about...
>
> >>  select book_id from books
> >>      where contains (author, '(michel crichton) OR (?michel ?crichton)
> >>      OR (michel OR crichton) OR (?michel OR ?crichton)
>
> the problem is that the two criticisms of this approach (which may be valid
> in Oracle text matching) don't really apply in Solr/Lucene...
>
> >>   1. From the user's point of view, hits which are a poor match will be
> >> mixed in with hits which are a good match. The user wants to see good
> >> matches displayed first.
>
> "poor" hits won't score as high as "good" hits -- boost
> values can be assigned for the various pieces of the DisMax query so that
> exact phrase matches can be weighted better than individual word matches,
> coordFactors will ensure that docs only matching a few words don't score
> as well as docs matching all of the words, etc...
>
> >>   2. From the system's point of view, the search is inefficient. Even if
> >> there were plenty of hits for exactly "Michel Crichton", it would still
> >> have to do all the work of the fuzzy expansions and fetch data for all the
> >> rows which satisfy the query.
>
> My problem with this claim is the assumption that once you find lots of
> hits for "Michel Crichton" you don't need to keep looking for "Michel" or
> "Crichton" ... by this logic, many docs that contain the exact phrase
> "Michel Crichton" (and are roughly the same length) will get the same
> score, and the query will stop there ... the benefit of looking for
> *everything* as a single query, is that the scores can become more fine
> grained -- docs with 1 exact match that *also* contain things like "Mr
> Crichton" several dozen times will score higher than docs with just that
> one exact match (consider an article about "Michel Crichton" in which his
> full name appears only once vs an article listing popular authors, in
> which "Michel Crichton" appears exactly once)
>
> : Why do you say this? The rank is still provided by the search engine
> : BASED ON THE QUERY submitted and it does consider natural language
> : text. It's just leaving the order of execution in the hands of the
> : developer who knows better what the system should return for some
> : specific cases.
>
> evaluating each of the query parts in isolation and then aggregating the
> results doesn't take into account the *cumulative* value of the parts ...
> it's like averaging the ages of people in each city, then averaging those
> averages for each state and calling that the average age per state -- it's
> a much less accurate representation of reality than averaging the ages of
> everyone in the state all at once.
>
>
>
> -Hoss
>
>
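
The "careful math when dealing with start/rows" that Hoss mentions above can
be sketched roughly as follows. This is only an illustration in Python, not
Solr code: the search function signature, the toy in-memory index and the
names relaxed_page and make_search are all assumptions made for the example.
It also assumes the relaxed queries return disjoint result sets, so a document
matched by a stricter query is not filtered out again later, and it ignores
scores entirely.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical backend signature: (query, start, rows) -> (total_hits, page_of_docs)
SearchFn = Callable[[str, int, int], Tuple[int, List[str]]]


def relaxed_page(search: SearchFn, queries: Sequence[str],
                 start: int, rows: int) -> List[str]:
    """Return results [start, start + rows) of the concatenation of each
    query's result list, taking the queries in the order given."""
    page: List[str] = []
    offset = start  # results still to skip before the requested page begins
    for q in queries:
        if len(page) >= rows:
            break
        total, docs = search(q, offset, rows - len(page))
        if offset >= total:
            # The requested page starts beyond this query's hits: skip them all.
            offset -= total
            continue
        page.extend(docs)
        offset = 0  # later queries are read from their first hit
    return page


def make_search(index: Dict[str, List[str]]) -> SearchFn:
    """Toy stand-in for a search engine: each query maps to a fixed hit list."""
    def search(q: str, start: int, rows: int) -> Tuple[int, List[str]]:
        hits = index.get(q, [])
        return len(hits), hits[start:start + rows]
    return search


if __name__ == "__main__":
    search = make_search({
        "exact phrase": ["d1", "d2", "d3"],
        "all words": ["d4", "d5"],
        "any word": ["d6"],
    })
    order = ["exact phrase", "all words", "any word"]
    print(relaxed_page(search, order, start=2, rows=3))
```

Note that each query is executed only if the page is not yet full, which is
the efficiency argument made for progressive relaxation; the cost is that
scores from different queries are not comparable, as Hoss points out.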

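Hoss's closing analogy (averaging per-city averages versus averaging everyone
at once) is easy to check with a few numbers. The ages and city names below
are made up purely for illustration.

```python
# Made-up data: one small city with older residents, one larger city with
# younger residents, both in the same hypothetical state.
ages_by_city = {
    "small_town": [70, 80],
    "big_city": [20, 30, 25, 25, 30, 20],
}


def average(values):
    values = list(values)
    return sum(values) / len(values)


# Average of the per-city averages: every city counts equally.
avg_of_city_avgs = average(average(ages) for ages in ages_by_city.values())

# Average over everyone in the state: every person counts equally.
avg_of_everyone = average(age for ages in ages_by_city.values() for age in ages)

print(avg_of_city_avgs)  # 50.0
print(avg_of_everyone)   # 37.5
```

The two small-town residents pull the first figure up as much as the six
big-city residents pull it down, which is the kind of distortion Hoss
attributes to scoring each query part in isolation and then aggregating.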