[jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".
Date Fri, 16 Jan 2009 17:39:02 GMT


Laurent Hoss commented on SOLR-380:

Hi Tricia
Looks nice, I've been searching for such a feature in Lucene (and Solr) for years!
But before getting too excited, I'd better ask the right questions before doing a real
test, as we don't even use Solr yet (though I really want to :)

In fact, we currently have a home-grown solution for a similar problem:
In our case we want to restrict boolean searches to the paragraphs or sentences of a document,
and implemented this (like many others) by indexing extra docs for paragraphs etc. (duplicating
many metadata fields of the parent document).
Besides multiplying the index size, mapping the found paragraphs back to their base documents
involved a lot of custom coding, and only recently did we implement a fast count
of the base docs for the found paragraph docs, using a 'baseDocId' FieldCache (essentially
a 'group by' in SQL lingo).
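
To make the 'group by' idea concrete, here is a minimal sketch in plain Python (not Lucene code). It assumes the field cache can be pictured as an array indexed by internal doc id that holds each paragraph doc's parent id; the function name and the sample data are hypothetical.

```python
from typing import Iterable, Sequence


def count_base_docs(hit_doc_ids: Iterable[int], base_doc_id: Sequence[int]) -> int:
    """Count the distinct parent documents among the matched paragraph docs.

    base_doc_id[i] is the parent id of paragraph doc i (a stand-in for a
    'baseDocId' field cache; the array layout is an assumption).
    """
    return len({base_doc_id[doc] for doc in hit_doc_ids})


# Hypothetical index: paragraph docs 0-2 belong to base doc 10,
# docs 3-4 to base doc 11, and doc 5 to base doc 12.
base_doc_id = [10, 10, 10, 11, 11, 12]
hits = [0, 2, 3]  # paragraph docs matched by some query
print(count_base_docs(hits, base_doc_id))  # 2
```

The count costs one array lookup per hit, which is why it stays fast even for large result sets.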

This leads to the following requirements and questions:
* What is the performance of your PayloadComponent compared to the standard SearchHandler?
We especially need very fast count(*) functionality to dynamically compute statistics/charts
that need hundreds of queries.
For this we just need the hit count of documents/paragraphs, without the XPath payload info,
which would generate a really big XML response for a result set of 100K docs!

* We want to find only documents where a (boolean) query matches within one of the paragraph_*
fields, and not if the query matches over the combined content of multiple paragraphs, as
discussed here:*-4-only-solution-(for-par-sen-and-case-sensitivity)-td13684315.html#a13685041
> The problem is that a search for sentence:foo AND sentence:bar is matching if foo matches
in any sentence of the paragraph, and bar also matches in any sentence of the paragraph. 
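
The difference between the two matching behaviours can be illustrated with a small sketch in plain Python (not Lucene or Solr code). It assumes whitespace tokenization and lowercase query terms; all function names and the sample document are hypothetical.

```python
from typing import List


def matches_all(terms: List[str], text: str) -> bool:
    """True if every (lowercase) term occurs as a whole word in the text."""
    words = set(text.lower().split())
    return all(term in words for term in terms)


def match_combined(terms: List[str], paragraphs: List[str]) -> bool:
    """AND query evaluated over the concatenation of all paragraphs."""
    return matches_all(terms, " ".join(paragraphs))


def match_within_one(terms: List[str], paragraphs: List[str]) -> bool:
    """AND query that must be satisfied inside a single paragraph."""
    return any(matches_all(terms, p) for p in paragraphs)


doc = ["foo appears here", "bar appears there"]
print(match_combined(["foo", "bar"], doc))    # True: the unwanted match
print(match_within_one(["foo", "bar"], doc))  # False: the behaviour we want
```
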

Do you think this is a good option for us?
PS: We should probably put up a wiki page on this topic, since I've seen at least 10 people
asking about possible solutions.. ok, maybe often with slightly different requirements!

A completely different way of solving this would be to use the SpanQuery package together with the
nice-looking Qsol, although I'm not sure about its performance, especially
with (really) long boolean queries!

