lucene-solr-dev mailing list archives

From "Peter Binkley (JIRA)" <>
Subject [jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".
Date Tue, 16 Oct 2007 16:44:50 GMT


Peter Binkley commented on SOLR-380:

The problem with the page-as-SolrDocument approach is that you then have to group the pages
back under their container documents to present a unified result to the user (like this:
). I want the primary unit of granularity in search results to be the book, and the pages
to be only a secondary layer. I also want to be able to do proximity searches that bridge
page boundaries, have relevance ranking consider the whole book text and not just that page,
etc.: i.e. treat the text as continuous for searching purposes. So I gain a lot by treating
the book as the SolrDocument; I just need that extra bit of work to resolve the page positions
to have it all.

> There's no way to convert search results into page-level hits of a "structured document".
> -----------------------------------------------------------------------------------------
>                 Key: SOLR-380
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: search
>            Reporter: Tricia Williams
>            Priority: Minor
> "Paged-Text" FieldType for Solr
> A chance to dig into the guts of Solr. The problem: If we index a monograph in Solr,
> there's no way to convert search results into page-level hits. The solution: have a "paged-text"
> fieldtype which keeps track of page divisions as it indexes, and reports page-level hits in
> the search results.
> The input would contain page milestones: <page id="234"/>. As Solr processed the
> tokens (using its standard tokenizers and filters), it would concurrently build a structural
> map of the item, indicating which term position marked the beginning of which page:
> <page id="234" firstterm="14324"/>. This map would be stored in an unindexed field in some efficient
> At search time, Solr would retrieve term positions for all hits that are returned in
> the current request, and use the stored map to determine page ids for each term position.
> The results would imitate the results for highlighting, something like:
> <lst name="pages">
>         <lst name="doc1">
>                 <int name="pageid">234</int>
>                 <int name="pageid">236</int>
>         </lst>
>         <lst name="doc2">
>                 <int name="pageid">19</int>
>         </lst>
> </lst>
> <lst name="hitpos">
>         <lst name="doc1">
>                 <lst name="234">
>                         <int name="pos">14325</int>
>                 </lst>
>         </lst>
>         ...
> </lst>
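
The page-map idea described in the issue can be sketched outside Solr. The following is only an illustration, not the proposed FieldType: it assumes whitespace tokenization in place of Solr's analyzers, 0-based term positions, and a sorted list of (first term position, page id) pairs as the stored map. The names build_page_map, page_for_position, and page_hits are hypothetical and are not part of any Solr API.

```python
import re
from bisect import bisect_right

# Page milestones as described in the issue: <page id="234"/>
PAGE_MARK = re.compile(r'<page id="([^"]+)"\s*/>')


def build_page_map(text):
    """Tokenize text and record the first term position of each page.

    Returns (tokens, page_map), where page_map is a list of
    (first_term_position, page_id) pairs in document order.
    Whitespace splitting is an assumption standing in for real analysis.
    """
    tokens = []
    page_map = []
    # re.split with one capturing group alternates: text, id, text, id, ...
    for i, part in enumerate(PAGE_MARK.split(text)):
        if i % 2 == 1:
            # A milestone: the next token emitted starts this page.
            page_map.append((len(tokens), part))
        else:
            tokens.extend(part.split())
    return tokens, page_map


def page_for_position(page_map, position):
    """Resolve a term position to the id of the page that contains it."""
    starts = [start for start, _ in page_map]
    idx = bisect_right(starts, position) - 1
    if idx < 0:
        return None  # position precedes the first milestone
    return page_map[idx][1]


def page_hits(tokens, page_map, term):
    """Group the positions where term occurs by page id."""
    hits = {}
    for pos, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            hits.setdefault(page_for_position(page_map, pos), []).append(pos)
    return hits


if __name__ == "__main__":
    sample = ('<page id="1"/>the quick fox '
              '<page id="2"/>jumps over the lazy dog '
              '<page id="3"/>the end')
    tokens, page_map = build_page_map(sample)
    print(page_map)
    print(page_hits(tokens, page_map, "the"))
```

Because the map is sorted by first term position, each lookup is a binary search, so resolving the hits in one result page stays cheap even for a long book. The book remains the unit that is searched and ranked, and the page ids are derived only afterwards from the term positions.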

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
