lucene-solr-user mailing list archives

From "Erick Erickson" <erickerick...@gmail.com>
Subject Re: [jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".
Date Sat, 20 Oct 2007 17:47:37 GMT
As it happens, I put together an app that indexes books and
wrestled with this issue. NOTE: this was NOT Solr, just straight
Lucene, but I think my experience may apply. Also, I haven't
followed this entire thread, so I may be irrelevant...

First, the question of hits spanning across pages certainly was an
issue. Bumping the position increment gap between pages was
unacceptable for that reason. But a related issue was chapters,
where the increment gap *was* required, since it didn't make sense
for us to have hits that spanned chapters (or any other "major
division" you want).
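
For what it's worth, the chapter case looks roughly like this in code. This is
a rough, untested sketch against a recent Lucene API, not the code we actually
ran; the class name, tokenizer choice and gap value are just illustrative. Each
chapter is added as a separate value of the same field, and the large gap
between values keeps phrase/span queries with any reasonable slop from matching
across chapters:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;

// Sketch only: one field holds every chapter as a separate value; the gap
// returned below is inserted between values, so a query slop smaller than
// the gap cannot match across chapter boundaries.
public class ChapterGapAnalyzer extends Analyzer {

  private static final int CHAPTER_GAP = 10000; // larger than any slop we expect to use

  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new WhitespaceTokenizer(); // stand-in; substitute your real chain
    return new TokenStreamComponents(source);
  }

  @Override
  public int getPositionIncrementGap(String fieldName) {
    return CHAPTER_GAP; // applied between successive values of the same field
  }
}

At index time you would then add the field once per chapter, e.g.
doc.add(new TextField("text", chapterText, Field.Store.NO)) in a loop.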

But we also had to figure out which page the hit *was* on. So I
added a token at the end of each page, with a position increment of
0 from the last term on that page, as my page marker. Then I
could take the position information from span queries and figure
out which page a hit was on. In our application, this was
important because what we return to the user is the image
of the book page. The text searched is actually OCR, so
the relation between the indexed data and the page image is...
well... not 100%....
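
In case it helps, the marker injection looked conceptually like the following.
Again, a rough, untested sketch against a recent Lucene API rather than the
code we actually ran; it assumes the page texts are joined into one string with
a sentinel word after the last word of each page, and a tokenizer that passes
that sentinel through intact (e.g. whitespace tokenization):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Sketch only: when the sentinel page marker comes through, give it a position
// increment of 0 so it shares a position with the last real term on that page.
public final class PageMarkerFilter extends TokenFilter {

  public static final String PAGE_MARKER = "_endofpage_"; // made-up sentinel

  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

  public PageMarkerFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (PAGE_MARKER.contentEquals(termAtt)) {
      posIncAtt.setPositionIncrement(0); // same position as the preceding term
    }
    return true;
  }
}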

We indexed all the page text for the book in one field as above.
One could also imagine indexing a field in the book with all the
page and offset information for faster access.
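
The position-to-page lookup itself is then just a binary search over the
marker positions. Another rough sketch with made-up names; the marker
positions could come from the index via the postings for the marker term, or
from a stored field written at index time:

import java.util.Arrays;

// Sketch only: markerPositions holds the term positions of the page-marker
// tokens for one book, ascending; marker i closes page i + 1.
public final class PageLocator {

  private final int[] markerPositions;

  public PageLocator(int[] markerPositions) {
    this.markerPositions = markerPositions;
  }

  // Returns the 1-based page number containing the given term position
  // (e.g. the start position of a span match).
  public int pageOf(int matchPosition) {
    int idx = Arrays.binarySearch(markerPositions, matchPosition);
    if (idx < 0) {
      idx = -idx - 1; // index of the first marker at or after the match
    }
    return idx + 1;
  }
}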

I have no idea how payloads figure in here, I've been off on
another project since these were introduced....

How to allow the query to NOT span the pages seems do-able
with the special end-of-page token, but we didn't have this
requirement so I haven't thought about it much....

Best
Erick (not to be confused with the eminent author of LIA)

On 10/19/07, Erik Hatcher <erik@ehatchersolutions.com> wrote:
>
>
> On Oct 18, 2007, at 11:53 AM, Binkley, Peter wrote:
> > I think the requirements I mentioned in a comment
> > (https://issues.apache.org/jira/browse/SOLR-380#action_12535296)
> > justify
> > abandoning the one-page-per-document approach. The increment-gap
> > approach would break the cross-page searching, and would involve about
> > as much work as the stored map, since the gap would have to vary
> > depending on the number of terms on each page, wouldn't it? (if there
> > are 100 terms on page one, the gap has to be 900 to get page two to
> > start at 1000 - or can you specify the absolute position you want
> > for a
> > term?).
>
> Yeah, one Solr document per page is not sufficient for this purpose.
>
> As for the position increment gap and querying across page boundaries, I
> still think having all the text in a single field is necessary, but with
> pages somehow separated such that a query can control whether it
> spans pages or not.  This could be accomplished trivially with a
> position increment gap.  The gap needed only depends on the slop factor
> you need for phrase queries, not on the number of tokens per page.
> With "quick fox"~10, for example, a default gap of 100 would
> prevent that query from matching across page boundaries.   I haven't
> thought this through thoroughly, so more thinking is needed here.
>
> > I think the problem of indexing books (or any text with arbitrary
> > subdivisions) is common enough that a generic approach like this would
> > be useful to more people than just me, and justifies some enhancements
> > within Solr to make the solution easy to reuse; but maybe when we've
> > figured out the best approach it will become clear how much of it is
> > worth packing into Solr.
>
> Most definitely this would be a VERY useful addition to Solr.  I know
> of several folks that are working with XTF (which uses a custom
> version of Lucene and other interesting data structures) to achieve
> this capability, but blending that sort of thing into Solr would make
> life a lot better for these projects.
>
> > (and just to clarify roles: Tricia's the one who'll actually be coding
> > this, if it's feasible; I'm just helping to think out requirements and
> > approaches based on a project in hand.)
>
> There is more to consider here.  Lucene now supports "payloads",
> additional metadata on terms that can be leveraged with custom
> queries.  I've not yet tinkered with them myself, but my
> understanding is that they would be useful (and in fact designed in
> part) for representing structured documents.  It would behoove us to
> investigate how payloads might be leveraged for your needs here, such
> that a single field could represent an entire document, with payloads
> representing the hierarchical structure.  This will require
> specialized Analyzer and Query subclasses be created to take
> advantage of payloads.  The Lucene community itself is just now
> starting to exploit this new feature, so there isn't a lot out there
> on it yet, but I think it holds great promise for these purposes.
>
>         Erik
>
>
