lucene-java-user mailing list archives

From Paul Hill <>
Subject Fast way to get the start of document
Date Fri, 22 Jun 2012 19:23:41 GMT
Our hit highlighting (using the older Highlighter) is wired with a "too huge" limit so that we
can skip the multi-million-character files. This goes beyond highlighter.setMaxDocCharsToAnalyze:
if a document is really above the too-huge limit, we don't
even try, and just produce a fragment from the front of the document.  This results in almost
reasonable response times, even for result sets of crazy huge documents (or ones with
just 1 huge doc). I think this is all pretty normal.  Tell me if I'm wrong.
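
For readers following along, the policy described above can be sketched in plain Java. This is only an illustration, not the code from our application: the class name BlurbPolicy, the helper names, and the limit value are assumptions made up for the example, and it deliberately ignores how the text is obtained from Lucene.

```java
final class BlurbPolicy {
    // Hypothetical threshold; the actual value used in the application is not stated.
    static final int EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT = 1000000;

    // True when the document is small enough to hand to the highlighter.
    static boolean shouldHighlight(int fullTextLength) {
        return fullTextLength <= EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT;
    }

    // Returns at most maxLen characters from the front of text,
    // backing up to the preceding whitespace so a word is not split.
    static String frontFragment(String text, int maxLen) {
        if (text.length() <= maxLen) {
            return text.trim();
        }
        int cut = maxLen;
        while (cut > 0 && !Character.isWhitespace(text.charAt(cut))) {
            cut--;
        }
        if (cut == 0) {
            cut = maxLen; // no whitespace found: fall back to a hard cut
        }
        return text.substring(0, cut).trim();
    }
}
```

Note that frontFragment still takes the whole text as a String, so it shows the policy only and does nothing about the cost of loading the field.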

Given the above, while timing what was going on, I realized that I was reading in the entire
body of the text in the skip highlighting case just to grab the 1st 100 or so characters.
I was doing

String text = fieldable.stringValue(); // Oh my!

Is there a way to _not_ read the whole multi-million characters in and only _start_ reading
the contents of a large field?  See code below which got me no better results.
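
To make concrete what I mean by only starting to read: if the text were available as a java.io.Reader from some source (an assumption; the only access shown here is stringValue(), which hands back the whole String), a bounded read would look roughly like the sketch below. PrefixReader and readPrefix are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.io.Reader;

final class PrefixReader {
    // Reads at most maxChars characters from the reader and stops,
    // without consuming the rest of the stream.
    static String readPrefix(Reader reader, int maxChars) throws IOException {
        char[] buf = new char[maxChars];
        int total = 0;
        while (total < maxChars) {
            int n = reader.read(buf, total, maxChars - total);
            if (n < 0) {
                break; // end of stream reached before maxChars
            }
            total += n;
        }
        return new String(buf, 0, total);
    }
}
```

For example, readPrefix(new java.io.StringReader(someText), 100) would return the first 100 characters of someText, or all of it if it is shorter.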
Some details:

1. Using Lucene 3.4.
2. Storing the (Tika) parsed text of documents.
   a. These are human-produced documents (PDF, Word, etc.), often 10K characters, sometimes
      100Ks, but very occasionally a few million.
3. At this time, we store positions, but not offsets.
4. We are using the old Highlighter, not the FastVectorHighlighter (because of #3 above).
5. A basic search result is a page of 10 documents, each with a short "blurb" (one fragment that
   shows a good hit).

I would be willing to live with a token stream to generate the intro blurb, but using the following
code in the too-large code path (forget the highlighting) can add 0.5 seconds (compared
to not reading anything, which is not a solution, just a comparison).
So here is my code:
        Fieldable textFld = doc.getFieldable(TEXT);
        if ( fullTextLength <= EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT ) {
            blurb = highlightBlurb(scoreDoc, document, textFld, workingBlurbLen);
        } else {
            logger.debug("----------- didn't call highlighter textLength = " + fullTextLength);
            TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader, scoreDoc.doc,
                    TEXT, document, analyzer);
            OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
            CharTermAttribute charTerm = tokenStream.addAttribute(CharTermAttribute.class);
            StringBuilder blurbB = new StringBuilder();
            // Append each term plus a separating space until the blurb is long enough.
            while (tokenStream.incrementToken() && blurbB.length() < workingBlurbLen) {
                blurbB.append(charTerm).append(" ");
            }
            blurb = blurbB.toString();
        }
What could I do in the else branch that is faster?  Is not having offsets affecting this code path?
While you're answering the above, I will be running some stats to suggest to management why
we SHOULD store offsets, so we can use FastVectorHighlighter,
but I'm afraid I might still want the too-huge-to-highlight path.

