lucene-java-user mailing list archives

From Mike Sokolov <soko...@ifactory.com>
Subject Re: Fast way to get the start of document
Date Sun, 24 Jun 2012 02:16:06 GMT
I got the sense from Paul's post that he wanted a solution that didn't 
require changing his index, although I'm not sure there is one.  Paul, if 
you're willing to re-index, you could also store the length of the text 
as a numeric field, retrieve that, and use it to drive the decision about 
whether to highlight.
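
A minimal sketch of that decision, in plain Java with no Lucene dependency. 
It assumes the length was stored at index time and comes back as a string 
when the stored field is read. The class name, the helper names, and the 
1,000,000 value of the limit are hypothetical; only the constant's name is 
taken from Paul's code. In Lucene 3.x the value could presumably be stored 
with a NumericField, but that wiring is left out here.

```java
final class HighlightDecision {
    // Assumed value; the thread only calls this the "too huge" limit.
    static final int EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT = 1_000_000;

    /** Parses a stored length value; returns -1 when missing or malformed. */
    static int parseStoredLength(String stored) {
        if (stored == null) {
            return -1;
        }
        try {
            return Integer.parseInt(stored.trim());
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** True when the document is small enough to hand to the highlighter. */
    static boolean shouldHighlight(String storedLength, int limit) {
        int length = parseStoredLength(storedLength);
        // Unknown length: be conservative and skip highlighting.
        return length >= 0 && length <= limit;
    }

    public static void main(String[] args) {
        System.out.println(shouldHighlight("12000", EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT));
        System.out.println(shouldHighlight("3500000", EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT));
    }
}
```

The point of the design is that only the small length field has to be 
loaded before deciding, so the multi-million character body is never read 
on the skip path.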

-Mike Sokolov

On 6/23/2012 6:17 PM, Jack Krupansky wrote:
> Simply have two fields, "full_body" and "limited_body". The former 
> would index but not store the full document text from Tika (the 
> "content" metadata.) The latter would store but not necessarily index 
> the first 10K or so characters of the full text. Do searches on the 
> full body field and highlighting on the limited body field.
>
> -- Jack Krupansky
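
A rough sketch of how the text for the "limited_body" field might be 
derived at index time, in plain Java with no Lucene dependency. The class 
and method names are hypothetical, and cutting at a whitespace boundary is 
an assumption on my part, not something Jack specified; adding the result 
to the document as a stored field is left out.

```java
final class LimitedBody {
    /**
     * Returns at most maxChars characters from the front of fullText,
     * backing up to the previous whitespace so the text does not end
     * mid-word when a boundary is available.
     */
    static String limitedBody(String fullText, int maxChars) {
        if (fullText == null || maxChars <= 0) {
            return "";
        }
        if (fullText.length() <= maxChars) {
            return fullText;
        }
        int end = maxChars;
        if (!Character.isWhitespace(fullText.charAt(end))) {
            // The cut would land inside a word: look backwards for whitespace.
            int i = end - 1;
            while (i > 0 && !Character.isWhitespace(fullText.charAt(i))) {
                i--;
            }
            if (i > 0) {
                end = i;
            } else if (Character.isHighSurrogate(fullText.charAt(end - 1))) {
                // No boundary found: at least avoid splitting a surrogate pair.
                end--;
            }
        }
        return fullText.substring(0, end);
    }

    public static void main(String[] args) {
        System.out.println(limitedBody("hello world foo", 8));
    }
}
```

With something like this, limitedBody(text, 10000) would give the stored 
field and the untruncated text would go to "full_body", indexed only.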
>
> -----Original Message----- From: Paul Hill
> Sent: Friday, June 22, 2012 2:23 PM
> To: java-user@lucene.apache.org
> Subject: Fast way to get the start of document
>
> Our hit highlighting (using the older Highlighter) is wired with a 
> "too huge" limit so that we can skip the multi-million character files. 
> The limit is used not just for highlighter.setMaxDocCharsToAnalyze: if a 
> document is really above the too-huge limit, we don't even try, and just 
> produce a fragment from the front of the document.  This results in an 
> almost reasonable response time, even for result sets of crazy huge 
> documents (or ones with just 1 huge doc). I think this is all pretty 
> normal.  Tell me if I'm wrong.
>
> Given the above, while timing what was going on, I realized that I was 
> reading in the entire body of the text in the skip highlighting case 
> just to grab the 1st 100 or so characters.
> I was doing
>
> String text = fieldable.stringValue(); // Oh my!
>
> Is there a way to _not_ read the whole multi-million characters in and 
> only read the _start_ of the contents of a large field?  See the code 
> below, which got me no better results.
> Some details:
>
> 1.      Using Lucene 3.4
>
> 2.      Storing the (Tika) parse text of documents
>
> a.      These are human-produced documents (PDF, Word, etc.), often 10K 
> characters, sometimes 100Ks, but very occasionally a few million.
>
> 3.      At this time, we store positions, but not offsets.
>
> 4.      We are using the old Highlighter, not the 
> FastVectorHighlighter (because of #3 above).
>
> 5.      A basic search result is a page of 10 documents with short 
> "blurb" (one fragment that shows a good hit).
>
> I would be willing to live with a token stream to generate the intro 
> blurb, but the following code, when under the too-large code path 
> (forget the highlighting), can add .5 seconds (compared to not reading 
> anything, which is not a solution, just a comparison).
> So here is my code.
>        Fieldable textFld = doc.getFieldable(TEXT);
>        if ( fullTextLength <= EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT ) {
>            blurb = highlightBlurb(scoreDoc, document, textFld, workingBlurbLen);
>        } else {
>            logger.debug("----------- didn't call highlighter textLength = " + fullTextLength);
>            TokenStream tokenStream = TokenSources.getAnyTokenStream(
>                    indexReader, scoreDoc.doc, TEXT, document, analyzer);
>            OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
>            CharTermAttribute charTerm = tokenStream.addAttribute(CharTermAttribute.class);
>            StringBuilder blurbB = new StringBuilder("");
>            while (tokenStream.incrementToken() && blurbB.length() < workingBlurbLen) {
>                blurbB.append(charTerm.toString());
>                blurbB.append(" ");
>            }
>            blurb = blurbB.toString();
>        }
> What could I do in the else that is faster?  Is not having offsets 
> affecting this code path?
> While you're answering the above, I will be running some stats to 
> suggest to management why we SHOULD store offsets, so we can use 
> FastVectorHighlighter,
> but I'm afraid I might still want the too-huge-to-highlight path.
>
> -Paul
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

