lucene-java-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Fast way to get the start of document
Date Sat, 23 Jun 2012 22:17:03 GMT
Simply have two fields, "full_body" and "limited_body". The former would 
index but not store the full document text from Tika (the "content" 
metadata.) The latter would store but not necessarily index the first 10K or 
so characters of the full text. Do searches on the full body field and 
highlighting on the limited body field.
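
A minimal sketch of how the "limited_body" value might be derived from the full Tika text before indexing. The class and method names are hypothetical, and the 10K limit is passed in as a parameter. The snippet uses only the JDK; wiring the two values into Lucene fields (in 3.x, presumably Field.Store.NO with Field.Index.ANALYZED for "full_body", and Field.Store.YES with Field.Index.NO for "limited_body") is left out and should be checked against the version in use.

```java
/** Hypothetical helper: builds the text for a stored-only "limited_body" field. */
public final class LimitedBody {
    private LimitedBody() {}

    /**
     * Returns at most maxChars characters from the start of fullText.
     * If the cut would land inside a word, it is moved back to the
     * preceding whitespace (when there is one) so the blurb ends cleanly.
     */
    public static String limitedBody(String fullText, int maxChars) {
        if (fullText == null || maxChars <= 0) {
            return "";
        }
        if (fullText.length() <= maxChars) {
            return fullText;
        }
        int cut = maxChars;
        // A non-whitespace character right after the cut means we are mid-word.
        if (!Character.isWhitespace(fullText.charAt(cut))) {
            for (int i = cut - 1; i > 0; i--) {
                if (Character.isWhitespace(fullText.charAt(i))) {
                    cut = i;
                    break;
                }
            }
        }
        // Avoid splitting a surrogate pair at the cut point.
        if (cut > 0 && Character.isHighSurrogate(fullText.charAt(cut - 1))) {
            cut--;
        }
        // trim() drops whitespace at both ends of the kept prefix.
        return fullText.substring(0, cut).trim();
    }
}
```

At index time the full text would go to the indexed field and limitedBody(fullText, 10000) to the stored field, so fetching a blurb at search time never has to load more than the stored prefix.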

-- Jack Krupansky

-----Original Message----- 
From: Paul Hill
Sent: Friday, June 22, 2012 2:23 PM
To: java-user@lucene.apache.org
Subject: Fast way to get the start of document

Our hit highlighting (using the older Highlighter) is wired with a "too 
huge" limit so that we can skip the multi-million character files. This is 
not just highlighter.setMaxDocCharsToAnalyze: if a document is really above 
the too-huge limit, we don't even try, and just produce a fragment from the 
front of the document.  This results in almost reasonable response times, 
even for result sets of crazy huge documents (or ones with just 1 huge doc). 
I think this is all pretty normal.  Tell me if I'm wrong.

Given the above, while timing what was going on, I realized that I was 
reading in the entire body of the text in the skip highlighting case just to 
grab the 1st 100 or so characters.
I was doing

String text = fieldable.stringValue(); // Oh my!

Is there a way to _not_ read the whole multi-million characters in and only 
_start_ reading the contents of a large field?  See code below which got me 
no better results.
Some details

1.      Using Lucene 3.4

2.      Storing the (Tika) parse text of documents

a.      These are human-produced documents (PDF, Word, etc.), often 10K 
characters, sometimes 100Ks, but very occasionally a few million.

3.      At this time, we store positions, but not offsets.

4.      We are using the old Highlighter, not the FastVectorHighlighter 
(because of #3 above).

5.      A basic search result is a page of 10 documents with short "blurb" 
(one fragment that shows a good hit).

I would be willing to live with a token stream to generate the intro blurb, 
but using the following code on the too-large code path (forget the 
highlighting) can add 0.5 seconds (compared to not reading anything, which 
is not a solution, just a comparison).
So here is my code.
        Fieldable textFld = doc.getFieldable(TEXT);
        if ( fullTextLength <= EXTRA_LARGE_DOC_HIGHLIGHT_LIMIT ) {
            blurb = highlightBlurb(scoreDoc, document, textFld, workingBlurbLen);
        } else {
            logger.debug("----------- didn't call highlighter textLength = " + fullTextLength);
            TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader, scoreDoc.doc, TEXT, document, analyzer);
            OffsetAttribute offset = tokenStream.addAttribute(OffsetAttribute.class);
            CharTermAttribute charTerm = tokenStream.addAttribute(CharTermAttribute.class);
            StringBuilder blurbB = new StringBuilder("");
            while (tokenStream.incrementToken() && blurbB.length() < workingBlurbLen) {
                blurbB.append(charTerm.toString());
                blurbB.append(" ");
            }
            blurb = blurbB.toString();
        }
What could I do in the else that is faster?  Is not having offsets affecting 
this code path?
While you're answering the above, I will be running some stats to suggest to 
management why we SHOULD store offsets, so we can use FastVectorHighlighter,
but I'm afraid I might still want the too-huge-to-highlight path.

-Paul 

