From: Erik Hatcher
Subject: Re: pdf and highlighting
Date: Thu, 8 Dec 2005 15:07:38 -0500
Message-Id: <182A9813-DA9E-45DB-9306-B3CFE9210DCD@ehatchersolutions.com>
In-Reply-To: <20051208155057.F3EE69ED04@mail-in-07.arcor-online.net>
Reply-To: java-user@lucene.apache.org
To:
java-user@lucene.apache.org

On Dec 8, 2005, at 10:51 AM, Sonja Löhr wrote:

> Thank you both, I found it
> (I really asked a bit too early, sorry)
>
> The highlighter works correctly if I use my custom Analyzer during indexing
> (and for QueryParser), BUT
> when preparing the TokenStream to feed the highlighter, I must NOT use it.
>
> TokenStream tStream = new GermanAnalyzer().tokenStream("body", new StringReader(bodyText));
> System.out.println(highlighter.getBestFragments(tStream, bodyText, 4, " ..... "));
>
> works, whereas
>
> TokenStream tStream = new GermanHtmlAnalyzer().tokenStream("body", new StringReader(bodyText));
> System.out.println(highlighter.getBestFragments(tStream, bodyText, 4, " ..... "));
>
> gives rubbish highlighting.
>
> GermanHtmlAnalyzer feeds a normal GermanAnalyzer with a shortened String
> (native characters) if the input contains decimal or html entities, but then
> I'm totally confused why there is a problem with pdf text and not with HTML
> text...

The likely reason is that the token offsets fed to the highlighter
don't match the character positions in the text you're highlighting.
You're generating token offsets for a string in which the entities
have been replaced (so its length likely differs), but highlighting
the original text with the entities left intact. Maybe??

	Erik
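
To make the suspected offset mismatch concrete, here is a minimal sketch in plain Java. It does not use Lucene at all. The class name, the decodeDecimalEntities and mark helpers, and the sample string are hypothetical stand-ins: the first plays the role of the entity replacement described for GermanHtmlAnalyzer, the second plays the role of a highlighter applying token offsets to a text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OffsetMismatchDemo {

    // Hypothetical stand-in for the entity replacement done before analysis:
    // turns decimal entities such as &#252; into the single native character.
    public static String decodeDecimalEntities(String text) {
        Matcher m = Pattern.compile("&#(\\d+);").matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Hypothetical stand-in for a highlighter: wraps the span [start, end)
    // of the given text in bold tags, trusting the offsets it is handed.
    public static String mark(String text, int start, int end) {
        return text.substring(0, start)
                + "<B>" + text.substring(start, end) + "</B>"
                + text.substring(end);
    }

    public static void main(String[] args) {
        String original = "M&#252;ller sucht Lucene";
        // The six-character entity collapses to one character,
        // so everything after it moves five positions to the left.
        String decoded = decodeDecimalEntities(original);

        // Offsets of the term "Lucene", measured in the decoded text.
        int start = decoded.indexOf("Lucene");
        int end = start + "Lucene".length();

        // Same offsets applied to the text they were measured in: correct.
        System.out.println(mark(decoded, start, end));
        // Same offsets applied to the original text: the tags land too early.
        System.out.println(mark(original, start, end));
    }
}
```

In this sketch the second call marks "ucht L" instead of "Lucene", because the offsets were measured in the shortened string. If that is what is happening here, one possible way out is to hand the highlighter the same string the offsets were computed from, for example by decoding the entities once up front and passing that decoded text both to the analyzer and to getBestFragments.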