jackrabbit-dev mailing list archives

From "Alex Parvulescu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content
Date Mon, 19 Sep 2011 17:25:09 GMT

    [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108009#comment-13108009 ]

Alex Parvulescu commented on JCR-3075:
--------------------------------------

Hmm, it appears that LUCENE-2458 [0] is related to this
(a single word made up of 3 characters is treated as a phrase of 3 words of 1 character
each).
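
To make the effect concrete, here is a toy sketch in plain Java. It does not use Lucene or
Jackrabbit: splitPerChar and highlight are hypothetical helpers that only mimic per-character
tokenization and excerpt highlighting, so treat it as an illustration of the symptom, not of
the actual code path.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PerCharHighlightSketch {

    // Toy tokenizer (an assumption, NOT Lucene): one token per character,
    // mimicking a 3-character word being seen as 3 one-character words.
    static List<String> splitPerChar(String word) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            tokens.add(String.valueOf(word.charAt(i)));
        }
        return tokens;
    }

    // Toy highlighter (an assumption, NOT Jackrabbit's excerpt provider):
    // wraps each matched token in its own <strong> element.
    static String highlight(String text, List<String> tokens) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            String hit = null;
            for (String t : tokens) {
                if (!t.isEmpty() && text.startsWith(t, i)) {
                    hit = t;
                    break;
                }
            }
            if (hit != null) {
                out.append("<strong>").append(hit).append("</strong>");
                i += hit.length();
            } else {
                out.append(text.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String jTest = "\u30c6\u30b9\u30c8"; // same search term as in the test case
        String text = "and " + jTest + " ('test').";

        // Whole word as one token: a single <strong> around the full word.
        System.out.println(highlight(text, Collections.singletonList(jTest)));

        // One token per character: three separate <strong> elements.
        System.out.println(highlight(text, splitPerChar(jTest)));
    }
}
```

The first call gives the kind of excerpt the test case expects; the second shows the
single-character highlighting reported in the issue.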

I'm not sure what the best way forward is. Maybe upgrading Lucene to 3.1?
I don't think it's a drop-in replacement, though: I tried it and got some errors on
repository startup.

This raises the question of which Lucene version JR is using, which seems to be well
behind. We should probably update more often, but that is a separate discussion.


[0] https://issues.apache.org/jira/browse/LUCENE-2458


> incorrect HTML excerpt generation for queries on japanese text content 
> -----------------------------------------------------------------------
>
>                 Key: JCR-3075
>                 URL: https://issues.apache.org/jira/browse/JCR-3075
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>          Components: jackrabbit-core
>            Reporter: Julian Reschke
>            Priority: Minor
>
> The generated excerpt highlights single characters instead of full words. Test case (to be added to FullTextQueryTest):
>      public void testJapaneseAndHighlight() throws RepositoryException {
>         // http://translate.google.com/#auto|en|%E3%82%B3%E3%83%B3%E3%83%86%E3%83%B3%E3%83%88
>         String jContent = "\u30b3\u30f3\u30c6\u30f3\u30c8";
>         // http://translate.google.com/#auto|en|%E3%83%86%E3%82%B9%E3%83%88
>         String jTest = "\u30c6\u30b9\u30c8";
>         
>         String content = "some text with japanese: " + jContent
>                 + " ('content')" + " and " + jTest + " ('test').";
>         // expected excerpt; note this may change if excerpt providers change
>         String expectedExcerpt = "<div><span>some text with japanese: " + jContent
>                 + " ('content') and <strong>" + jTest
>                 + "</strong> ('test').</span></div>";
>         
>         Node n = testRootNode.addNode("node1");
>         n.setProperty("title", content);
>         testRootNode.getSession().save();
>         
>         String xpath = "/jcr:root" + testRoot + "/element(*, nt:unstructured)"
>                 + "[jcr:contains(., '" + jTest + "')]/rep:excerpt(.)";
>         Query q = superuser.getWorkspace().getQueryManager()
>                 .createQuery(xpath, Query.XPATH);
>         
>         QueryResult qr = q.execute();
>         RowIterator it = qr.getRows();
>         int cnt = 0;
>         while (it.hasNext()) {
>             cnt++;
>             Row found = it.nextRow();
>             assertEquals(n.getPath(), found.getPath());
>             String excerpt = found.getValue("rep:excerpt(.)").getString();
>             assertEquals(expectedExcerpt, excerpt);
>         }
>         
>         assertEquals(1, cnt);
>     }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
