jackrabbit-dev mailing list archives

From "Alex Parvulescu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (JCR-3075) incorrect HTML excerpt generation for queries on japanese text content
Date Wed, 21 Sep 2011 12:13:08 GMT

     [ https://issues.apache.org/jira/browse/JCR-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu updated JCR-3075:
---------------------------------

    Attachment: JCR-3075.patch

This problem also affects excerpt generation for any quoted phrase search, not just Japanese.
Normally a quoted phrase should be treated as a single item when building the excerpt.

Because of LUCENE-2458, a normal search in Japanese now turns into the equivalent of a quoted
phrase search in, say, English.
Since the excerpt generator has trouble dealing with phrases, any Japanese search
gets each character of the search token highlighted separately, instead of a single highlight
containing the whole word.
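
To illustrate the difference between per-character and whole-word highlighting, here is a minimal, self-contained sketch. It is not the code from JCR-3075.patch and does not use the Jackrabbit or Lucene APIs; the class name PhraseHighlightSketch, the Span type and both helper methods are hypothetical names made up for this example. It also only merges spans that touch or overlap, which fits the Japanese case; a quoted English phrase would additionally need to bridge the whitespace between terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Jackrabbit or of the attached patch.
public class PhraseHighlightSketch {

    /** Half-open character range [start, end) of one matched term. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Merges spans that touch or overlap. The input must be sorted by start offset. */
    static List<Span> mergeAdjacent(List<Span> sorted) {
        List<Span> merged = new ArrayList<Span>();
        for (Span s : sorted) {
            int lastIdx = merged.size() - 1;
            if (lastIdx >= 0 && merged.get(lastIdx).end >= s.start) {
                // Extend the previous span instead of starting a new highlight.
                Span last = merged.remove(lastIdx);
                merged.add(new Span(last.start, Math.max(last.end, s.end)));
            } else {
                merged.add(s);
            }
        }
        return merged;
    }

    /** Wraps every span in strong tags. Spans must be sorted and non-overlapping. */
    static String highlight(String text, List<Span> spans) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (Span s : spans) {
            sb.append(text, pos, s.start);
            sb.append("<strong>").append(text, s.start, s.end).append("</strong>");
            pos = s.end;
        }
        sb.append(text, pos, text.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same three katakana characters as jTest in the test case quoted below.
        String text = "and \u30c6\u30b9\u30c8 ('test')";
        // One span per character, as if every character were a separate term.
        List<Span> perChar = Arrays.asList(new Span(4, 5), new Span(5, 6), new Span(6, 7));

        System.out.println(highlight(text, perChar));                // three separate highlights
        System.out.println(highlight(text, mergeAdjacent(perChar))); // one highlight for the word
    }
}
```

The first call wraps each of the three characters in its own strong element, which is the behaviour reported in this issue. The second call produces a single strong element around the whole word, which is what the expected excerpt in the test case asks for.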

The patch should fix both the original issue and highlighting for any quoted search.
The problem is that one test is failing and I'm not sure why :(

The failing test is ExcerptTest#testEncodeIllegalCharsNoHighlights, which apparently fails
because there is more info on the node returned from the search than expected.
This should not happen, as I haven't touched that part of the code (node indexing), but sadly
it does, so I still need to investigate.

I'd also welcome some feedback on the approach.

> incorrect HTML excerpt generation for queries on japanese text content 
> -----------------------------------------------------------------------
>
>                 Key: JCR-3075
>                 URL: https://issues.apache.org/jira/browse/JCR-3075
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>          Components: jackrabbit-core
>            Reporter: Julian Reschke
>            Priority: Minor
>         Attachments: JCR-3075.patch
>
>
> The generated excerpt highlights single characters instead of full words. Test case (to be added to FullTextQueryTest):
>      public void testJapaneseAndHighlight() throws RepositoryException {
>         // http://translate.google.com/#auto|en|%E3%82%B3%E3%83%B3%E3%83%86%E3%83%B3%E3%83%88
>         String jContent = "\u30b3\u30f3\u30c6\u30f3\u30c8";
>         // http://translate.google.com/#auto|en|%E3%83%86%E3%82%B9%E3%83%88
>         String jTest = "\u30c6\u30b9\u30c8";
>         
>         String content = "some text with japanese: " + jContent
>                 + " ('content')" + " and " + jTest + " ('test').";
>         // expected excerpt; note this may change if excerpt providers change
>         String expectedExcerpt = "<div><span>some text with japanese: " + jContent
>                 + " ('content') and <strong>" + jTest
>                 + "</strong> ('test').</span></div>";
>         
>         Node n = testRootNode.addNode("node1");
>         n.setProperty("title", content);
>         testRootNode.getSession().save();
>         
>         String xpath = "/jcr:root" + testRoot + "/element(*, nt:unstructured)"
>                 + "[jcr:contains(., '" + jTest + "')]/rep:excerpt(.)";
>         Query q = superuser.getWorkspace().getQueryManager()
>                 .createQuery(xpath, Query.XPATH);
>         
>         QueryResult qr = q.execute();
>         RowIterator it = qr.getRows();
>         int cnt = 0;
>         while (it.hasNext()) {
>             cnt++;
>             Row found = it.nextRow();
>             assertEquals(n.getPath(), found.getPath());
>             String excerpt = found.getValue("rep:excerpt(.)").getString();
>             assertEquals(expectedExcerpt, excerpt);
>         }
>         
>         assertEquals(1, cnt);
>     }

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
