incubator-ctakes-dev mailing list archives

From "Sean (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CTAKES-63) exception formed by malformed email address
Date Thu, 04 Oct 2012 13:11:08 GMT

    [ https://issues.apache.org/jira/browse/CTAKES-63?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13469355#comment-13469355 ]

Sean commented on CTAKES-63:
----------------------------

Pei raised the following in an email that was forwarded to me; I am including it here for
documentation purposes (my response follows):

> After some debugging, this happens when the token contains a dash (-)
> and also contains a special char such as the right bracket (]).
> I believe all of the chars in the QueryParser str token should be
> escaped to avoid issues such as a token ending with ']'.
>
> Before we add and test the proposed fix (adding an escape() call), shown
> below, I also noticed another potential issue: we search and
> replace all dashes with spaces.  I just wanted to ensure that this
> was done intentionally and works fine because the dashes have already
> been removed in the index.  Otherwise, we'll need to replace
> the dash with a '?' instead of a space, or use a PhraseQuery instead of
> a TermQuery.  It would be great if someone familiar with this bit of code could confirm.
> 
> LuceneDictionaryImpl.java (dictionary-lookup) [~Line 106]
> 
>               if (str.indexOf('-') == -1) {
>                      q = new TermQuery(new Term(iv_lookupFieldName, str));
>                      topDoc = iv_searcher.search(q, iv_maxHits);
>               }
>               else {  // needed the KeywordAnalyzer for situations where the hyphen was included in the f-word
>                      QueryParser query = new QueryParser(Version.LUCENE_30, iv_lookupFieldName, new KeywordAnalyzer());
>                      try {
>                            //topDoc = iv_searcher.search(query.parse(str.replace('-', ' ')), iv_maxHits);
>                            // proposed fix
>                            String escaped = QueryParser.escape(str.replace('-', ' '));
>                            topDoc = iv_searcher.search(query.parse(escaped), iv_maxHits);
>                      } catch (ParseException e) {
>                            // TODO Auto-generated catch block
>                            e.printStackTrace();
>                      }
>               }
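
To illustrate what the proposed escape() call is meant to achieve, here is a minimal stdlib-only sketch. It does not call Lucene: escapeQuerySyntax is a hypothetical stand-in for QueryParser.escape, the set of special characters is an assumption based on the classic Lucene 3.x query syntax, and the input token is only inferred from the 'mailto:abcoman@t nec.org]' string in the stack trace.

```java
// Stdlib-only sketch; it does not depend on Lucene.
class EscapeSketch {

    // Characters assumed to be special in the classic Lucene 3.x query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Hypothetical stand-in for QueryParser.escape(String):
    // prefixes every special character with a backslash.
    static String escapeQuerySyntax(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed original token, inferred from the reported stack trace.
        String token = "mailto:abcoman@t-nec.org]";

        // Same dash-to-space replacement as in the quoted code.
        String spaced = token.replace('-', ' ');
        System.out.println(spaced);

        // Escaping neutralizes the ':' and the trailing ']' before parsing.
        System.out.println(escapeQuerySyntax(spaced));
    }
}
```

In this sketch the unescaped string still ends in a bare ']', which is the kind of input the parser rejected; after escaping, both ':' and ']' carry a backslash. Whether the remaining space then matches what is stored in the index is the separate question raised in the quoted email.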

I was the author of the code in question above.  Prior versions of cTAKES utilized dictionary
resources that required this workaround for situations where a hyphen was contained in the
first term (f-word) being looked up.  Part of the issue was that hyphenated terms were
handled as single tokens; however, the problem had more to do with how the Lucene
dictionary was built than with the content of the dictionary.

After some experimentation, I discovered that how the field was indexed affected which parts
of the string could be queried.  Indexing the field as follows gave better results:

    document.add(new Field("first_word", s[0].trim(), Field.Store.YES,
            Field.Index.ANALYZED));

                
> exception formed by malformed email address
> -------------------------------------------
>
>                 Key: CTAKES-63
>                 URL: https://issues.apache.org/jira/browse/CTAKES-63
>             Project: cTAKES
>          Issue Type: Bug
>          Components: ctakes-dictionary-lookup
>    Affects Versions: 2.6-incubating
>         Environment: windows
>            Reporter: Chen Lin
>            Priority: Critical
>              Labels: Stability
>
> 2012-09-21 12:48:36,789 INFO  edu.mayo.bmi.uima.lookup.ae.UmlsDictionaryLookupAnnotator - process(JCas)
> org.apache.lucene.queryParser.ParseException: Cannot parse 'mailto:abcoman@t nec.org]': Lexical error at line 1, column 26.  Encountered: <EOF> after : ""
>        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:192)
>        at edu.mayo.bmi.dictionary.lucene.LuceneDictionaryImpl.getEntries(LuceneDictionaryImpl.java:106)
>        at edu.mayo.bmi.dictionary.DictionaryEngine.metaLookup(DictionaryEngine.java:181)

