lucene-dev mailing list archives

From "DM Smith (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2023) Improve performance of SmartChineseAnalyzer
Date Fri, 30 Oct 2009 19:29:59 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772050#action_12772050 ]

DM Smith commented on LUCENE-2023:
----------------------------------

Robert,
You have in BigramDictionary:
{code}
  public boolean isToExist(int to) {
    return to < tokenPairListTable.length && tokenPairListTable[to] != null;
  }
{code}
And you call it in:
{code}
  public void addSegTokenPair(SegTokenPair tokenPair) {
    final int to = tokenPair.to;
    if (!isToExist(to)) {
      ArrayList<SegTokenPair> newlist = new ArrayList<SegTokenPair>();
      newlist.add(tokenPair);
      tokenPairListTable[to] = newlist;
      tableSize++;
    } else {
      List<SegTokenPair> tokenPairList = tokenPairListTable[to];
      tokenPairList.add(tokenPair);
    }
  }
{code}

addSegTokenPair assumes that whenever isToExist(to) returns false, "to" is still in bounds.
But isToExist also returns false when "to" >= tokenPairListTable.length, and in that case
"tokenPairListTable[to] = newlist" will throw an ArrayIndexOutOfBoundsException. Is it an
invariant that tokenPair.to will always be in bounds?
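
If that invariant does not hold, one possible way to guard against it is to grow the
table before writing to it. The following is only a sketch, not the code from the patch:
the class name TokenPairTableSketch, the stripped-down SegTokenPair, the constructor and
size() are hypothetical stand-ins, and the growth policy is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the class under review; not the patch's code.
final class TokenPairTableSketch {

  // Minimal stand-in for SegTokenPair: only the "to" index matters here.
  static final class SegTokenPair {
    final int from;
    final int to;

    SegTokenPair(int from, int to) {
      this.from = from;
      this.to = to;
    }
  }

  private List<SegTokenPair>[] tokenPairListTable;
  private int tableSize;

  @SuppressWarnings("unchecked")
  TokenPairTableSketch(int initialCapacity) {
    tokenPairListTable = new List[initialCapacity];
  }

  // Same check as in the quoted code: false when out of bounds OR when empty.
  public boolean isToExist(int to) {
    return to < tokenPairListTable.length && tokenPairListTable[to] != null;
  }

  public void addSegTokenPair(SegTokenPair tokenPair) {
    final int to = tokenPair.to;
    // Assumed growth policy: at least double, and always large enough for "to".
    // A negative "to" would still throw; this only covers the upper bound.
    if (to >= tokenPairListTable.length) {
      int newLength = Math.max(to + 1, tokenPairListTable.length * 2);
      tokenPairListTable = Arrays.copyOf(tokenPairListTable, newLength);
    }
    List<SegTokenPair> tokenPairList = tokenPairListTable[to];
    if (tokenPairList == null) {
      tokenPairList = new ArrayList<SegTokenPair>();
      tokenPairListTable[to] = tokenPairList;
      tableSize++;
    }
    tokenPairList.add(tokenPair);
  }

  // Number of distinct "to" indexes that have at least one pair.
  public int size() {
    return tableSize;
  }
}
```

If tokenPair.to really is always in bounds, none of this is needed and a comment or an
assert stating the invariant would be enough.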

The array in SegGraph has the same problem.

The former implementation did not have this issue.

Other than that, it looks good.

> Improve performance of SmartChineseAnalyzer
> -------------------------------------------
>
>                 Key: LUCENE-2023
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2023
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>            Priority: Minor
>             Fix For: 3.0
>
>         Attachments: LUCENE-2023.patch
>
>
> I've noticed SmartChineseAnalyzer is a bit slow, compared to say CJKAnalyzer on chinese text.
> This patch improves the internal hhmm implementation. 
> Time to index my chinese corpus is 75% of the previous time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

