asterixdb-dev mailing list archives

From Taewoo Kim <wangs...@gmail.com>
Subject Re: [jira] [Commented] (ASTERIXDB-1208) ngram tokenizer failure with negative length
Date Thu, 03 Dec 2015 13:51:44 GMT
@Wenhai:

As a quick temporary fix, replace NGramUTF8StringBinaryTokenizer.reset()
with the following code. The general fix is to move this tokenizer to the
Asterix level, where it can recognize the NULL type tag and skip token
generation for NULL values.

    @Override
    public void reset(byte[] sentenceData, int start, int length) {
        super.reset(sentenceData, start, length);
        gramNum = 0;

        // Count the UTF-8 characters (not bytes) in the sentence.
        int numChars = 0;
        int pos = byteIndex;
        int end = pos + sentenceUtf8Length;
        while (pos < end) {
            numChars++;
            pos += UTF8StringUtil.charSize(sentenceData, pos);
        }

        if (usePrePost) {
            totalGrams = numChars + gramLength - 1;
        } else {
            // Compare the character count, not the byte length, against
            // gramLength so that totalGrams can never become negative.
            if (numChars >= gramLength) {
                totalGrams = numChars - gramLength + 1;
            } else {
                totalGrams = 0;
            }
        }
    }
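
To illustrate the gram-count arithmetic outside of Hyracks, here is a minimal
standalone sketch. The class GramCountSketch and its methods charSize,
countChars and totalGrams are hypothetical names made up for this example,
not the actual tokenizer API. It also assumes plain UTF-8 input without the
length prefix that the real tokenizer skips over.

```java
import java.nio.charset.StandardCharsets;

public class GramCountSketch {

    // Byte size of a UTF-8 character, derived from its lead byte.
    // Hypothetical stand-in for the character-size lookup used in reset().
    static int charSize(byte lead) {
        int v = lead & 0xFF;
        if (v < 0x80) return 1;
        if (v < 0xE0) return 2;
        if (v < 0xF0) return 3;
        return 4;
    }

    // Count characters (not bytes) by hopping from lead byte to lead byte.
    static int countChars(byte[] utf8) {
        int numChars = 0;
        int pos = 0;
        while (pos < utf8.length) {
            numChars++;
            pos += charSize(utf8[pos]);
        }
        return numChars;
    }

    // Number of n-grams for a string; never negative.
    static int totalGrams(String s, int gramLength, boolean usePrePost) {
        int numChars = countChars(s.getBytes(StandardCharsets.UTF_8));
        if (usePrePost) {
            // Pre/post padding adds gramLength - 1 extra grams.
            return numChars + gramLength - 1;
        }
        // A string shorter than gramLength yields no grams.
        return Math.max(0, numChars - gramLength + 1);
    }

    public static void main(String[] args) {
        System.out.println(totalGrams("", 3, false));         // 0
        System.out.println(totalGrams("a", 3, false));        // 0
        System.out.println(totalGrams("NC\u00B9", 3, false)); // 1
        System.out.println(totalGrams("abcde", 3, false));    // 3
        System.out.println(totalGrams("abcde", 3, true));     // 7
    }
}
```

The empty-string and single-character cases are the ones that matter here:
without a guard on the character count, numChars - gramLength + 1 goes
negative for any input shorter than the gram length.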

Best,
Taewoo

On Tue, Dec 1, 2015 at 7:37 PM, Taewoo Kim (JIRA) <jira@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/ASTERIXDB-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035204#comment-15035204
> ]
>
> Taewoo Kim commented on ASTERIXDB-1208:
> ---------------------------------------
>
> This error happens because the current tokenizer always assumes that it
> sees a UTF8 string. In this case, it sees a NULL value. We need to add
> logic to bypass token generation when a NULL value is provided.
>
> > ngram tokenizer failure with negative length
> > --------------------------------------------
> >
> >                 Key: ASTERIXDB-1208
> >                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1208
> >             Project: Apache AsterixDB
> >          Issue Type: Bug
> >          Components: Hyracks Core
> >            Reporter: Wenhai
> >            Assignee: Taewoo Kim
> >
> > drop dataverse test if exists;
> > create dataverse test;
> > use dataverse test;
> > create type DBLPOpenType as open {
> >   id: int64,
> >   dblpid: string,
> >   authors: string,
> >   misc: string
> > }
> > create dataset DBLPOpen(DBLPOpenType) primary key id;
> > insert into dataset DBLPOpen { "id": 93,
> >   "dblpid": "journals/iandc/IbarraJCR91",
> >   "authors": "Some Classes of Languages in NC¹",
> >   "misc": "2006-04-25 86-106 Inf. Comput. January 1991 90 1 db/journals/iandc/iandc90.html#IbarraJCR91" }
> > use dataverse test;
> > set import-private-functions 'true'
> > for $d in dataset DBLPOpen
> > where similarity-jaccard(gram-tokens("",3,false),gram-tokens($d.title,3,false)) >= 0.5
> > return {"rec": $d}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
