asterixdb-dev mailing list archives

From Taewoo Kim <wangs...@gmail.com>
Subject Re: [jira] [Commented] (ASTERIXDB-1208) ngram tokenizer failure with negative length
Date Thu, 03 Dec 2015 19:04:06 GMT
Yes. All we need to do is change one method. However, this is a temporary
fix. I will investigate further once I have more time.

On Thu, Dec 3, 2015 at 08:34 Chen Li <chenli@gmail.com> wrote:

> Thanks, Taewoo.  Do you think it's easier to apply these changes
> directly to Wenhai's "fuzzy branch"?
>
>
> On Thu, Dec 3, 2015 at 5:51 AM, Taewoo Kim <wangsaeu@gmail.com> wrote:
> > @Wenhai:
> >
> > Replace NGramUTF8StringBinaryTokenizer.reset() with the following code as a
> > quick temporary fix. The general fix needs to move this tokenizer to the
> > Asterix level so that it can recognize the NULL type tag and skip the
> > token generation process.
> >
> >     @Override
> >     public void reset(byte[] sentenceData, int start, int length) {
> >         super.reset(sentenceData, start, length);
> >         gramNum = 0;
> >
> >         int numChars = 0;
> >         int pos = byteIndex;
> >         int end = pos + sentenceUtf8Length;
> >         while (pos < end) {
> >             numChars++;
> >             pos += UTF8StringUtil.charSize(sentenceData, pos);
> >         }
> >
> >         if (usePrePost) {
> >             totalGrams = numChars + gramLength - 1;
> >         } else {
> >             if (length >= gramLength) {
> >                 totalGrams = numChars - gramLength + 1;
> >             } else {
> >                 totalGrams = 0;
> >             }
> >         }
> >     }
> >
> > Best,
> > Taewoo
> >
> > On Tue, Dec 1, 2015 at 7:37 PM, Taewoo Kim (JIRA) <jira@apache.org> wrote:
> >
> >>
> >>     [ https://issues.apache.org/jira/browse/ASTERIXDB-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15035204#comment-15035204 ]
> >>
> >> Taewoo Kim commented on ASTERIXDB-1208:
> >> ---------------------------------------
> >>
> >> This error happens because the current tokenizer always assumes that it
> >> sees a UTF8 string. In this case, it sees a NULL value. We need to add
> >> logic to bypass token generation when a NULL value is provided.
> >>
> >> > ngram tokenizer failure with negative length
> >> > --------------------------------------------
> >> >
> >> >                 Key: ASTERIXDB-1208
> >> >                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-1208
> >> >             Project: Apache AsterixDB
> >> >          Issue Type: Bug
> >> >          Components: Hyracks Core
> >> >            Reporter: Wenhai
> >> >            Assignee: Taewoo Kim
> >> >
> >> > drop dataverse test if exists;
> >> > create dataverse test;
> >> > use dataverse test;
> >> > create type DBLPOpenType as open {
> >> >   id: int64,
> >> >   dblpid: string,
> >> >   authors: string,
> >> >   misc: string
> >> > }
> >> > create dataset DBLPOpen(DBLPOpenType) primary key id;
> >> > insert into dataset DBLPOpen { "id": 93, "dblpid": "journals/iandc/IbarraJCR91", "authors": "Some Classes of Languages in NC¹", "misc": "2006-04-25 86-106 Inf. Comput. January 1991 90 1 db/journals/iandc/iandc90.html#IbarraJCR91" }
> >> > use dataverse test;
> >> > set import-private-functions 'true'
> >> > for $d in dataset DBLPOpen
> >> > where similarity-jaccard(gram-tokens("",3,false),gram-tokens($d.title,3,false)) >= 0.5
> >> > return {"rec": $d}
> >>
> >>
> >>
> >> --
> >> This message was sent by Atlassian JIRA
> >> (v6.3.4#6332)
> >>
>
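
The gram-count arithmetic in the quoted reset() method can be tried outside Hyracks with a small standalone sketch. It is not the tokenizer's actual code: the class GramCountSketch and its methods charSize, countChars, and totalGrams are hypothetical names made up for illustration, and the sketch counts code points in a plain, well-formed UTF-8 byte array rather than in the length-prefixed string the real tokenizer reads. It also guards on the character count instead of the byte length, which is an assumption about the intent of the temporary fix, so that the non-pre/post gram count never drops below zero.

```java
import java.nio.charset.StandardCharsets;

public class GramCountSketch {

    // Size in bytes of the UTF-8 sequence whose leading byte is b.
    // Assumes well-formed input; continuation bytes are never passed in.
    static int charSize(byte b) {
        int v = b & 0xFF;
        if (v < 0x80) {
            return 1; // 0xxxxxxx
        }
        if ((v & 0xE0) == 0xC0) {
            return 2; // 110xxxxx
        }
        if ((v & 0xF0) == 0xE0) {
            return 3; // 1110xxxx
        }
        return 4;     // 11110xxx
    }

    // Counts characters (code points) in data[start, start + byteLength).
    static int countChars(byte[] data, int start, int byteLength) {
        int numChars = 0;
        int pos = start;
        int end = start + byteLength;
        while (pos < end) {
            numChars++;
            pos += charSize(data[pos]);
        }
        return numChars;
    }

    // Number of n-grams for the given UTF-8 bytes.
    // With pre/post padding: numChars + gramLength - 1 (same formula as the quoted code).
    // Without padding: numChars - gramLength + 1, clamped at zero (assumption of this sketch).
    static int totalGrams(byte[] data, int gramLength, boolean usePrePost) {
        int numChars = countChars(data, 0, data.length);
        if (usePrePost) {
            return numChars + gramLength - 1;
        }
        return Math.max(0, numChars - gramLength + 1);
    }

    public static void main(String[] args) {
        String[] samples = { "", "a", "abc", "abcd", "NC\u00B9" };
        for (String s : samples) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println("\"" + s + "\" -> " + totalGrams(utf8, 3, false));
        }
    }
}
```

With a gram length of 3 and no pre/post padding, the empty string and a one-character string both produce 0 grams in this sketch instead of a negative count. It does not address the NULL case described in the JIRA comment, which needs a type-tag check before tokenization.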
