asterixdb-dev mailing list archives

From Taewoo Kim <wangs...@gmail.com>
Subject Re: [jira] [Created] (ASTERIXDB-1208) ngram tokenizer failure with negative length
Date Wed, 02 Dec 2015 02:49:22 GMT
Sure. I think we need to reach consensus on the following cases. What
is the expected output for each one? That is, how should the tokenizer handle
these situations?

Record: { "id": 93, "dblpid": "journals/iandc/IbarraJCR91", "authors":
"Some Classes of Languages in NC¹", "misc": "2006-04-25 86-106 Inf. Comput.
January 1991 90 1 db/journals/iandc/iandc90.html#IbarraJCR91" }
#1. gram-tokens("",3,false): here the input is an empty string.
#2. gram-tokens($d.title,3,false): here the input is a field (title) that
does not exist in this record.
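
To make the two cases concrete, here is a rough Python sketch. It is not the
Hyracks tokenizer: the function name gram_tokens, the '#' and '$' padding
characters, and the chosen return values are all assumptions for illustration.
It shows one possible convention (empty list for case #1, null propagation for
case #2) and why an unguarded gram count goes negative when the input is
shorter than the gram length.

```python
from typing import List, Optional


def gram_tokens(text: Optional[str], n: int, pre_post: bool) -> Optional[List[str]]:
    """Hypothetical stand-in for gram-tokens; NOT the AsterixDB implementation."""
    if text is None:
        # Case #2: the field is missing. One option (an assumption here)
        # is to propagate null instead of tokenizing.
        return None
    if pre_post:
        # Assumed padding characters; used only when pre/post padding is on.
        text = "#" * (n - 1) + text + "$" * (n - 1)
    # Number of n-grams in a string of this length. For "" with n = 3 and
    # no padding this is 0 - 3 + 1 = -2, i.e. a negative count.
    count = len(text) - n + 1
    if count <= 0:
        # Case #1: guard against the negative count and return no tokens.
        return []
    return [text[i:i + n] for i in range(count)]


print(gram_tokens("", 3, False))      # case #1: empty string
print(gram_tokens(None, 3, False))    # case #2: missing field
print(gram_tokens("abcd", 3, False))  # ordinary input
```

Whichever convention we pick, similarity-jaccard also has to define its result
when one or both token lists are empty or null.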

Best,
Taewoo

On Tue, Dec 1, 2015 at 4:25 PM, Chen Li <chenli@gmail.com> wrote:

> @Taewoo: can you help?
>
> On Tue, Dec 1, 2015 at 2:26 PM, Wenhai (JIRA) <jira@apache.org> wrote:
> > Wenhai created ASTERIXDB-1208:
> > ---------------------------------
> >
> >              Summary: ngram tokenizer failure with negative length
> >                  Key: ASTERIXDB-1208
> >                  URL: https://issues.apache.org/jira/browse/ASTERIXDB-1208
> >              Project: Apache AsterixDB
> >           Issue Type: Bug
> >           Components: Hyracks Core
> >             Reporter: Wenhai
> >
> >
> > drop dataverse test if exists;
> > create dataverse test;
> > use dataverse test;
> > create type DBLPOpenType as open {
> >   id: int64,
> >   dblpid: string,
> >   authors: string,
> >   misc: string
> > }
> > create dataset DBLPOpen(DBLPOpenType) primary key id;
> > insert into dataset DBLPOpen { "id": 93, "dblpid": "journals/iandc/IbarraJCR91", "authors": "Some Classes of Languages in NC¹", "misc": "2006-04-25 86-106 Inf. Comput. January 1991 90 1 db/journals/iandc/iandc90.html#IbarraJCR91" }
> >
> > use dataverse test;
> > set import-private-functions 'true'
> > for $d in dataset DBLPOpen
> > where similarity-jaccard(gram-tokens("",3,false),gram-tokens($d.title,3,false)) >= 0.5
> > return {"rec": $d}
> >
> >
> >
> > --
> > This message was sent by Atlassian JIRA
> > (v6.3.4#6332)
>
