lucy-user mailing list archives

From Hao Wu <echowu...@gmail.com>
Subject Re: [lucy-user] Chinese support?
Date Tue, 21 Feb 2017 00:12:37 GMT
I still have a problem when I try to update the index using the custom analyzer.

If I comment out the
   truncate => 1

line and rerun, I get the following error.


'body' assigned conflicting FieldType
        LUCY_Schema_Spec_Field_IMP at cfcore/Lucy/Plan/Schema.c line 124
        at /home/hwu/perl5/lib/perl5/x86_64-linux-gnu-thread-multi/Lucy.pm
line 118.
        Lucy::Index::Indexer::new('Lucy::Index::Indexer', 'index',
'/home/hwu/data/lucy/mitbbs.index', 'schema',
'Lucy::Plan::Schema=SCALAR(0x211c758)', 'create', 1) called at
mitbbs_index.pl line 26
*** Error in `perl': corrupted double-linked list: 0x00000000021113a0 ***

If I switch the analyzer to Lucy::Analysis::StandardTokenizer, it works
fine and a new seg_2 is created:

my $tokenizer = Lucy::Analysis::StandardTokenizer->new;
my $raw_type = Lucy::Plan::FullTextType->new(
        analyzer => $tokenizer,
);

So I guess I must be missing something in the custom ChineseAnalyzer.
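One thing I'm wondering: does a custom Analyzer need to override equals() so
that the analyzer loaded back from the index's stored schema compares equal to
a freshly constructed one? My reading of the Lucy::Analysis::Analyzer docs
suggests a skeleton roughly like this (untested sketch; the actual segmentation
logic is elided, and the equals() override is only my guess at the cause of the
"conflicting FieldType" error):

```perl
package ChineseAnalyzer;
use strict;
use warnings;
use base qw( Lucy::Analysis::Analyzer );

sub transform {
    my ( $self, $inversion ) = @_;
    # ... split the incoming tokens into Chinese words and
    #     return the modified inversion ...
    return $inversion;
}

# Guess: without this, the analyzer deserialized from the stored
# schema may not compare equal to the one built in the script,
# which could trigger "assigned conflicting FieldType".
sub equals {
    my ( $self, $other ) = @_;
    return ref($other) eq ref($self);
}

1;
```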



------------------my script--------------------

#!/usr/local/bin/perl
# TODO: update existing docs instead of recreating the index every time
use strict;
use warnings;

use DBI;
use File::Spec::Functions qw( catfile );

use Lucy::Plan::Schema;
use Lucy::Plan::FullTextType;
use Lucy::Index::Indexer;

use ChineseAnalyzer;

my $path_to_index = '/home/hwu/data/lucy/mitbbs.index';

# Create Schema.
my $schema = Lucy::Plan::Schema->new;

my $chinese = ChineseAnalyzer->new();

my $raw_type = Lucy::Plan::FullTextType->new(
        analyzer => $chinese,
);

$schema->spec_field( name => 'body',  type => $raw_type);

# Create an Indexer object.
my $indexer = Lucy::Index::Indexer->new(
    index    => $path_to_index,
    schema   => $schema,
    create   => 1,
    truncate => 1,
);

my $driver   = "SQLite";
my $database = "/home/hwu/data/mitbbs.db";
my $dsn = "DBI:$driver:dbname=$database";
my $dbh = DBI->connect( $dsn, "", "", { RaiseError => 1 } )
    or die $DBI::errstr;


my $stmt = qq(SELECT id, text from post where id >= 100 and id < 200;);
#my $stmt = qq(SELECT id, text from post where id < 100;);
my $sth = $dbh->prepare( $stmt );
my $rv = $sth->execute() or die $DBI::errstr;

while(my @row = $sth->fetchrow_array()) {
      print "id = ". $row[0] . "\n";
      print $row[1];
      my $doc = { body => $row[1] };
      $indexer->add_doc($doc);
}

$indexer->commit;

print "Finished.\n";

On Sat, Feb 18, 2017 at 6:46 AM, Nick Wellnhofer <wellnhofer@aevum.de>
wrote:

> On 18/02/2017 07:22, Hao Wu wrote:
>
>> Thanks. Get it work.
>>
>
> Lucy's StandardTokenizer breaks up the text at the word boundaries defined
> in Unicode Standard Annex #29. Then we treat every Alphabetic character
> that doesn't have a Word_Break property as a single term. These are
> characters that match \p{Ideographic}, \p{Script: Hiragana}, or
> \p{Line_Break: Complex_Context}. This should work for Chinese but as Peter
> mentioned, we don't support n-grams.
>
> If you're using QueryParser, you're likely to run into problems, though.
> QueryParser will turn a sequence of Chinese characters into a PhraseQuery
> which is obviously wrong. A quick hack is to insert a space after every
> Chinese character before passing a query string to QueryParser:
>
>     $query_string =~ s/\p{Ideographic}/$& /g;
>
> Nick
>
>
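Following up on the QueryParser hack quoted above: the substitution can be
tried standalone in plain Perl, without Lucy (the sample string below is just
an arbitrary illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my $query_string = "搜索引擎 perl";

# Insert a space after every ideographic character so QueryParser
# treats each Han character as its own term instead of one phrase.
$query_string =~ s/\p{Ideographic}/$& /g;

print "$query_string\n";    # prints "搜 索 引 擎  perl" (note the double space)
```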
