From: Lewis John Mcgibbney
To: user@mahout.apache.org
Subject: Re: Replacement for DefaultAnalyzer
Date: Mon, 11 May 2015 14:45:01 -0700

Hi Suneel,

Just for context, I've implemented the following.

    @Override
    protected void map(Text key, BehemothDocument value, Context context)
            throws IOException, InterruptedException {
        String sContent = value.getText();
        if (sContent == null) {
            // no text available? skip
            context.getCounter("LuceneTokenizer", "BehemothDocWithoutText")
                    .increment(1);
            return;
        }
        analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
        TokenStream ts = analyzer.tokenStream(key.toString(),
                new StringReader(sContent.toString()));
        // The Analyzer class will construct the Tokenizer, TokenFilter(s), and
        // CharFilter(s), and pass the resulting Reader to the Tokenizer.
        @SuppressWarnings("unused")
        OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        StringTuple document = new StringTuple();
        try {
            ts.reset(); // Resets this stream to the beginning. (Required)
            while (ts.incrementToken()) {
                if (termAtt.length() > 0) {
                    document.add(new String(termAtt.buffer(), 0, termAtt.length()));
                }
            }
            ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
        } finally {
            ts.close(); // Release resources associated with this stream.
        }
        context.write(key, document);
    }

I'll be testing and will update if anything else comes up.
Thanks
Lewis

On Mon, May 11, 2015 at 2:12 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> I found Mike's blog post regarding Lucene 4.X from a while ago [0].
> In the 'Other Changes' section Mike states "Analyzers must always
> provide a reusable token stream, by implementing the
> Analyzer.createComponents method (reusableTokenStream has been removed
> and tokenStream is now final, in Analyzer)."
> This provides a good bit more context, so I'm going to continue down the
> createComponents route with the aim of implementing the newer 4.X Lucene
> API.
> In the meantime, if you have any updates or a code sample it would be
> very much appreciated.
> Thanks
> Lewis
>
> [0]
> http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
>
> On Mon, May 11, 2015 at 2:03 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Hi Suneel,
>>
>> On Sat, May 9, 2015 at 11:21 AM, Suneel Marthi wrote:
>>
>>> Mahout 0.9 and 0.10.0 are using Lucene 4.6.1. There's been a change in
>>> the TokenStream workflow in Lucene post-Lucene 4.5.
>>
>> Yes, I saw that after looking into the codebase. Thanks for clarifying!
>>
>>> What exactly are you trying to do, and where are you stuck now? It would
>>> help if you posted a code snippet or something.
>>>
>>
>> In particular I am working on the following implementation [0], which
>> uses the following code:
>>
>>     TokenStream stream = analyzer.reusableTokenStream(key.toString(),
>>             new StringReader(sContent.toString()));
>>
>> Of note here is that the analyzer object is instantiated as a
>> DefaultAnalyzer [1]. The analyzer.reusableTokenStream API is deprecated,
>> as you've noted, so I am wondering what the suggested API semantics are
>> in order to achieve the desired upgrade.
>> Thanks in advance again for any input.
>> Lewis
>>
>> [0]
>> https://github.com/DigitalPebble/behemoth/blob/master/mahout/src/main/java/com/digitalpebble/behemoth/mahout/LuceneTokenizerMapper.java#L52-L53
>> [1]
>> http://svn.apache.org/repos/asf/mahout/tags/mahout-0.7/core/src/main/java/org/apache/mahout/vectorizer/DefaultAnalyzer.java
>
> --
> *Lewis*

--
*Lewis*
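
For anyone reading this thread later: the consumer workflow in the mapper at the top of this message (reset, incrementToken in a loop, end, then close in a finally block) can be tried out without Lucene on the classpath. The sketch below is only a stand-in model of that call order. SimpleTokenStream, term() and tokenize() are hypothetical names invented for this illustration and are not Lucene classes or methods; the lower-casing and the split on non-alphanumeric characters are simplifying assumptions and do not reproduce what StandardAnalyzer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Stand-in model of the reset/incrementToken/end/close call order. Not Lucene code. */
final class TokenStreamContractSketch {

    /** Hypothetical stream that refuses incrementToken() before reset(). */
    static final class SimpleTokenStream implements AutoCloseable {
        private final String text;
        private int pos;
        private boolean ready;   // true between reset() and end()/close()
        private boolean closed;
        private String term = "";

        SimpleTokenStream(String text) {
            this.text = text;
        }

        /** Must be called once before the first incrementToken(). */
        void reset() {
            if (closed) {
                throw new IllegalStateException("stream is closed");
            }
            pos = 0;
            ready = true;
        }

        /** Advances to the next run of letters/digits; returns false at end of input. */
        boolean incrementToken() {
            if (!ready) {
                throw new IllegalStateException("call reset() before incrementToken()");
            }
            while (pos < text.length() && !Character.isLetterOrDigit(text.charAt(pos))) {
                pos++;
            }
            if (pos >= text.length()) {
                return false;
            }
            int start = pos;
            while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos))) {
                pos++;
            }
            term = text.substring(start, pos).toLowerCase(Locale.ROOT);
            return true;
        }

        /** Current term; plays the role CharTermAttribute plays in the mapper. */
        String term() {
            return term;
        }

        /** End-of-stream bookkeeping; the stream needs reset() again after this. */
        void end() {
            ready = false;
        }

        @Override
        public void close() {
            closed = true;
            ready = false;
        }
    }

    /** Same call order as the mapper: reset, loop, end, close in finally. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        SimpleTokenStream ts = new SimpleTokenStream(text);
        try {
            ts.reset();
            while (ts.incrementToken()) {
                if (!ts.term().isEmpty()) {
                    terms.add(ts.term());
                }
            }
            ts.end();
        } finally {
            ts.close();
        }
        return terms;
    }
}
```

Calling TokenStreamContractSketch.tokenize("Hello, Mahout 0.10!") yields the terms hello, mahout, 0 and 10. Skipping reset() makes the stand-in throw IllegalStateException, which, as far as I understand it, is the kind of failure the stricter TokenStream workflow mentioned earlier in the thread produces when the call order is not followed.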