mahout-user mailing list archives

From Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Subject Re: Replacement for DefaultAnalyzer
Date Mon, 11 May 2015 21:45:01 GMT
Hi Suneel,
Just for context, I've implemented the following.

    @Override
    protected void map(Text key, BehemothDocument value, Context context)
            throws IOException, InterruptedException {
        String sContent = value.getText();
        if (sContent == null) {
            // No text available? Skip this document.
            context.getCounter("LuceneTokenizer", "BehemothDocWithoutText")
                    .increment(1);
            return;
        }
        analyzer = new StandardAnalyzer(matchVersion); // or any other analyzer
        TokenStream ts = analyzer.tokenStream(key.toString(),
                new StringReader(sContent));
        // The Analyzer class will construct the Tokenizer, TokenFilter(s),
        // and CharFilter(s), and pass the resulting Reader to the Tokenizer.
        @SuppressWarnings("unused")
        OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        StringTuple document = new StringTuple();
        try {
            ts.reset(); // Resets this stream to the beginning. (Required)
            while (ts.incrementToken()) {
                if (termAtt.length() > 0) {
                    document.add(new String(termAtt.buffer(), 0, termAtt.length()));
                }
            }
            ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
        } finally {
            ts.close(); // Release resources associated with this stream.
        }
        context.write(key, document);
    }
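As an aside, since the createComponents route is discussed further down the thread: here is a minimal sketch of what a reusable custom Analyzer could look like against the Lucene 4.x API. The class name and the filter chain below are purely illustrative (not part of the Behemoth code); the point is that subclasses now override createComponents rather than the removed reusableTokenStream.

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Illustrative sketch only: a minimal Analyzer for the Lucene 4.x API.
// tokenStream() is final in 4.x; subclasses implement createComponents(),
// and Lucene reuses the components per thread automatically.
public class BehemothTextAnalyzer extends Analyzer {
    private final Version matchVersion;

    public BehemothTextAnalyzer(Version matchVersion) {
        this.matchVersion = matchVersion;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Tokenizer is the source; wrap it in whatever TokenFilters are needed.
        Tokenizer source = new StandardTokenizer(matchVersion, reader);
        TokenStream filter = new LowerCaseFilter(matchVersion, source);
        return new TokenStreamComponents(source, filter);
    }
}
```

With this in place, the mapper above could instantiate the custom analyzer once (e.g. in setup()) instead of creating a StandardAnalyzer per call, and keep calling analyzer.tokenStream(field, reader) as before.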

I'll be testing and will update if anything else comes up.
Thanks
Lewis


On Mon, May 11, 2015 at 2:12 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> I found Mike's blog post regarding Lucene 4.X from a while ago [0].
> In the 'Other Changes' section Mike states "Analyzers must always
> provide a reusable token stream, by implementing the
> Analyzer.createComponents method (reusableTokenStream has been removed
> and tokenStream is now final, in Analyzer)."
> This provides a good bit more context, therefore I'm going to continue on
> the createComponents route with the aim of implementing the newer 4.X Lucene
> API.
> In the meantime, if you have any updates or a code sample it would be
> very much appreciated.
> Thanks
> Lewis
>
> [0]
> http://blog.mikemccandless.com/2012/07/lucene-400-alpha-at-long-last.html
>
> On Mon, May 11, 2015 at 2:03 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Hi Suneel,
>>
>> On Sat, May 9, 2015 at 11:21 AM, Suneel Marthi <smarthi@apache.org>
>> wrote:
>>
>>> Mahout 0.9 and 0.10.0 are using Lucene 4.6.1. There's been a change in
>>> the
>>> TokenStream workflow in Lucene post-Lucene 4.5.
>>>
>>
>> Yes I know that after looking into the codebase. Thanks for clarifying!
>>
>>
>>>
>>> What exactly are you trying to do and where is it you are stuck now? It
>>> would help if you posted a code snippet or something.
>>>
>>>
>> In particular I am working on the following implementation [0] which uses
>> the following code
>>
>> TokenStream stream = analyzer.reusableTokenStream(key.toString(),
>>         new StringReader(sContent.toString()));
>>
>> Of note here is that the analyzer object is instantiated as type
>> DefaultAnalyzer [1]. As you've noted, the analyzer.reusableTokenStream
>> API is deprecated, so I am wondering what the suggested API semantics
>> are in order to achieve the desired upgrade.
>> Thanks in advance again for any input.
>> Lewis
>>
>> [0]
>> https://github.com/DigitalPebble/behemoth/blob/master/mahout/src/main/java/com/digitalpebble/behemoth/mahout/LuceneTokenizerMapper.java#L52-L53
>> [1]
>> http://svn.apache.org/repos/asf/mahout/tags/mahout-0.7/core/src/main/java/org/apache/mahout/vectorizer/DefaultAnalyzer.java
>>
>>
>>
>
>
>
> --
> *Lewis*
>



-- 
*Lewis*
