hivemall-issues mailing list archives

From takuti <...@git.apache.org>
Subject [GitHub] incubator-hivemall issue #82: Encoding related bug in LDA.
Date Fri, 02 Jun 2017 08:41:40 GMT
Github user takuti commented on the issue:

    https://github.com/apache/incubator-hivemall/pull/82
  
    @helenahm If I understand correctly, that behavior sounds natural. Make sure you distinguish
between a "**null**" and an "**empty**" document. A **null** document must not exist in your document
collection (otherwise, as a workaround, you need to manually add the `AND ln != null` clause, as you
already tried; i.e., carry out the preprocessing step carefully).
    
    In fact, since `doc#3` is **null**, the following query throws the exception you described:
    
    ```sql
    with docs as (
      select docid, doc
      from (
        select 1 as docid, "Fruits and vegetables are healthy naïve." as doc
        union all
        select 2 as docid, "I like apples, oranges, and avocados. I do not like the flu or colds." as doc
        union all
        select 3 as docid, null as doc
      ) t1
    ),
    word_counts as (
      select
        docid,
        feature(word, count(word)) as f
      from docs t1 LATERAL VIEW explode(tokenize(doc, true)) t2 as word
      where
        not is_stopword(word)
      group by
        docid, word
    )
    select label, word, avg(lambda) as lambda
    from (
      select
        -- train_plsa(feature, "-topics 2 -eps 0.00001 -iter 2048 -alpha 0.01") as (label, word, lambda)
        train_lda(feature, "-topics 2 -iter 20") as (label, word, lambda)
      from (
        select docid, collect_set(f) as feature
        from word_counts
        group by docid
        -- order by docid
      ) t1
    ) t2
    group by label, word
    order by lambda desc
    ;
    ```
    
    However, if the document is just **empty**, it works:
    
    ```sql
    with docs as (
      select docid, doc
      from (
        select 1 as docid, "Fruits and vegetables are healthy naïve." as doc
        union all
        select 2 as docid, "I like apples, oranges, and avocados. I do not like the flu or colds." as doc
        union all
        select 3 as docid, "" as doc
      ) t1
    ),
    ...
    ```
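    
    As a rough sketch of the preprocessing workaround mentioned above, null documents can be
dropped inside the `docs` CTE before tokenization. Note that in SQL a comparison such as
`doc != null` never evaluates to true, so this sketch assumes the `is not null` predicate is the
intended filter; the trailing `select` is only there to make the snippet self-contained:
    
    ```sql
    with docs as (
      select docid, doc
      from (
        select 1 as docid, "Fruits and vegetables are healthy." as doc
        union all
        select 3 as docid, null as doc
      ) t1
      -- Assumption: null documents are simply discarded.
      -- Alternatively, coalesce(doc, "") would turn them into empty documents.
      where doc is not null
    )
    select docid, doc from docs;
    ```
    
    With this filter in place, only `docid` 1 reaches the later `word_counts` step in this sketch.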

