lucene-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1284) Set of Java classes that allow the Lucene search engine to use morphological information developed for the Apertium open-source machine translation platform (http://www.apertium.org)
Date Fri, 10 Apr 2009 20:00:15 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12697952#action_12697952 ]

Otis Gospodnetic commented on LUCENE-1284:
------------------------------------------

Hi Felipe,

OK, I looked at this some more.  So the Java code you contributed is under the ASL, and
Apertium's tools (and data?) are GPL v2?

The thing that puzzles me is the language pairs themselves.  Why are they in pairs?  Is that
simply for the translation part of Apertium, and something that's ignored when you use the
pair for Lucene and morphological analysis?

If I'm interested in, say, a French morphological analyzer, why do I need any other language?
For French, I see:

* br-fr
* en-fr
* fr-ca
* fr-es

If I'm interested in French, which of the 4 above is the right one to use?  The one with the
highest number of lemmata?

I had a look at the Indexer and Searcher to get an idea about the usage.  Those classes are
really just for demonstration, right?  Still, do you mind replacing the deprecated Hits object
in the Searcher class?

In the README you mention this:
{quote}
2. The Spanish morphological dictionary must be preprocessed in advance to remove multiword
expressions:

$ java -classpath lucene-apertium-morph-2.4-dev.jar \
  org.apache.lucene.apertium.tools.RemoveMultiWordsFromDix \
  --dix apertium-es-ca.es.dix  > apertium-es-ca.es-nomw.dix
{quote}

Could you explain why the removal of multiword expressions is needed?
Is that Spanish-specific or something one needs to do regardless of the language?

Also:
{quote}
4. Each file to be indexed must be preprocessed using the Apertium tools:

$ cat file.txt | apertium-destxt | lt-proc -a es-ca-nomw.automorf.bin | \
  apertium-tagger -g -f es-ca.prob > file.pos.txt
{quote}

So these are a few command-line tools that end up marking up the input text with POS tags? (I
seem to be missing some libraries and can't compile Apertium locally to check what this
marked-up file looks like.)
But my main question here is whether there are Java equivalents of these command-line tools,
so that one can easily use them from Java?  Or is one forced to use Runtime.exec(...)?
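
For reference, here is roughly what the Runtime.exec(...) route would look like from Java. This
is only a sketch under assumptions: the class and method names are hypothetical (they are not
part of the contributed code or of Apertium), it assumes the three Apertium tools from the
README are installed and on the PATH, and it relies on ProcessBuilder.startPipeline, which
needs Java 9 or later.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: not part of the contributed code or of Apertium.
final class ApertiumPipeline {

    // The three commands from README step 4, as argument lists (no shell quoting needed).
    static List<List<String>> buildCommands(String automorfBin, String probFile) {
        return Arrays.asList(
            Arrays.asList("apertium-destxt"),
            Arrays.asList("lt-proc", "-a", automorfBin),
            Arrays.asList("apertium-tagger", "-g", "-f", probFile));
    }

    // Runs the commands as one pipeline: input file -> cmd1 | cmd2 | ... -> output file.
    // Returns 0 if every process exited with 0, otherwise the last non-zero exit code seen.
    static int run(List<List<String>> commands, File input, File output)
            throws IOException, InterruptedException {
        List<ProcessBuilder> builders = new ArrayList<>();
        for (List<String> command : commands) {
            // Forward each tool's stderr to our own stderr so failures stay visible.
            builders.add(new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT));
        }
        builders.get(0).redirectInput(input);
        builders.get(builders.size() - 1).redirectOutput(output);

        // Java 9+: connects the stdout of each process to the stdin of the next one.
        List<Process> processes = ProcessBuilder.startPipeline(builders);
        int exitCode = 0;
        for (Process process : processes) {
            int code = process.waitFor();
            if (code != 0) {
                exitCode = code;
            }
        }
        return exitCode;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println(
                "usage: ApertiumPipeline <automorf.bin> <tagger.prob> <in.txt> <out.pos.txt>");
            return;
        }
        List<List<String>> commands = buildCommands(args[0], args[1]);
        int exitCode = run(commands, new File(args[2]), new File(args[3]));
        System.exit(exitCode);
    }
}
```

Passing each command as an argument list avoids going through a shell, so file names with
spaces need no quoting. It still forks three native processes per file, which is why Java
equivalents of the tools would be preferable.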

Thanks.

> Set of Java classes that allow the Lucene search engine to use morphological information
> developed for the Apertium open-source machine translation platform (http://www.apertium.org)
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1284
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1284
>             Project: Lucene - Java
>          Issue Type: New Feature
>         Environment: New feature developed under GNU/Linux, but it should work on any
>                      other Java-compliant platform
>            Reporter: Felipe Sánchez Martínez
>            Assignee: Otis Gospodnetic
>         Attachments: apertium-morph.0.9.0.tgz
>
>
> Set of Java classes that allow the Lucene search engine to use morphological information
> developed for the Apertium open-source machine translation platform (http://www.apertium.org).
> Morphological information is used to index new documents and to process smarter queries in
> which morphological attributes can be used to specify query terms.
> The tool makes use of morphological analyzers and dictionaries developed for the open-source
> machine translation platform Apertium (http://apertium.org) and, optionally, the part-of-speech
> taggers developed for it. Currently there are morphological dictionaries available for Spanish,
> Catalan, Galician, Portuguese, Aranese, Romanian, French and English. In addition, new
> dictionaries are being developed for Esperanto, Occitan, Basque, Swedish, Danish, Welsh,
> Polish and Italian, among others; we hope more language pairs will be added to the Apertium
> machine translation platform in the near future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


