lucene-java-user mailing list archives

From Adrien Grand <jpou...@gmail.com>
Subject Re: Looking for more information about Lucene
Date Wed, 23 May 2018 06:53:45 GMT
Hi Alexandre,

I don't have time for a call, but to give you some pointers, Lucene does
the following that may be related to natural language processing:
 - Word segmentation via the `Tokenizer` class. It is rather simple for
Western languages (including French, see `StandardTokenizer`), but less so
for e.g. Japanese or Korean, which we also support.
 - We have a couple of stemmers implemented as `TokenFilter`s, including for
French; see the `org.apache.lucene.analysis.fr` package.
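
To give a feel for what word segmentation means for a Western language, here is a minimal sketch. It uses the JDK's `java.text.BreakIterator` rather than Lucene's `StandardTokenizer`, so it only illustrates the idea and is not how Lucene implements it; the class name `WordSegmentationSketch` and the method `words` are made up for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: JDK-only illustration of word segmentation, not Lucene code.
class WordSegmentationSketch {

    // Splits text at word boundaries and keeps only the segments that
    // contain at least one letter or digit (drops spaces and punctuation).
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Bonjour le monde, voici Lucene.", Locale.FRENCH));
    }
}
```

Whitespace and punctuation are enough to find the words here. Languages written without spaces between words need more than this, which is why segmentation is harder for them.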

More answers inline below:


On Tue, 22 May 2018 at 17:33, BABAUD Alexandre <
alexandre.babaud@soprasteria.com> wrote:

> - What exactly are the types of files the software is able to deal
> with?
>

Lucene doesn't deal with file types directly: you need to be able to pass it
a string or a stream of characters. If you have a text file, this is easy. If
you have PDF files, you will need a third-party library such as Tika to
extract the content.
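
For the plain-text case, a minimal JDK-only sketch of getting a string or a stream of characters out of a file could look like the following. The class name `TextInputSketch`, its method names, and the UTF-8 encoding are assumptions for this example; no Lucene call is shown.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: turns a text file into the kind of input described above.
class TextInputSketch {

    // Whole file as a String; fine for small files. Assumes UTF-8 content.
    static String readAsString(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // File as a stream of characters; the caller is responsible for closing it.
    static Reader openAsReader(Path file) throws IOException {
        return Files.newBufferedReader(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".txt");
        Files.write(tmp, "Bonjour Lucene".getBytes(StandardCharsets.UTF_8));
        System.out.println(readAsString(tmp));
        try (Reader reader = openAsReader(tmp)) {
            System.out.println((char) reader.read());
        }
        Files.delete(tmp);
    }
}
```

The `Reader` variant avoids loading a large file into memory at once.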


> - What about data storage? Is it stored in-house? (I am very
> concerned about data privacy)
>
Not really relevant: it's up to you to decide where you store your data.

> - Is it easily customizable?
>
Since Lucene is a library, I guess the answer is yes.
