lucene-java-user mailing list archives

From Paul Cowan <co...@aconex.com>
Subject Re: Multiple languages - possible approach
Date Fri, 17 Mar 2006 06:16:14 GMT
Hi Grant and Otis,

Thanks for the feedback, I appreciate it. You've given some good ideas.

> Sounds like a really interesting system!  I am curious, are your users 
> fluent in multiple languages or are you using some type of translation 
> component?

The former. We're talking about construction projects, where English is 
(generally) something of a lingua franca, as it were (a really big 
construction project these days might use Australian architects, British 
managers and UAE-based engineers on a project in Shanghai). So we might 
have an architect forwarding a message on to an engineer in English, who 
forwards it to the ground team in Shanghai in English, but they then 
discuss it amongst themselves in Chinese... all in the space of one 
forwarded email.

> How are you querying?  Are users entering mixed language queries too?  
<snip>

Good question(s). Automatically detecting the language at indexing time 
doesn't NECESSARILY help us with searching, as a query gives us a lot 
less text to work with. On the plus side, we can always ASK which 
language the user is searching in, with a drop-down or something; we 
can't really ask what language their correspondence is in, as it may be 
mixed.

Multiple indexes are an option, but we're very concerned about performance 
and size -- we're talking many, many millions of things to index, so having 
English/Chinese/Arabic/who knows what else indexes could be a nightmare.

> Also, is the text so finely delineated as your example?  We sometimes 
> run across the case where foreign languages will use other languages 
> (mostly English) mid-sentence and it makes things quite ugly.   Approach 
> 4 should handle this, though

Yeah, that's one of our worries. People often can't find the right word 
for what they want to say, etc., so they slip back into another language.
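
To make the mid-sentence switching problem concrete, here is a minimal 
sketch of splitting mixed text into runs by Unicode script. It is an 
illustration only: the class name ScriptRuns and its split method are 
hypothetical, it uses no Lucene API, and it assumes a Java version that 
provides java.lang.Character.UnicodeScript (Java 7 or later). Note that 
script is not the same thing as language (English and French are both 
Latin script), so this would only be a first step before choosing a 
per-run analyzer.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits text into runs of one script each.
public class ScriptRuns {

    /** A contiguous stretch of text written in a single Unicode script. */
    public static final class Run {
        public final UnicodeScript script;
        public final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }

        @Override
        public String toString() {
            return script + ":" + text;
        }
    }

    public static List<Run> split(String input) {
        List<Run> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        UnicodeScript currentScript = null;

        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            UnicodeScript script = UnicodeScript.of(cp);

            // Spaces, digits and punctuation (COMMON) and combining marks
            // (INHERITED) belong to no script, so they stay with the current run.
            boolean neutral = script == UnicodeScript.COMMON
                    || script == UnicodeScript.INHERITED;

            if (!neutral) {
                if (currentScript != null && script != currentScript) {
                    // Script changed: close the current run and start a new one.
                    runs.add(new Run(currentScript, current.toString()));
                    current.setLength(0);
                }
                currentScript = script;
            }
            current.appendCodePoint(cp);
        }

        if (current.length() > 0) {
            // Text made up only of neutral characters is reported as COMMON.
            UnicodeScript script = currentScript == null
                    ? UnicodeScript.COMMON : currentScript;
            runs.add(new Run(script, current.toString()));
        }
        return runs;
    }
}
```

For example, ScriptRuns.split("Please check the drawing 请检查图纸") yields 
two runs, one LATIN and one HAN, with the separating space attached to 
the Latin run.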

Anyway, thanks for that and the rest of the ideas. We think that 
StandardAnalyzer will do us for now (Chinese only); when we hit more 
complicated languages I'll come up with a plan/design for the "Super 
Analyzer" and post it to this list for discussion and/or flamewar.

Cheers,

Paul
