opennlp-dev mailing list archives

From Mark G <ma...@apache.org>
Subject Re: request for Input or ideas.... EntityLinker tickets
Date Sun, 03 Nov 2013 01:22:41 GMT
I finished the Lucene indexing of the gazetteers; I just need to tie the
indexes into the gaz lookups, which is fairly simple. Do you all think I
should drop the MySQL dependency entirely and just use Lucene? The Lucene
index files are only about 2.5 GB total, so they are very manageable to
distribute across a cluster. I could keep the MySQL classes as an option,
but at this point the Lucene-based approach is really growing on me.
If I don't hear from anyone, I am going to remove the MySQL implementation.
Thanks
MG
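
If the MySQL classes do stay as an option, the choice could be driven by a
properties file, along the lines of the EntityLinkerProperties idea quoted
further down. The following is only a rough sketch: the class name, the
Backend enum, and the keys "gazetteer.backend" and
"gazetteer.lucene.indexDir" are made up for illustration and are not the
actual OpenNLP property names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Locale;
import java.util.Properties;

public class GazetteerBackendConfigSketch {

    // Hypothetical set of backends; not part of the OpenNLP API.
    enum Backend { LUCENE, MYSQL }

    // Reads the (hypothetical) backend key, defaulting to Lucene when absent.
    static Backend backendFrom(Properties props) {
        String value = props.getProperty("gazetteer.backend", "lucene");
        return Backend.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // Reads the (hypothetical) index directory key; null when not configured.
    static String indexDirFrom(Properties props) {
        return props.getProperty("gazetteer.lucene.indexDir");
    }

    public static void main(String[] args) throws IOException {
        // Example file contents; the path is a placeholder.
        String text = "gazetteer.backend=lucene\n"
                    + "gazetteer.lucene.indexDir=/opt/gaz/index\n";
        Properties props = new Properties();
        props.load(new StringReader(text));
        System.out.println(backendFrom(props) + " at " + indexDirFrom(props));
    }
}
```

With a switch like this, removing MySQL later would mean deleting one enum
value and its implementation rather than changing every caller.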


On Wed, Oct 30, 2013 at 7:34 PM, Lance Norskog <goksron@gmail.com> wrote:

> Just to elaborate: RAMDirectory storage lives on the Java heap, so it is
> managed by the garbage collector. This makes Java GC work very hard. A
> memory-mapped file is a write-through cache for file contents. The memory
> in the cache is outside of Java garbage collection. A memory-mapped index
> will take a little less time to create at these volumes. Loading a
> pre-built memory-mapped index will take under 5 seconds.
>
>
> On 10/29/2013 03:43 PM, Mark G wrote:
>
>> Thanks, that was my next option with Lucene. Build the indexes from the
>> gaz files and keep them up to date in one place, and make sure something
>> like Puppet distributes them to each node in a cluster on some interval;
>> then each task (MapReduce or whatever) can use that file resource. I'll
>> let everyone know how it goes.
>> MG
>>
>>
>> On Tue, Oct 29, 2013 at 6:06 PM, Lance Norskog <goksron@gmail.com> wrote:
>>
>>> This is what memory-mapped file indexes are for! RAMDirectory is for very
>>> small projects.
>>>
>>>
>>> On 10/29/2013 04:00 AM, Mark G wrote:
>>>
>>>> FYI, I implemented an in-memory Lucene index of the NGA Geonames. It
>>>> used almost 7 GB of RAM and took about 40 minutes to load. Still
>>>> looking at other DBs/indexes. So one would need at least 10 GB of RAM
>>>> to hold the USGS and NGA gazetteers.
>>>>
>>>>
>>>> On Fri, Oct 25, 2013 at 6:21 AM, Mark G <giaconiamark@gmail.com> wrote:
>>>>
>>>>> I wrote a quick Lucene RAMDirectory in-memory index. It looks like a
>>>>> valid option to hold the gazetteers, and it provides good text search
>>>>> of course. The idea is that at runtime the geoentitylinker would pull
>>>>> three files off disk (the NGAGeonames file, the USGS file, and the
>>>>> CountryContext indicator file) and index them in memory with Lucene.
>>>>> Initially this will take a while. So, deployment-wise, you would have
>>>>> to use your tool of choice (e.g. Puppet) to distribute the files to
>>>>> each node, or mount a share on each node. My concern with this
>>>>> approach is that each MR task runs in its own JVM, so each task on
>>>>> each node will consume this much memory unless you do something
>>>>> interesting with memory mapping. The EntityLinkerProperties file will
>>>>> support the config of the file locations and whether to use DB or
>>>>> in-memory Lucene...
>>>>>
>>>>> I am also working on a Postgres version of the gazetteer structures and
>>>>> stored procs.
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>
>>>>> On Wed, Oct 23, 2013 at 8:46 AM, Jörn Kottmann <kottmann@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> On 10/23/2013 01:14 PM, Mark G wrote:
>>>>>>
>>>>>>> All that being said, it is totally possible to run an in-memory
>>>>>>> version of the gazetteer. Personally, I like the DB approach; it
>>>>>>> provides a lot of flexibility and power.
>>>>>>>
>>>>>> Yes, and you can even use a DB to run in-memory, which works with the
>>>>>> current implementation. I think I will experiment with that.
>>>>>>
>>>>>> I don't really mind using 3 GB memory for it, since my Hadoop servers
>>>>>> have more than enough anyway,
>>>>>> and it makes the deployment easier (don't have to deal with installing
>>>>>> MySQL
>>>>>> databases and keeping them in sync).
>>>>>>
>>>>>> Jörn
>>>>>>
>>>>>>
>>>>>>
>
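
Lance's point about memory-mapped files living outside Java garbage
collection can be seen with plain java.nio, without Lucene on the
classpath. This is a minimal sketch, not the EntityLinker code: the class
name, the temp file, and the sample rows are invented for illustration.
Lucene offers the same mechanism for index files through its MMapDirectory
class.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedGazetteerSketch {

    // Maps an existing file read-only. The mapped pages are backed by the
    // OS page cache rather than the Java heap, so the garbage collector
    // never has to scan or copy the file contents.
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    // Returns the bytes before the first '\n' (or the whole buffer) as text.
    static String firstLine(MappedByteBuffer buf) {
        int end = 0;
        while (end < buf.limit() && buf.get(end) != '\n') {
            end++;
        }
        byte[] bytes = new byte[end];
        ByteBuffer view = buf.duplicate(); // independent position, same memory
        view.position(0);
        view.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Invented sample rows standing in for a gazetteer file.
        Path file = Files.createTempFile("gaz-sketch", ".txt");
        Files.write(file, "Springfield\tUS\nParis\tFR\n"
                .getBytes(StandardCharsets.UTF_8));

        MappedByteBuffer buf = mapReadOnly(file);
        System.out.println("direct (off-heap) buffer: " + buf.isDirect());
        System.out.println("first row: " + firstLine(buf));
    }
}
```

Because the pages belong to the OS cache, several JVMs on one node that map
the same file can share the physical memory, which speaks to the concern
about every MR task holding its own copy.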
