lucene-solr-user mailing list archives

From Sreenivasa Kallu <sreenivasaka...@gmail.com>
Subject Re: outlook email file pst extraction problem
Date Tue, 01 Mar 2016 00:16:40 GMT
Thanks Timothy for your prompt help.

 I tried the first option. I am able to extract .eml (MIME format) files
from the PST file using the libpst library, but I am not able to extract
.msg (Outlook email) files with it. I am able to feed the .eml files into
Solr.
 However, some tags are missing when Solr extracts the .eml files. In
particular, the date tags that are generated for .msg files are missing for
.eml files. My Solr program stopped working because of the missing date
tags; the same program worked fine with .msg files. Any suggestions on how
to generate date tags with .eml files? Would it be a good idea to look at
JPST or Aspose (both are third-party libraries that extract .msg files from
a PST file) for this case?

Thanks in advance.

--sreenivasa kallu

On Thu, Feb 11, 2016 at 11:55 AM, Allison, Timothy B. <tallison@mitre.org>
wrote:

> I should have looked at how we handle PSTs before my earlier response, sorry.
>
> What you're seeing is Tika's default treatment of embedded documents: it
> concatenates them all into one string.  It'll do the same thing for zip
> files and other container files.  The default Tika format is xhtml, and we
> include tags that show you where the attachments are.  If the tags are
> stripped, then you only get a big blob of text, which is often all that's
> necessary for search.
>
> Before SOLR-7189, you wouldn't have gotten any content, so that's
> progress...right?
>
> Some options for now:
> 1) use java-libpst as a preprocessing step to extract contents from your
> psts before you ingest them in Solr (feel free to borrow code from our
> OutlookPSTParser).
> 2) use tika from the command line with the -J -t options to get a JSON
> representation of the overall file, which includes a list of maps, where
> each map represents a single embedded file.  Again, if you have any
> questions on this, head over to user@tika.apache.org
>
> I think what you want is something along the lines of SOLR-7229, which
> would treat each embedded document as its own document.  That issue is not
> resolved, and there's currently no way of doing this within DIH that I'm
> aware of.
>
> If others on this list have an interest in SOLR-7229, let me know, and
> I'll try to find some time.  I'd need feedback on some design decisions.
>
>
>
>
>
> -----Original Message-----
> From: Sreenivasa Kallu [mailto:sreenivasakallu@gmail.com]
> Sent: Thursday, February 11, 2016 1:43 PM
> To: solr-user@lucene.apache.org
> Subject: outlook email file pst extraction problem
>
> Hi,
>        I am currently indexing individual Outlook messages, and searching
> is working fine.
> I created the Solr core with the following command:
>  ./solr create -c sreenimsg1 -d data_driven_schema_configs
>
> I am using the following command to index individual messages:
>
> curl "http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true" \
>   -F "myfile=@/home/ec2-user/msg9.msg"
>
> This setup is working fine.
>
> But the new requirement is to extract messages from an Outlook PST file.
> I tried the following command to extract messages from a PST file:
>
> curl "http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true" \
>   -F "myfile=@/home/ec2-user/sateamc_0006.pst"
>
> This command extracts only the high-level tags and merges all messages
> into one. I am not getting all the tags that I get when extracting
> individual messages. Is the above command correct? Is the problem that it
> does not use recursion? How do I add recursion to the above command? Or is
> it a Tika library problem?
>
> Please help me solve the above problem.
>
> Thanks in advance.
>
> --sreenivasa kallu
>
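
For option 2 in the quoted reply (tika with -J -t), the JSON output is described there as a list of maps, one per embedded file. A minimal sketch of turning that list into one record per message follows. The key name X-TIKA:content and the function name split_tika_json are assumptions, not taken from this thread, and should be checked against the output of the Tika version in use.

```python
import json

# Assumed key under which Tika stores the extracted text of each entry.
CONTENT_KEY = "X-TIKA:content"


def split_tika_json(json_text, id_prefix):
    """Turn the JSON list of maps into one record per embedded file.
    Each record gets a generated id, the extracted text, and the remaining metadata."""
    entries = json.loads(json_text)
    docs = []
    for index, entry in enumerate(entries):
        # Everything except the text content is kept as metadata.
        metadata = {k: v for k, v in entry.items() if k != CONTENT_KEY}
        docs.append({
            "id": "%s-%d" % (id_prefix, index),
            "content": entry.get(CONTENT_KEY, ""),
            "metadata": metadata,
        })
    return docs
```

Each record could then be indexed as its own Solr document, which approximates the per-message behavior that SOLR-7229 asks for.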
