lucene-solr-user mailing list archives

From "Allison, Timothy B." <talli...@mitre.org>
Subject RE: outlook email file pst extraction problem
Date Wed, 02 Mar 2016 12:47:06 GMT
This is probably more of a Tika question now...

It sounds like Tika is not extracting dates from the .eml files that you are generating? 
To confirm, you are able to extract dates with libpst...it is just that Tika is not able to
process the dates that you are sending it in your .eml files?

If you are able to share an .eml file (either via personal email or open a ticket on Tika's
jira if you think this is a bug in Tika), I can take a look.
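
One quick way to tell whether the dates are missing from the .eml files themselves, or are present but not picked up downstream, is to read the Date header directly. The sketch below uses only Python's standard email module. The function name eml_date and the sample message are made up for illustration; this is not part of libpst, Tika, or Solr.

```python
from email.parser import BytesParser
from email.utils import parsedate_to_datetime


def eml_date(raw):
    """Return the Date header of a raw .eml message as a datetime.

    Returns None when the header is absent or cannot be parsed.
    """
    # headersonly=True skips body parsing; only the header block is needed.
    msg = BytesParser().parsebytes(raw, headersonly=True)
    value = msg.get("Date")
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Different Python versions raise different errors on bad dates.
        return None


# Hypothetical sample message, for illustration only.
sample = (
    b"From: a@example.org\r\n"
    b"Subject: test\r\n"
    b"Date: Wed, 02 Mar 2016 12:47:06 +0000\r\n"
    b"\r\n"
    b"body\r\n"
)

if __name__ == "__main__":
    print(eml_date(sample))
```

If this returns None for the files libpst produced, the header is probably missing or malformed in the .eml itself; if it returns a date, the problem is more likely on the extraction side.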

-----Original Message-----
From: Sreenivasa Kallu [mailto:sreenivasakallu@gmail.com] 
Sent: Monday, February 29, 2016 7:17 PM
To: solr-user@lucene.apache.org
Subject: Re: outlook email file pst extraction problem

Thanks Timothy for your prompt help.

I tried the first option. I am able to extract .eml (MIME format) files from the PST file using
the libpst library.
I am not able to extract .msg (Outlook email) files using libpst. I am able to feed the
.eml files into Solr.
I can see that some tags are missing when the .eml files are extracted in Solr. In particular, the date
tags that are generated for .msg files are missing for the .eml files. How can I generate
date tags with .eml files?
My Solr program stopped working due to the lack of date tags; the same program worked fine with .msg
files. Any suggestions for generating date tags with .eml files? Is it a good idea to look at JPST
or Aspose (both are third-party libraries that extract .msg files from a PST file) for this case?

Thanks in advance.

--sreenivasa kallu

On Thu, Feb 11, 2016 at 11:55 AM, Allison, Timothy B. <tallison@mitre.org>
wrote:

> Should have looked at how we handle PSTs before my earlier response... sorry.
>
> What you're seeing is Tika's default treatment of embedded documents:
> it concatenates them all into one string.  It'll do the same thing for 
> zip files and other container files.  The default Tika format is 
> xhtml, and we include tags that show you where the attachments are.  
> If the tags are stripped, then you only get a big blob of text, which 
> is often all that's necessary for search.
>
> Before SOLR-7189, you wouldn't have gotten any content, so that's 
> progress...right?
>
> Some options for now:
> 1) use java-libpst as a preprocessing step to extract contents from 
> your psts before you ingest them in Solr (feel free to borrow code 
> from our OutlookPSTParser).
> 2) use Tika from the command line with the -J -t options to get a JSON 
> representation of the overall file, which includes a list of maps, 
> where each map represents a single embedded file.  Again, if you have 
> any questions on this, head over to user@tika.apache.org
>
> I think what you want is something along the lines of SOLR-7229, which 
> would treat each embedded document as its own document.  That issue is 
> not resolved, and there's currently no way of doing this within DIH 
> that I'm aware of.
>
> If others on this list have an interest in SOLR-7229, let me know, and 
> I'll try to find some time.  I'd need feedback on some design decisions.
>
>
>
>
>
> -----Original Message-----
> From: Sreenivasa Kallu [mailto:sreenivasakallu@gmail.com]
> Sent: Thursday, February 11, 2016 1:43 PM
> To: solr-user@lucene.apache.org
> Subject: outlook email file pst extraction problem
>
> Hi,
> I am currently indexing individual Outlook messages, and
> searching is working fine.
> I created the Solr core using the following command:
>  ./solr create -c sreenimsg1 -d data_driven_schema_configs
>
> I am using the following command to index individual messages:
>
> curl "http://localhost:8983/solr/sreenimsg/update/extract?literal.id=msg9&uprefix=attr_&fmap.content=attr_content&commit=true" \
>   -F "myfile=@/home/ec2-user/msg9.msg"
>
> This setup is working fine.
>
> But the new requirement is to extract messages from an Outlook PST file.
> I tried the following command to extract messages from the PST file:
>
> curl "http://localhost:8983/solr/sreenimsg1/update/extract?literal.id=msg7&uprefix=attr_&fmap.content=attr_content&commit=true" \
>   -F "myfile=@/home/ec2-user/sateamc_0006.pst"
>
> This command extracts only the high-level tags and merges all the
> messages into one. I am not getting all the tags that I get when extracting
> individual messages. Is the above command correct? Is the problem that it does not use recursion?
> How do I add recursion to the above command? Is it a Tika library problem?
>
> Please help me solve the above problem.
>
> Thanks in advance.
>
> --sreenivasa kallu
>
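
Option 2 in the quoted reply (Tika from the command line with -J -t) produces a list of maps, one per embedded file. A minimal sketch of turning that list into one record per message might look like the following. The key names "X-TIKA:content" and "X-TIKA:embedded_resource_path" are assumptions about Tika's recursive JSON output and should be checked against the actual output of your Tika version. The function name split_embedded and the sample data are hypothetical.

```python
import json

# Assumed key names in Tika's recursive JSON output; verify against real output.
CONTENT_KEY = "X-TIKA:content"
PATH_KEY = "X-TIKA:embedded_resource_path"


def split_embedded(tika_json):
    """Turn a JSON list of per-file metadata maps into one record per file.

    Each record has an 'id', the extracted 'content', and the remaining
    metadata fields under 'meta'.
    """
    records = json.loads(tika_json)
    docs = []
    for index, entry in enumerate(records):
        meta = dict(entry)  # copy so the parsed input is left untouched
        content = meta.pop(CONTENT_KEY, "")
        # Fall back to the list position when no embedded path is present.
        doc_id = meta.get(PATH_KEY, "/{}".format(index))
        docs.append({"id": doc_id, "content": content, "meta": meta})
    return docs


# Hypothetical sample shaped like a container plus one embedded message.
sample = json.dumps([
    {"Content-Type": "application/vnd.ms-outlook-pst", CONTENT_KEY: ""},
    {
        "Content-Type": "message/rfc822",
        PATH_KEY: "/msg-1",
        CONTENT_KEY: "hello",
    },
])

if __name__ == "__main__":
    for doc in split_embedded(sample):
        print(doc["id"], repr(doc["content"]))
```

Each resulting record could then be posted to Solr as its own document, which is roughly what SOLR-7229 describes doing inside Solr itself.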