How and why to integrate DocShifter with Captiva

In this blog post I’ll first clarify several features of Captiva Capture and discuss the pros and cons of its e-mail import module. Secondly, I’ll shed light on the successful integration of Captiva and DocShifter, which offers a Capture solution. Lastly, I’ll illustrate some extra benefits DocShifter brings to the table.

Captiva’s Role

Captiva Capture’s main goal is to “transform all forms of human readable content into actionable information”. To achieve this goal, Captiva captures and processes documents from a variety of sources, such as scanners, fax servers, e-mail servers, file systems and web services. Once the data are imported, Captiva classifies the documents and extracts information, which can then be exported to a variety of export locations. Making information immediately available to business departments speeds up business processes. Captiva also minimises processing errors, improves data accuracy and helps businesses reduce their paper handling steps.

E-mail import, the tricky part

When importing information from sources such as scanners and fax servers, the structure of the received information is known and defined in advance. These input channels are predictable because their data can be seen as a digital rendition of printed documents, which allows an easy and controlled import. The structure of data imported through file systems and web services is defined by, or for, the component that will ultimately import them, so these data have a known structure as well. There is one source in particular, however, where the structure of the information is controlled by the sender: the e-mail server. It therefore requires extra attention upon import.

The format of e-mail is defined by a series of RFCs, in particular the MIME specifications. An e-mail can contain many kinds of data: message bodies with multiple parts and inline artefacts, and attachments in various formats, including non-text attachments such as audio, video, images and application programs. The e-mail body itself can also come in various formats. E-mails are generally sent by an e-mail client, and because the list of clients is extensive, not every client formats an e-mail body the same way by default. Any data structure that complies with MIME can be forwarded by an e-mail server and will ultimately reach the e-mail import component of Captiva Capture.
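To make this variety concrete, here is a minimal sketch that uses Python's standard `email` package to list the leaf parts of a MIME message. It has nothing to do with how Captiva or DocShifter work internally; the sample message, the addresses and the function names `build_sample` and `describe_parts` are made up for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage
from typing import List, Optional, Tuple


def build_sample() -> bytes:
    """Build a small e-mail with a plain body, an HTML body and one attachment."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "capture@example.com"
    msg["Subject"] = "Invoice"
    msg.set_content("Plain text body")
    msg.add_alternative("<p>Rich <b>HTML</b> body</p>", subtype="html")
    msg.add_attachment(b"%PDF-1.4 dummy", maintype="application",
                       subtype="pdf", filename="invoice.pdf")
    return msg.as_bytes()


def describe_parts(raw: bytes) -> List[Tuple[str, Optional[str]]]:
    """Return (content type, filename) for every non-container MIME part."""
    msg = message_from_bytes(raw, policy=policy.default)
    return [(part.get_content_type(), part.get_filename())
            for part in msg.walk() if not part.is_multipart()]


if __name__ == "__main__":
    for content_type, filename in describe_parts(build_sample()):
        print(content_type, filename)
```

Even this tiny message already contains three leaf parts nested inside two multipart containers, and an import component has to decide what to do with each of them.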

All of these elements need to be kept in mind when developing an e-mail import component. Unfortunately, as already mentioned, the main value of Captiva Capture lies in the processing and digitising of the information included in the captured data. Captiva Capture does not identify and take into account all the different exceptions of content and data that are imported from various sources, including e-mail.

So, what’s wrong with it?

In order to correctly process e-mails sent by an e-mail client, the features supported by the client must also be supported by the e-mail import component. This is the only way to guarantee that every e-mail can be imported appropriately by the component.

With the release of Captiva Capture 7. early 2015, the Captiva e-mail import module received an update. Unfortunately, this opportunity was not seized to add basic features that could make things easier. Firstly, the module does not support the commonly used rich text format. The only option left is to process e-mails using plain text. This causes the markup and inline attachments to be completely wiped out, resulting in unrepresentative output.

Before the update, there was no way to read or convert attached EML/MSG files out of the box. The new version of Captiva Capture does bring some support for mail in mail in Microsoft’s MSG format, but not in the standard EML format. This makes it possible to append the e-mail body of any attached MSG to the body of the main e-mail. Luckily, it also does this for an MSG attached to an attached MSG (mail in mail in mail). However, it does not yet support the processing of any other attachments of an attached MSG, such as a PDF/doc/…, let alone the attachments of an MSG nested inside an attached MSG.
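The mail in mail problem is essentially a recursion problem. The sketch below shows the idea for the standard EML case (`message/rfc822` attachments), using Python's standard `email` package. It does not read Outlook MSG files, which the standard library cannot parse, and the function name `collect_bodies` is an assumption made for this example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def collect_bodies(msg, depth=0):
    """Return (depth, subject, plain-text body) for msg and every nested e-mail."""
    body = msg.get_body(preferencelist=("plain",))
    text = body.get_content().strip() if body is not None else ""
    found = [(depth, str(msg["Subject"]), text)]
    for part in msg.iter_attachments():
        if part.get_content_type() == "message/rfc822":
            # With policy.default, get_content() returns the nested message.
            found.extend(collect_bodies(part.get_content(), depth + 1))
    return found


def make_mail(subject, text, attached=None):
    """Build an e-mail and optionally attach another e-mail to it."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg.set_content(text)
    if attached is not None:
        msg.add_attachment(attached)  # stored as a message/rfc822 part
    return msg


if __name__ == "__main__":
    inner = make_mail("Level 2", "inner body")
    middle = make_mail("Level 1", "middle body", attached=inner)
    outer = make_mail("Level 0", "outer body", attached=middle)
    parsed = message_from_bytes(outer.as_bytes(), policy=policy.default)
    for depth, subject, text in collect_bodies(parsed):
        print("  " * depth + subject + ": " + text)
```

The same walk could collect the non-mail attachments at every level, which is exactly the part the text above describes as unsupported.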

The final issue concerns the e-mail import module’s lack of flexibility in detecting which attachments can or need to be converted. Captiva can only handle a limited number of file formats and has no ability to detect whether a file in a supported format can actually be converted (think of corrupt or password protected files, or files without an extension). Yet identifying, excluding or appropriately handling such exceptions would prevent a great deal of errors in the long run.
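As an illustration of what such a check could look like, the sketch below inspects the first bytes of a file instead of trusting its extension, and applies two rough heuristics to PDF files. This is an assumption-laden example, not the detection logic of Captiva or DocShifter: the signature table is deliberately short, the function name `triage` is made up, and searching for `/Encrypt` or `%%EOF` is only a crude hint, not a reliable test.

```python
# A few well-known file signatures (magic bytes); a real list would be far longer.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip-based (e.g. docx, xlsx)",
    b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "ole2 (e.g. doc, msg)",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def triage(data: bytes):
    """Return (detected type, decision) based on content, not on the file name."""
    kind = next((name for sig, name in SIGNATURES.items()
                 if data.startswith(sig)), None)
    if kind is None:
        return ("unknown", "reject: unrecognised content")
    if kind == "pdf":
        # Heuristic: a complete PDF normally ends with an %%EOF marker.
        if b"%%EOF" not in data[-1024:]:
            return (kind, "reject: truncated or corrupt")
        # Heuristic: encrypted PDFs reference an /Encrypt dictionary.
        if b"/Encrypt" in data:
            return (kind, "reject: password protected")
    return (kind, "convert")


if __name__ == "__main__":
    print(triage(b"%PDF-1.4\n1 0 obj\nendobj\n%%EOF\n"))
    print(triage(b"%PDF-1.4\n1 0 obj\n"))
    print(triage(b"hello world"))
```

Because the decision is based on content, a file without an extension is handled the same way as any other file.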

Getting E-mail right

Many of Captiva’s e-mail issues can be solved by combining it with DocShifter. But how? Well, DocShifter is a document conversion platform that contains several e-mail import, processing and export modules. These modules import and analyse e-mails, and deliver a package that Captiva can import and process without any errors.

First and foremost, there is a DocShifter e-mail import module that is capable of importing e-mails from an e-mail server using POP3 and IMAP (+SSL), while other protocols are being added. Next up, the e-mail processing module can be used to analyse the e-mails. It detects which attachments need to be converted and, more importantly, which ones can be converted. As DocShifter is capable of converting almost any file format, it is not bound to the file formats supported by Captiva. The module generates a PDF rendition for each attachment that was detected as convertible, and for the e-mail body of the main e-mail and of every attached e-mail. These PDF renditions display the e-mail in the same way as it would be printed directly from the e-mail client (including markup and inline artefacts). The module also creates an XML metadata file that contains the original metadata for the e-mail header, the attachments and the PDF renditions. Finally, the e-mail export module is capable of packaging all this information into a ZIP file and delivering it to Captiva.
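The final packaging step can be pictured with the following sketch, which bundles an original e-mail, its PDF renditions and an XML metadata file into one ZIP using only the Python standard library. The archive layout, the file names (`metadata.xml`, `original.eml`, `renditions/`) and the XML element names are assumptions chosen for this example; the actual DocShifter package format is not described here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def build_package(subject, sender, original_eml, renditions):
    """Return ZIP bytes holding the original e-mail, PDF renditions and metadata.

    renditions maps a file name (hypothetical, e.g. "body.pdf") to PDF bytes.
    """
    # Describe the e-mail and its renditions in a small XML document.
    root = ET.Element("email")
    ET.SubElement(root, "subject").text = subject
    ET.SubElement(root, "from").text = sender
    listing = ET.SubElement(root, "renditions")
    for name in sorted(renditions):
        ET.SubElement(listing, "rendition", {"file": "renditions/" + name})

    # Write everything into an in-memory ZIP archive.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("metadata.xml", ET.tostring(root, encoding="utf-8"))
        archive.writestr("original.eml", original_eml)
        for name in sorted(renditions):
            archive.writestr("renditions/" + name, renditions[name])
    return buffer.getvalue()


if __name__ == "__main__":
    package = build_package(
        "Invoice", "sender@example.com", b"Subject: Invoice\r\n\r\nHello",
        {"body.pdf": b"%PDF-1.4 body", "invoice.pdf": b"%PDF-1.4 invoice"},
    )
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        print(archive.namelist())
```

Keeping the original e-mail next to its renditions means the receiving side can discard what it does not need without losing the source data.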

Integration options

Those who paid attention will have noticed that it is actually the DocShifter export module that performs the integration with Captiva. This integration is file based: the export module exports a ZIP that is picked up by Captiva. Afterwards, a custom code module performs a batch health check. If this check is positive, non-PDF files are removed from the batch (the original e-mail is not removed, so no data are lost) and metadata are copied to IA values. This way, the batch is ready to be processed by Captiva without any errors.
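The logic of such a health check might look roughly like the sketch below. It is written in Python purely for illustration and is not the actual custom code module: the checks, the file names `metadata.xml` and `original.eml`, and the flat dictionary that stands in for IA values are all assumptions.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

ORIGINAL = "original.eml"   # hypothetical name of the original e-mail
METADATA = "metadata.xml"   # hypothetical name of the metadata file


def health_check(package):
    """Return (ok, files to keep, metadata values) for a ZIP given as bytes."""
    failed = (False, [], {})
    try:
        archive = zipfile.ZipFile(io.BytesIO(package))
    except zipfile.BadZipFile:
        return failed
    with archive:
        names = archive.namelist()
        pdfs = sorted(n for n in names if n.lower().endswith(".pdf"))
        # The batch needs metadata, the original e-mail and at least one PDF.
        if METADATA not in names or ORIGINAL not in names or not pdfs:
            return failed
        # Every rendition must at least start like a PDF file.
        if any(not archive.read(n).startswith(b"%PDF-") for n in pdfs):
            return failed
        try:
            root = ET.fromstring(archive.read(METADATA))
        except ET.ParseError:
            return failed
    # Keep the original e-mail and the PDFs; everything else is dropped.
    values = {child.tag: child.text for child in root if child.text}
    return (True, [ORIGINAL] + pdfs, values)


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(METADATA, "<email><subject>Invoice</subject></email>")
        archive.writestr(ORIGINAL, b"Subject: Invoice\r\n\r\nHello")
        archive.writestr("renditions/body.pdf", b"%PDF-1.4 body")
        archive.writestr("attachments/logo.png", b"\x89PNG")
    print(health_check(buffer.getvalue()))
```

A failed check returns an empty result, so a broken batch can be routed to exception handling instead of failing halfway through processing.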

As Captiva is capable of importing from any source and DocShifter has the ability to export to any destination system, more integration options can be added. Currently we are looking into adding integration via web services and creating a custom DocShifter Captiva module that can be added from the Captiva Developer (Designer) interface. We also expect integration with the new Captiva real time services very soon.

Extra advantages with DocShifter

Besides e-mail processing, DocShifter can also integrate with Captiva in order to perform digital sealing of documents. Digital sealing or electronic sealing (not to be confused with digital signing) is a way to guarantee the origin and integrity of documents by adding a digital seal to your electronic files.

As DocShifter is chainable, the digital sealing module can be integrated after the e-mail processing module and before the export module. In most business cases, however, Captiva merges the PDF renditions of an e-mail, created by DocShifter, into a new PDF with OCR data (a searchable PDF), which requires sealing of its own. The data coming from other input sources such as fax, scan, mobile, … will need to be digitally sealed as well. In order to deliver a sealed document to a business repository, DocShifter therefore has to be configured to perform post-processing after the Captiva flow.

The conversion platform also offers a module that allows advanced compression of PDF files. This module becomes useful when DocShifter is configured to perform Captiva post-processing. Based on customer requirements, it balances the quality of the output PDF against the storage space it requires.

Conclusion

We can state that Captiva’s main quality lies in capturing and analysing information under specific circumstances, while DocShifter, on the other hand, is built to convert almost any file format to any other file format. It does so while handling exceptions, such as files without an extension and corrupt or password protected files, and while adding extra functionality such as digital sealing and hypercompression. Both products complement each other on many levels. Therefore, an integration of both products might just be the solution your company needs. But we’ll leave that up to you.