WebLab Content Exchange
This page describes the model currently proposed in WebLab for exchanging content (i.e. the bytes behind any processed document). This need is especially important at the beginning of any processing chain that consumes documents from an unstructured source of information. In most cases, the documents come in their native format from the original source (for instance, HTML for most websites). Technically speaking, they are either "raw" files or byte streams, but in every case the content is a simple array of bytes. This kind of content is not well handled over the SOAP protocol used in the rest of WebLab, since SOAP was designed for text-based content, XML in particular. Another paradigm was therefore needed.
After testing many different solutions (including byte arrays in XML, SOAP attachments, FTP...), we now recommend two simple solutions:
- simple file exchange over a network shared folder;
- use of the HTTP protocol and a WebDAV server.
These two solutions reflect two different compromises, and their respective advantages and drawbacks are described hereafter.
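As an illustration of why carrying raw bytes inside a text-based envelope is awkward, the sketch below measures the size cost of Base64, the usual way to embed bytes in XML. The source does not say which encoding was tested or why that option was set aside, so this is only one well-known cost, not the authors' stated reason. The function name `base64_overhead` is hypothetical.

```python
import base64


def base64_overhead(payload: bytes) -> float:
    """Return the ratio between the Base64 text size and the raw size."""
    if not payload:
        raise ValueError("payload must not be empty")
    encoded = base64.b64encode(payload)
    return len(encoded) / len(payload)


# Arbitrary sample bytes standing in for a native document (assumption).
raw = bytes(range(256)) * 12
print(f"{len(raw)} raw bytes, Base64 ratio: {base64_overhead(raw):.2f}")
```

Base64 maps every 3 bytes to 4 characters, so the encoded form is about one third larger than the original, before any XML parsing cost is counted.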
What content exactly?
In the platform, we identified three major kinds of content that can be associated with a WebLab resource. These reflect the possible uses of content within a classic information processing chain and in the exploitation of documents.
Network shared folder
The simplest solution is to rely on a standard shared file system.
Advantages:
- ease of use: nothing has to change on the service side, since content is accessible as simple files.
- efficiency: using the OS native file system to share files (possibly across a network) should be efficient, although no detailed benchmark has been run.
Drawbacks:
- only possible on a local network: this solution does not easily allow content to be shared over the web (i.e. when the system is distributed) and is thus only possible for installations that are on the same network.
- rights management: access rights to content may need to go through OS user rights, which could be complex to synchronize with application-level user rights (not fully tested so far).
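The shared folder approach can be sketched in a few lines: a service writes the raw bytes under a folder that every other service has mounted, and readers open the same path. This is a minimal sketch under stated assumptions, not the WebLab API: the helper names `write_content` and `read_content`, the `resource_id` naming scheme, and the folder layout are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_content(shared_root: Path, resource_id: str, data: bytes) -> Path:
    """Store raw bytes under the shared folder and return the file path."""
    target = shared_root / resource_id
    # Create intermediate folders so nested identifiers work.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target


def read_content(shared_root: Path, resource_id: str) -> bytes:
    """Read back the raw bytes stored for a resource."""
    return (shared_root / resource_id).read_bytes()


if __name__ == "__main__":
    # A temporary directory stands in for the network share (assumption).
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_content(root, "crawl/page-001.html", b"<html>...</html>")
        print(read_content(root, "crawl/page-001.html"))
```

On a real installation, `shared_root` would point to the mount point of the network share on each machine. Access control is left entirely to the OS here, which matches the rights-management drawback listed above.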
HTTP and WebDAV server
The WebDAV solution relies on a central server that is accessible from any service in the platform and provides access to content through HTTP (standard HTTP for reading content and the WebDAV extensions for writing content).
Advantages:
- staying on HTTP (which SOAP already uses): since the HTTP protocol is used, services only need port 80 open (or 443 for HTTPS). This port should already be configured for SOAP exchanges, so little extra setup is needed.
- rights managed on the WebDAV server: WebDAV is meant to handle access rights properly.
Drawbacks:
- less efficient: the complexity added by the HTTP layer is expected to make exchanges less efficient, although no detailed benchmark has been run.
- need for access to the WebDAV server: the server must be reachable from every service in the platform, which could require setting up a specific DNS entry or network gateway, or defining the host name on each physical server.
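The HTTP side of this solution can be sketched with the Python standard library: a GET request reads content and a PUT request, which WebDAV-enabled servers accept, writes it. This is only an illustration under stated assumptions: the server URL `http://dav.example.org/content`, the helper names, and the `resource_id` scheme are hypothetical, and authentication is omitted.

```python
import urllib.request
from urllib.parse import quote


def content_url(base_url: str, resource_id: str) -> str:
    """Build the URL of a resource, escaping unsafe characters in its id."""
    return base_url.rstrip("/") + "/" + quote(resource_id)


def build_put(base_url: str, resource_id: str, data: bytes) -> urllib.request.Request:
    """Prepare the PUT request that uploads raw bytes."""
    return urllib.request.Request(
        content_url(base_url, resource_id),
        data=data,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


def build_get(base_url: str, resource_id: str) -> urllib.request.Request:
    """Prepare the plain HTTP GET request that reads the content back."""
    return urllib.request.Request(content_url(base_url, resource_id), method="GET")


def upload(base_url: str, resource_id: str, data: bytes) -> int:
    """Send the PUT request and return the HTTP status code."""
    with urllib.request.urlopen(build_put(base_url, resource_id, data)) as response:
        return response.status


def download(base_url: str, resource_id: str) -> bytes:
    """Send the GET request and return the raw bytes."""
    with urllib.request.urlopen(build_get(base_url, resource_id)) as response:
        return response.read()
```

A service would call, for example, `upload("http://dav.example.org/content", "crawl/page-001.html", data)` and another service would later call `download` with the same arguments. On a WebDAV server the parent collection must already exist (it can be created with a MKCOL request), otherwise the PUT is rejected.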