WebLab Content Exchange
This page describes the current model proposed in WebLab for exchanging content (i.e. the bytes behind any document being processed). This need is especially important at the beginning of any processing chain that consumes documents from an unstructured source of information. In most cases, the documents come from the original source in their native format (for instance HTML for most websites). Technically speaking, these are either "raw" files or byte streams, but in any case such content is a simple array of bytes. This kind of content is not well handled over the SOAP protocol, used in the rest of WebLab, since SOAP was designed for text-based content, and in particular XML. Thus another paradigm was needed.
After testing many different solutions (including byte arrays in XML, SOAP attachments, FTP...), we now recommend two simple ones:
- simple file exchange over a network shared folder;
- use of the HTTP protocol with a WebDAV server.
These two solutions reflect two different trade-offs; their respective advantages and drawbacks are described hereafter.
In both solutions, a Content repository acts as a central component of the platform, allowing the storage and exploitation of the content associated with documents. It is similar to the Resource container in that it allows elements to be saved and retrieved. However, this content may come in any format, depending on the type of documents collected by the platform, so the storage process must be adapted accordingly. Moreover, the exchange of this content is not done over the standard SOAP protocol, which is optimized for text rather than raw, heterogeneous binary content.
What content exactly?
In the platform, we identified three major kinds of content that can be associated with a WebLab resource. They reflect the possible uses of content within a classic information processing chain and the exploitation of documents.
- Native content: the original file associated with a document is its native content. It is linked to the WebLab document in XML through the hasNativeContent property, which carries its URL (the URL must be valid and accessible by any service).
- Normalised content: one of the first steps in a processing chain is to convert the native content into a normalised version, in order to abstract away the potentially numerous formats that can be used to encode documents (for instance the many compressed audio formats). This newly created normalised content is linked to the WebLab document in XML through the hasNormalisedContent property, which carries its URL (the URL must be valid and accessible by any service).
- Exposed content: the final user of the application may want to access the document in its native format. However, given the particular network infrastructure where the application is deployed, and considering that the user may access the application remotely, access to the native and/or normalised content may be problematic (network access restrictions...). To overcome these difficulties, a specific service can be in charge of setting up an exposed content linked to the document. This particular content is linked to the WebLab document in XML through the hasExposedContent property, which carries its URL. This URL, however, is specifically built so that the content is reachable from the user side.
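As an illustration, a WebLab document carrying the three properties could look like the fragment below. This is a hypothetical sketch: only the three property names come from the text above; the element layout, attribute names, namespaces and URLs are illustrative and do not reproduce the actual WebLab schema.

```xml
<!-- Illustrative only: a real WebLab document uses the platform's own
     schema and namespaces; only the three property names are from the text. -->
<document uri="weblab://mySource/doc-0001">
  <hasNativeContent>file:///weblab/content/native/doc-0001.html</hasNativeContent>
  <hasNormalisedContent>file:///weblab/content/normalised/doc-0001.txt</hasNormalisedContent>
  <hasExposedContent>http://portal.example.org/content/doc-0001.html</hasExposedContent>
</document>
```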
Network shared folder
The simplest solution is to rely on a standard shared file system. In that case the content URL should follow the file URI scheme.
- ease of use: nothing changes on the service side, since the content is accessible as a simple file.
- efficiency: using the OS-native file system for sharing files (possibly over a network) is obviously efficient (no detailed benchmark, though).
- only possible on a local network: this solution does not easily allow content to be shared across the web (i.e. when the system is distributed) and is thus only possible for installations that sit on the same network.
- rights management: access rights to the content may need to go through OS user rights, which could be complex to synchronize with application-level user rights (no complete tests so far).
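As a minimal sketch of this first solution, the snippet below writes a native content into a shared folder and hands out its file URL; a consumer then resolves the URL back to a local path and reads the bytes. All paths and file names are illustrative assumptions; in a real deployment the folder would be a network mount visible under the same path from every service.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

# Hypothetical shared folder; in a real deployment this would be a
# network mount visible to every WebLab service under the same path.
shared = Path(tempfile.mkdtemp())

# "Storage" side: write the native content and build its file:// URL,
# which would be stored in the document's hasNativeContent property.
content_path = shared / "doc-0001.html"
content_path.write_bytes(b"<html><body>native content</body></html>")
content_url = content_path.as_uri()  # e.g. file:///tmp/.../doc-0001.html

# "Consumer" side: any service resolves the URL back to a local path.
parsed = urlparse(content_url)
assert parsed.scheme == "file"
local_path = Path(url2pathname(parsed.path))
data = local_path.read_bytes()
```

The only contract between services is the URL itself: as long as the shared folder is mounted identically everywhere, the file URI scheme is enough to locate the content.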
HTTP and WebDAV server
The WebDAV solution relies on a central server, accessible from any service in the platform, which provides access to content over HTTP (standard HTTP for reading content and the WebDAV extension for writing it). In that case the content URL should follow the http URI scheme.
- keeping to HTTP (as SOAP is already used): since the HTTP protocol is used, only port 80 (or 443 for HTTPS) needs to be open for the services. As this port should already be configured for SOAP exchanges, no extra network configuration is required.
- rights managed on the WebDAV server: WebDAV is designed to handle access rights properly.
- less efficient: the overhead added by the HTTP layer is likely to make exchanges less efficient (no detailed benchmark, though).
- need access to the WebDAV server: the server must be reachable from any service in the platform, which could require setting up a dedicated DNS entry, a network gateway, or a host-name definition on each physical server.
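The read/write pattern of this second solution can be sketched as follows. A minimal in-process handler stands in for the content repository, implementing only the two verbs the exchange model needs: PUT (the WebDAV write verb) to store content and plain GET to read it back. This is an assumption-laden stand-in, not a real WebDAV server; an actual deployment would use a full WebDAV implementation (e.g. Apache mod_dav) and the URLs would point at the central repository.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

# In-memory store standing in for the content repository's backend.
store = {}

class ContentHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        # WebDAV write: store the request body under the request path.
        length = int(self.headers["Content-Length"])
        store[self.path] = self.rfile.read(length)
        self.send_response(201)
        self.end_headers()

    def do_GET(self):
        # Plain HTTP read: return the stored bytes, or 404 if unknown.
        body = store.get(self.path)
        if body is None:
            self.send_response(404)
            self.end_headers()
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

server = HTTPServer(("127.0.0.1", 0), ContentHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_address[1]}"

# Writing content: HTTP PUT, as a WebDAV client would do.
url = f"{base}/content/doc-0001.bin"
req = Request(url, data=b"raw document bytes", method="PUT")
urlopen(req).close()

# Reading content: a plain HTTP GET from any service in the platform.
data = urlopen(url).read()

server.shutdown()
```

The URL stored in the document (hasNativeContent, hasNormalisedContent or hasExposedContent) is all a consuming service needs: reading is an ordinary GET, so any HTTP client library works.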
A dedicated component
ContentManager: an OW2 component is available.