WebLab Content Exchange

This page describes the model currently proposed in WebLab for exchanging content (i.e. the bytes behind any document processed). This need is especially important at the beginning of any processing chain that consumes documents from an unstructured source of information. In most cases, the documents come in their native format from the original source of information (for instance HTML for most websites). Technically speaking, these are either "raw" files or byte streams, but in any case the content is a simple array of bytes. This kind of content is not well handled over the SOAP protocol used in the rest of WebLab, since SOAP was designed for text-based content and XML in particular. Thus another paradigm was needed.

After testing many different solutions (including byte arrays in XML, SOAP attachments, FTP...), we recommend two simple solutions:

  • simple file exchange over a network shared folder;
  • use of the HTTP protocol and a WebDAV server.

These two solutions reflect two different trade-offs, and their particular advantages and drawbacks are described hereafter.

In both solutions, a Content repository acts as a central component of the platform, allowing storage and exploitation of the content associated with documents. It is similar to the Resource container in that it allows elements to be saved and retrieved. However, these elements may be in any format, depending on the type of documents collected by the platform, so the storage process must be adapted accordingly. Moreover, this content is not exchanged over the standard SOAP protocol, which is optimised for text rather than raw, heterogeneous binary content.

What content exactly?

In the platform, we have identified three major kinds of content that can be associated with a WebLab resource. They reflect the possible uses of content within a classic information processing chain and in the exploitation of documents.

  • Native content: the original file associated with a document is called its native content. It should be linked to the WebLab document in XML by setting the hasNativeContent property to its URL (the URL being valid and accessible by any service).
  • Normalised content: one of the first steps in a processing chain can be the conversion of the native content into a normalised version, in order to abstract from the various formats that may be used to encode documents (for instance the multiple compressed audio formats). This newly created normalised content can be linked to the WebLab document in XML by setting the hasNormalisedContent property to its URL (the URL being valid and accessible by any service).
  • Exposed content: the final user of the application may want to access the document in one of these formats (the native one, or another format specific to the UI, for instance for videos). However, given the particular network infrastructure where the application is deployed, and given that the user may access the application remotely, access to the native and/or normalised content may be problematic (network access restrictions...). To overcome these difficulties, a specific service can be in charge of setting up an exposed content linked to the document. This content is linked to the WebLab document in XML by setting the hasExposedContent property to its URL. However, this URL will be specific so that it is reachable from the user side. It may also be only a partial URL, leaving the UI to build the full link properly.


Current solutions

File exchange over network shared folder

The simplest solution is to rely on a standard shared file system. In that case the content URL should follow the file URI scheme.

"pros":

  • ease of use: nothing needs to change on the service side, since content is accessible as simple files.
  • efficiency: using the OS native file system for sharing files (possibly across a network) is expected to be efficient (no detailed benchmark has been run, though).

"cons":

  • only possible on a local network: this solution does not easily allow content to be shared through the web (i.e. when the system is distributed) and is thus only possible for installations located on the same network.
  • rights management: access rights to content may need to go through OS user rights, which could be complex to synchronise with application-level user rights (no complete tests so far).
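
As an illustration of this first solution, here is a minimal sketch of how a service could store content in the shared folder and read it back through its file URI, using only the standard java.nio.file API. The class name FileContentExample and its two helper methods are hypothetical and not part of WebLab, and the sketch assumes the shared folder is mounted at the same path on every machine.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the WebLab API.
final class FileContentExample {

    /** Copies raw bytes into the shared folder and returns the file: URI to record on the document. */
    static URI storeContent(InputStream input, Path sharedFolder, String fileName) throws IOException {
        Path target = sharedFolder.resolve(fileName);
        // Fails if the target already exists, so existing content is never silently overwritten.
        Files.copy(input, target);
        return target.toUri();
    }

    /** Opens the content behind a file: URI as a stream of raw bytes. */
    static InputStream openContent(URI contentUri) throws IOException {
        if (!"file".equalsIgnoreCase(contentUri.getScheme())) {
            throw new IllegalArgumentException("Expected a file: URI but got " + contentUri);
        }
        return Files.newInputStream(Paths.get(contentUri));
    }
}
```

The URI returned by storeContent is the value that would be recorded on the WebLab document, for instance as its native content URL.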

HTTP and WebDAV server

The WebDAV solution relies on a central server, accessible from any service in the platform, which provides access to content through HTTP (standard HTTP for reading content and the WebDAV extension for writing content). In that case the content URL should follow the http URI scheme.

"pros":

  • staying on HTTP (as SOAP already uses it): since the HTTP protocol is used, only port 80 needs to be opened by services (or 443 for HTTPS). This port should already be configured for SOAP exchanges, so no extra set-up is needed.
  • rights managed on the WebDAV server: WebDAV is meant to handle access rights correctly.

"cons":

  • less efficient: the complexity added by the HTTP layer is expected to make exchanges less efficient (no detailed benchmark has been run, though).
  • need for access to the WebDAV server: the server must be reachable from every service in the platform, which may require setting up a particular DNS entry or network gateway, or defining the host name on each physical server.
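
As an illustration of this second solution, the following sketch reads content with a plain HTTP GET and writes it with an HTTP PUT, which WebDAV servers accept for creating or replacing a resource. It uses the java.net.http client available since Java 11. The class name HttpContentExample and its methods are hypothetical and not part of WebLab, the accepted status codes reflect typical server behaviour rather than a tested WebLab deployment, and authentication is left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of the WebLab API.
final class HttpContentExample {

    /** Builds the plain HTTP GET used to read content. */
    static HttpRequest readRequest(URI contentUri) {
        return HttpRequest.newBuilder(contentUri).GET().build();
    }

    /** Builds the HTTP PUT used to write content to the WebDAV server. */
    static HttpRequest writeRequest(URI contentUri, byte[] content) {
        return HttpRequest.newBuilder(contentUri)
                .header("Content-Type", "application/octet-stream")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(content))
                .build();
    }

    /** Reads the content as a stream of raw bytes; the caller must close the stream. */
    static InputStream read(HttpClient client, URI contentUri) throws IOException, InterruptedException {
        HttpResponse<InputStream> response =
                client.send(readRequest(contentUri), HttpResponse.BodyHandlers.ofInputStream());
        if (response.statusCode() != 200) {
            response.body().close();
            throw new IOException("GET " + contentUri + " returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    /** Writes the content, accepting the status codes usually returned for a created or replaced resource. */
    static void write(HttpClient client, URI contentUri, byte[] content) throws IOException, InterruptedException {
        HttpResponse<Void> response =
                client.send(writeRequest(contentUri, content), HttpResponse.BodyHandlers.discarding());
        int status = response.statusCode();
        if (status != 200 && status != 201 && status != 204) {
            throw new IOException("PUT " + contentUri + " returned HTTP " + status);
        }
    }
}
```

A service would typically create one client with HttpClient.newHttpClient() and reuse it for all reads and writes.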

A dedicated CRUD component

To define your own CRUD content manager, extend org.ow2.weblab.content.api.ContentManagerAdapter (or implement the org.ow2.weblab.content.api.ContentManagerInterface interface) and provide the following methods:

public URI create(final InputStream input, final Resource resource, final Map<String, Object> parameters)

public InputStream read(final URI uri, final Resource resource, final Map<String, Object> parameters)

public URI update(final InputStream input, final URI uri, final Resource resource, final Map<String, Object> parameters)

public boolean delete(final URI uri, final Resource resource, final Map<String, Object> parameters)
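
To make the contract more concrete, here is a minimal sketch of a folder-backed content manager that mirrors the four signatures above. It is an assumption-laden illustration, not the WebLab implementation: so that it compiles on its own, the class does not extend ContentManagerAdapter, the nested Resource class is only a placeholder for the WebLab model class, and the name FolderContentManager, the ".bin" file naming, and the use of UncheckedIOException are all choices made for this example. The resource and parameters arguments are ignored here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch; a real manager would extend ContentManagerAdapter instead.
final class FolderContentManager {

    /** Placeholder for the WebLab Resource model class, only here to keep the sketch self-contained. */
    static final class Resource { }

    private final Path folder;

    FolderContentManager(Path folder) {
        this.folder = folder;
    }

    /** Stores the bytes in a new, uniquely named file and returns its file: URI. */
    public URI create(final InputStream input, final Resource resource, final Map<String, Object> parameters) {
        Path target = folder.resolve(UUID.randomUUID() + ".bin");
        try {
            Files.copy(input, target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target.toUri();
    }

    /** Opens the stored bytes; the caller must close the stream. */
    public InputStream read(final URI uri, final Resource resource, final Map<String, Object> parameters) {
        try {
            return Files.newInputStream(Paths.get(uri));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Replaces the stored bytes and keeps the same URI. */
    public URI update(final InputStream input, final URI uri, final Resource resource, final Map<String, Object> parameters) {
        try {
            Files.copy(input, Paths.get(uri), StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return uri;
    }

    /** Deletes the stored bytes; returns false when nothing was stored at that URI. */
    public boolean delete(final URI uri, final Resource resource, final Map<String, Object> parameters) {
        try {
            return Files.deleteIfExists(Paths.get(uri));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A WebDAV-backed manager could follow the same shape, with create and update issuing HTTP PUT requests and read issuing HTTP GET requests.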