Camel-WebLab 2.0.3

The Camel-WebLab component provides:

  • access to the WebLab Document factory, to create a WebLab Document from messages produced by other Camel components.
  • access to the OSGi registry, allowing documents to be processed by WebLab services referenced as OSGi services.
  • access to CRUD methods on contents managed by a content manager.
  • a WARC splitter and an aggregation strategy to merge WebLab resources.
  • some custom types that can be used to convert date metadata.

The Camel-WebLab component is used in the ESB.


WebLab Resource Factory

This component is only available as a Camel Producer. It allows the user to create a WebLab resource from messages produced by other Camel components. When possible, the body of the input message is used to set the content of the newly created resource. The message body types accepted by the producer as valid content are 'ByteArrayOutputStream', 'String', 'InputStream' and 'File', or a bean annotated with WebLab annotations.


URI Format

weblab:create[?options]

or

weblab://create[?options]

You can append query options to the URI in the following format, ?option=value&option=value&...
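
The option format above can be sketched in plain Java. This is only an illustration: the class WebLabUriBuilder and its two methods are hypothetical helpers, not part of Camel-WebLab. It also shows why the XML routes on this page write `&amp;` between options: inside an XML attribute the `&` separator has to be escaped.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, for illustration only (not part of Camel-WebLab).
final class WebLabUriBuilder {

    /** Appends options to a base endpoint URI as ?option=value&option=value... */
    static String withOptions(String baseUri, Map<String, String> options) {
        if (options.isEmpty()) {
            return baseUri;
        }
        StringJoiner query = new StringJoiner("&", "?", "");
        options.forEach((name, value) -> query.add(name + "=" + value));
        return baseUri + query;
    }

    /** Escapes '&' so the URI can be pasted into an XML attribute such as uri="...". */
    static String forXmlAttribute(String uri) {
        return uri.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("type", "Text");
        options.put("outputMethod", "xml");

        String uri = withOptions("weblab://create", options);
        System.out.println(uri);                  // weblab://create?type=Text&outputMethod=xml
        System.out.println(forXmlAttribute(uri)); // weblab://create?type=Text&amp;outputMethod=xml
    }
}
```

The sketch only escapes `&`; option values containing other XML-special characters would need further escaping.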


URI Options

Producer

Name | Default Value | Description
type | null | REQUIRED. The WebLab type of the produced resource. The expected value is a compatible WebLab resource class (Document, Text, Image, Audio, Video, Resource, Annotation, PieceOfKnowledge). Choose the type to match the content type of the input message: the producer does no content verification, so, for example, an image (byte array) can be set as the native content of a Text resource without error.
writeContent | false | Write the message body content as the unit's embedded content. By default, the producer writes the input message body as native content: the ContentManager stores the body and the hasNativeContent URI is added to the output WebLab resource as an annotation. If writeContent is set to true, the body is instead added directly to the content part of the created WebLab MediaUnit.
outputMethod | null | By default, the producer returns the created WebLab resource as a Java object. When outputMethod is set to XML, the WebLab resource is returned as an XML document.
streaming | false | When the output method is set to XML, the producer marshals the created WebLab resource into a String by default. When streaming is set to true, the producer marshals it into a stream instead.
filterNonXMLChars | true | When the type option is set to Text, the producer skips non-XML characters (see XMLChars) in the text content by default.
documentWrapped | false | When documentWrapped is set to true and the created resource is a Media Unit, the Media Unit is wrapped in a WebLab Document instead of being returned on its own. In all cases, the annotations describe the created Media Unit, not the wrapping Document.
prefixMap | A map initialized with the prefixes wlm, wlp, rdf, rdfs, dc and dct, corresponding respectively to the model, processing, rdf, rdfs, dublinCore elements and dublinCore terms namespaces | When a custom property is used to annotate a resource, the corresponding prefix must be defined in a map and passed to the WebLab producer through this option.


Message Headers

Name | Description
weblab:prefix:property[@lang=isoCode] | Annotates the created WebLab resource with the property defined in the header. The complete property URI is defined by the prefix and property parts of the header name. The header value is used as the property value. If the value is language-dependent, the optional @lang part of the header name sets the annotation language attribute (ISO 639 codes). The default WebLab annotators are used to annotate the resource. The predefined prefixes are wlm, wlp, rdf, rdfs, dc and dct, corresponding respectively to the model, processing, rdf, rdfs, dublinCore elements and dublinCore terms namespaces. All other prefixes have to be defined and passed to the WebLab producer with the 'prefixMap' option. Note that the class of the header value object is used to set the rdf datatype of the produced RDF triples (for example, the Date class sets the rdf datatype to XMLSchema#dateTime).
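
To make the header naming convention concrete, here is a minimal plain-Java sketch that splits a header name into its prefix, property and optional language parts, then resolves the prefix against a prefix map. The class WebLabHeaderName is hypothetical and is not the component's own code. Building the property URI by concatenating namespace and property name is an assumption. The two namespaces are the ones used in the prefix map example on this page.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the header naming convention (not the component's code).
final class WebLabHeaderName {

    // weblab:<prefix>:<property> with an optional @lang=<isoCode> suffix
    private static final Pattern NAME =
            Pattern.compile("weblab:([^:@]+):([^:@]+)(?:@lang=([A-Za-z-]+))?");

    final String propertyUri; // namespace + property (assumed concatenation)
    final String lang;        // null when the header name has no @lang part

    private WebLabHeaderName(String propertyUri, String lang) {
        this.propertyUri = propertyUri;
        this.lang = lang;
    }

    static WebLabHeaderName parse(String headerName, Map<String, String> prefixMap) {
        Matcher m = NAME.matcher(headerName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a WebLab annotation header: " + headerName);
        }
        String namespace = prefixMap.get(m.group(1));
        if (namespace == null) {
            throw new IllegalArgumentException("Unknown prefix: " + m.group(1));
        }
        return new WebLabHeaderName(namespace + m.group(2), m.group(3));
    }

    public static void main(String[] args) {
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("dc", "http://purl.org/dc/elements/1.1/");
        prefixes.put("myOntoPrefix", "http://mySpecialOntology#");

        WebLabHeaderName title = parse("weblab:dc:title@lang=en", prefixes);
        // prints: http://purl.org/dc/elements/1.1/title lang=en
        System.out.println(title.propertyUri + " lang=" + title.lang);
    }
}
```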


Create a WebLab Text Resource

The WebLab producer creates a WebLab Text resource Java object as the exchange body:

      <!-- Set the body with text. This text will be used by the producer to create the resource content -->
      <setBody>
         <simple>This is the text content of my resource.</simple>
      </setBody>
       				
      <!-- To produce a Text with content embedded from body as a string -->
      <to uri="weblab://create?type=Text"/>
      
      <!-- Now the body is a WebLab Text java object -->


To create an XML WebLab Text resource inside a Camel route

The WebLab producer creates the XML representation of a WebLab Text resource as the exchange body:

      <!-- Set the body with text. This text will be used by the producer to create the resource content -->
      <setBody>
         <simple>This is the text content of my resource.</simple>
      </setBody>
       				
      <!-- To produce a Text with content embedded from body as a string -->
      <to uri="weblab://create?type=Text&amp;outputMethod=xml"/>
      
      <!-- Now the body is an XML WebLab Text -->


To create an annotated XML WebLab resource inside a Camel route

The WebLab producer creates the XML representation of an annotated WebLab Text resource as the exchange body. Here, the Text is annotated with a creation date (dct:created) and an English title (dc:title):

      <!-- Set the body with text. This text will be used by the producer to create the resource content -->
      <setBody>
         <simple>This is the text content of my resource.</simple>
      </setBody>

      <!-- Set properties used by the producer to annotate the created resource -->
      <setHeader headerName="weblab:dct:created">
         <simple resultType="java.util.Date">${date:now:yyyy-MM-dd'T'HH:mm:ss}</simple>
      </setHeader>
      <setHeader headerName="weblab:dc:title@lang=en">
     	 <simple>The title of my resource</simple>
      </setHeader>
 				
      <!-- To produce a Text with content embedded from body and annotations from headers -->
      <to uri="weblab://create?type=Text&amp;outputMethod=xml"/>
      
      <!-- Now the body is an XML WebLab Text -->


To create an annotated WebLab resource inside a Camel route with custom properties

The WebLab producer creates an annotated WebLab resource with content from the exchange body and annotations from headers. Here, the Document is annotated with an English title (dc:title) and a French value for a custom property (myOntoPrefix:mySpecialProperty).

Define a prefix map bean outside of the Camel context (in the Spring context):

      <util:map id="myPrefixMap">
	<!-- standard ontologies prefix (if needed) -->
	<entry key="dc" value="http://purl.org/dc/elements/1.1/"/>
		
	<!-- custom ontologies prefix -->
	<entry key="myOntoPrefix" value="http://mySpecialOntology#"/>
      </util:map>

Call the producer inside a route by referencing the defined prefix map:

      <!-- Set the body with text. This text will be used by the producer to create the resource native content -->
      <setBody>
         <simple>This is the text content of my resource.</simple>
      </setBody>

      <!-- Set properties used by the producer to annotate the created resource -->
      <setHeader headerName="weblab:dc:title@lang=en">
         <constant>The title of my resource</constant>
      </setHeader>
      <setHeader headerName="weblab:myOntoPrefix:mySpecialProperty@lang=fr">
	 <constant>ma valeur</constant>
      </setHeader>
 				
      <!-- To produce an annotated Document with native content from body and annotations from headers -->
      <to uri="weblab://create?type=Document&amp;prefixMap=#myPrefixMap" />


Bean support

Camel-WebLab can use WebLab annotations to create or enrich a document.

WebLab Service Registry

This component is only available as a Camel Producer. It allows the user to call a WebLab service defined as an OSGi service reference. It therefore supports distributed services: you can add and remove WebLab services at any time and reuse already deployed services in your own new routes.

URI Format

weblab:service-interface:service-name[?options]

or

weblab://service-interface:service-name[?options]

Where service-interface represents one of the supported WebLab Service Interfaces:

  • analyser
  • indexer
  • resourceContainer
  • trainable
  • cleanable

and where service-name corresponds to the OSGi service name property of the WebLab service to request.

You can append query options to the URI in the following format, ?option=value&option=value&...

URI Options

Producer

Name | Default Value | Description
operation | null | The service operation to call. Optional for interfaces with only one operation (analyser, indexer, cleanable) but required otherwise (trainable, resourceContainer). For example, to call the loadResource method of the resourceContainer, add operation=loadResource to the endpoint URI.
filter | null | The OSGi filter used to search for the service in the OSGi registry. By default, the filter is (name=service-name). The filter must be URL-encoded; see how to select a WebLab service with an OSGi filter for more details about its usage.
type | null | By default, the producer searches for service-name in the OSGi service registry. If type=direct, service-name is not looked up in the OSGi registry, the filter option is ignored, and service-name is called as a Camel endpoint. See how to call a WebLab service directly for more details about its usage.

Register and call a WebLab service

WebLab services must be registered in order to be available from the WebLab endpoint.

To register a WebLab service named my-service with the address http://localhost:8181/my-service/analyser:

	<weblab:analyser id="my-service" address="http://localhost:8181/my-service/analyser" />

To call the previous WebLab service from any Camel route, you MUST refer to its service name (my-service):

	<!-- a timer will call the route only once -->
	<from uri="timer:callOnlyOnce?repeatCount=1" />

	<!-- we create an empty document -->
	<to uri="weblab://create?type=Document" />

	<!-- we send it to the OSGi service whose name property is 'my-service' -->
	<to uri="weblab:analyser:my-service" />

Do not forget to add the following XSD definitions:

xmlns:weblab="http://weblab.ow2.org/schema/spring"
       	 xsi:schemaLocation="http://weblab.ow2.org/schema/spring 
       	        	     http://weblab.ow2.org/schema/spring/weblab-spring.xsd"

More details and examples are given in the tutorial on how to create a WebLab processing chain

Use OSGi filters to find and to call a WebLab service

The option filter allows you to define filters instead of using the service name property.

When you define a WebLab Service, you can add OSGi properties to use in Service registry filters:

	<weblab:analyser id="my-service" address="http://localhost:8181/my-service/analyser">
		<weblab:osgiProperties>
    			<entry key="classification" value="named-entity-extraction"/>
  		</weblab:osgiProperties>
	</weblab:analyser>


The following example selects a service whose classification property is defined as named-entity-extraction:

	<from uri="direct:start" />
	<to uri="weblab://create?type=Document" />
	<to uri="weblab:analyser?filter=(classification=named-entity-extraction)" />

You can use the OSGi filter syntax to define more complex filters, for example to search for a service by both its classification and its name. In that case, the filter value must be URL-encoded in the URI (here, & is written %26):

	<from uri="direct:start" />
	<to uri="weblab://create?type=Document" />
	<to uri="weblab:analyser?filter=(%26(name=service-gate)(classification=named-entity-extraction))" />
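
When the filter is built in code, the encoding can be done with a small helper. The sketch below is a hypothetical helper, not part of Camel-WebLab. It encodes `&` as in the example above, and additionally encodes a literal `%`, which is an assumption based on standard percent-encoding rather than on this page.

```java
// Hypothetical helper, for illustration only (not part of Camel-WebLab).
final class OsgiFilterEncoder {

    /** Percent-encodes the characters that would otherwise be read as URI query syntax. */
    static String encode(String filter) {
        return filter
                .replace("%", "%25")  // escape the escape character first
                .replace("&", "%26"); // '&' would otherwise start a new query option
    }

    public static void main(String[] args) {
        String filter = "(&(name=service-gate)(classification=named-entity-extraction))";
        // prints: weblab:analyser?filter=(%26(name=service-gate)(classification=named-entity-extraction))
        System.out.println("weblab:analyser?filter=" + encode(filter));
    }
}
```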

Request a WebLab service directly (without an OSGi service name or filter)

You may want to call a WebLab service without sharing it as an OSGi service, for example: a new service for integration test purposes.

The option type=direct allows you to call a WebLab service without first registering it as an OSGi service reference. When using type=direct, the service-name must be a URI describing the Camel endpoint of the service (for example: cxf:bean:my-cxf-endpoint).

E.g. to define the use of a Gazetteer service exposed on a Tomcat server, you need to define a CXF Endpoint and refer to it using a Camel URI (cxf:bean:gazetteer):

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:camel="http://camel.apache.org/schema/spring"
	xmlns:cxf="http://camel.apache.org/schema/cxf" xmlns:osgi="http://www.springframework.org/schema/osgi"
	xsi:schemaLocation="http://www.springframework.org/schema/beans 
         http://www.springframework.org/schema/beans/spring-beans-3.0.xsd 
         http://camel.apache.org/schema/spring 
         http://camel.apache.org/schema/spring/camel-spring.xsd
 	 http://www.springframework.org/schema/osgi  
       	 http://www.springframework.org/schema/osgi/spring-osgi.xsd
         http://camel.apache.org/schema/dataFormats 
         http://camel.apache.org/schema/spring/camel-spring.xsd
         http://cxf.apache.org/jaxws
         http://cxf.apache.org/schemas/jaxws.xsd
         http://camel.apache.org/schema/cxf 
         http://camel.apache.org/schema/cxf/camel-cxf.xsd">

	<!-- Define the CXF Endpoint of a Gazetteer service -->
	<cxf:cxfEndpoint id="gazetteer"
		address="http://localhost:8181/simple-gazetteer/simple-gazetteer"
		serviceClass="org.ow2.weblab.core.services.Analyser">
		<cxf:properties>
			<entry key="dataFormat" value="PAYLOAD" />
			<entry key="allowStreaming" value="true" />
		</cxf:properties>
	</cxf:cxfEndpoint>

	<!-- Create a Camel Context -->
	<camelContext id="DirectWebLabServiceCallCamelContext"  xmlns="http://camel.apache.org/schema/spring">

		<!-- With a Route -->
		<route id="DirectWebLabServiceCallRoute" streamCache="true" autoStartup="true">
			
			<!-- call it only once -->
			<from uri="timer:callOnlyOnce?repeatCount=1" />

			<!-- create an empty WebLab Document --> 
			<to uri="weblab://create?type=Document" />

			<!-- Call the WebLab endpoint with option type=direct referencing the Camel Endpoint URI: cxf:bean:gazetteer -->
			<to uri="weblab:analyser:cxf:bean:gazetteer?type=direct" />

			<!-- Final message -->
			<log message="Analyser Service simple-gazetteer processed the document"/>
		</route>

	</camelContext>

</beans>

Call a WebLab resourceContainer to save a resource

A resource container service endpoint must be called with a resource as message body (XML). The response body is the resourceURI.


E.g. to create a resource and save it by calling a ResourceContainer service:

<?xml version="1.0" encoding="UTF-8"?>

...
<!-- Save a WebLab resource -->
<camelContext id="ContainerCamelContext"  xmlns="http://camel.apache.org/schema/spring">

	<!-- With a Route -->
	<route id="DirectSaveResourceRoute" streamCache="true" autoStartup="true">
			
		<!-- call it only once -->
		<from uri="timer:callOnlyOnce?repeatCount=1" />
                
                <setBody>
                    <constant>Hello</constant>
                </setBody>

                <!-- To produce a Document with a Text with content embedded from body as a string -->
		<to uri="weblab://create?type=Text&amp;writeContent=true&amp;documentWrapped=true"/>

		<!-- save the resource to a ResourceContainer service --> 
		<to uri="weblab://container:myContainerService?operation=saveResource" />

		<!-- Final message -->
		<log message="Resource saved."/>
	</route>

</camelContext>
...


Call a WebLab resourceContainer to load a resource

A resource container service endpoint must be called with a resource URI as message body. The response body is the XML content of the corresponding resource.


E.g. to call a ResourceContainer service chained to an analyser service

<?xml version="1.0" encoding="UTF-8"?>

...
<!-- Load a WebLab resource -->
<camelContext id="ContainerCamelContext"  xmlns="http://camel.apache.org/schema/spring">

	<!-- With a Route -->
	<route id="DirectGetResourceAndCallANalyserRoute" streamCache="true" autoStartup="true">
			
		<!-- call it only once -->
		<from uri="timer:callOnlyOnce?repeatCount=1" />

                <!-- setting resource URI -->
                <setBody>
                    <constant>weblab://resource/MyTestURI</constant>
                </setBody>

		<!-- load the resource from a ResourceContainer service --> 
		<to uri="weblab://container:myContainerService?operation=loadResource" />

		<!-- Call the analyser service registered under the name myAnalyserService -->
		<to uri="weblab:analyser:myAnalyserService" />

		<!-- Final message -->
		<log message="Resource Loaded and Analysed"/>
	</route>

</camelContext>
...


Call a WebLab Cleanable service to remove resources

A WebLab cleanable endpoint must be called with a Collection of resource URIs, a single URI or a String value as message body. The response body is empty.


E.g. to delete a list of WebLab resources

<?xml version="1.0" encoding="UTF-8"?>

...
<!-- Clean a WebLab resource -->
<camelContext id="CleanableCamelContext"  xmlns="http://camel.apache.org/schema/spring">

	<!-- With a Route -->
	<route id="DirectCleanResourcesRoute" streamCache="true" autoStartup="true">
			
		<!-- entry point: the message body carries the resources to clean -->
		<from uri="direct:cleanResourcesRoute" />

		<!-- call Cleanable service --> 
		<to uri="weblab:cleanable:myCleanableService"/>

		<!-- Final message -->
		<log message="Resources cleaned"/>
	</route>

</camelContext>
...

This route can be called by passing a Collection as body:

import java.util.ArrayList;
import java.util.List;

// list of resources to clean
List<String> uris = new ArrayList<>();
uris.add("weblab://resource/MyResource1");
uris.add("weblab://resource/MyResource2");

// call the route with the list as body
// ('producer' is assumed to be a Camel ProducerTemplate)
producer.sendBody("direct:cleanResourcesRoute", uris);

It can also be called by passing a URI or a String as body:

// call the route with a URI/String as body
producer.sendBody("direct:cleanResourcesRoute", "weblab://resource/MyResource1");


WebLab content management CRUD methods

This component is only available as a Camel Producer. It allows the user to manage the contents of a WebLab resource with CRUD methods. When possible, the body of the input message is used to set the content (create and update methods) of a given WebLab resource. The only incoming message body type accepted as valid content by the 'create' and 'update' methods is 'InputStream'. An optional WebLab resource may be passed to the 'create', 'read' and 'update' methods; in that case, it has to be set in the dedicated message header 'WEBLAB_CONTENT_RESOURCE'. The 'read' and 'delete' methods require a content URI, given either in the message body or in the header 'WEBLAB_CONTENT_URI'.


URI Format

weblab:content:method[?options]

or

weblab://content:method[?options]

where method is one of 'create', 'read', 'update' or 'delete'.

You can append query options to the URI in the following format, ?option=value&option=value&... Note that the available options depend on the content manager implementation.

URI Options

FileContentManager implementation

No option available.

WebDAVContentManager

No option available.

CamelContentManager

To be completed.

Message Headers

Name | Description
WEBLAB_CONTENT_RESOURCE | The WebLab resource associated with the content to manage.
WEBLAB_CONTENT_URI | The URI of the WebLab resource content to manage.


Create a content for a WebLab resource

The WebLab producer creates a content for the WebLab resource from the exchange body:

      <!-- body is an input stream -->
       				
      <!-- create content -->
      <to uri="weblab://content:create"/>
      
      <!-- body is the URI of the created content -->

Read a content of a WebLab resource

The WebLab producer reads the content of the WebLab resource into the exchange body:

      <!-- body is the content URI  -->
       				
      <!-- read content -->
      <to uri="weblab://content:read"/>
      
      <!-- body is the content input stream  -->

Update a content for a WebLab resource

The WebLab producer updates a content of the WebLab resource from the exchange body:

       			
      <setHeader headerName="WEBLAB_CONTENT_URI">
            <simple>weblab://contentURI</simple>
      </setHeader>

      <!-- body is the content input stream  -->	
      <!-- update content -->
      <to uri="weblab://content:update"/>

      <!-- body is the same content URI  -->

Delete a content of a WebLab resource

The WebLab producer deletes the content of the WebLab resource:

      <!-- body is the content URI -->
       				
      <!-- delete content -->
      <to uri="weblab://content:delete"/>
      
      <!-- body is True if the content has been deleted or null otherwise -->

WARC Splitter

This component is only available as a Camel Splitter (split method). It splits WARC files into separate records, each sent as a Camel exchange message. Each message body corresponds to the content of one record, and the message headers are filled with all available archive and record headers.

The split method has to be declared as a Spring bean inside a Camel context.

Splitter bean declaration

	<bean id="warcSplitter" class="org.ow2.weblab.engine.camel.WarcSplitter">
		<property name="copyWarcInfoHeaders" value="true"/>
	</bean>

Splitter bean properties

Name | Default Value | Description
copyWarcInfoHeaders | false | Enable this option to copy the archive header information to the exchange message header 'ArchiveInfoPayloadHeader' (see below).

Message Headers

Headers added to each message:

Name | Description
ArchiveInfoPayloadHeader | Optional header that contains the WARC headers of the archive file. Example: ArchiveInfoPayloadHeader={robots=obey, software=Heritrix/3.1.0 http://crawler.archive.org, http-header-user-agent=Mozilla/5.0 (compatible; heritrix/3.1.0 +http://mozilla.org), conformsTo=http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf, description=Basic crawl starting with useful defaults, isPartOf=basic, format=WARC File Format 1.0, ...}. This header is available only if the 'copyWarcInfoHeaders' property is set to 'true'.
ArchiveRecordHeader | Header that contains the WARC headers of the current record of the archive file. Example: ArchiveRecordHeader={WARC-Profile=http://netpreserve.org/warc/1.0/revisit/identical-payload-digest, WARC-Type=revisit, WARC-Truncated=length, WARC-Date=2013-12-22T00:15:13Z, Content-Length=397, WARC-Record-ID=<urn:uuid:481dcbd9-80c2-4545-b20c-fb851850c6d5>, WARC-Payload-Digest=sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ, Content-Type=application/http; msgtype=response, ...}
ArchiveRecordPayloadHeader | Header that contains the headers of the payload of the current record of the archive file. Example: ArchiveRecordPayloadHeader={Age=69, ETag="345fd30f-409a-4eb11810acfd5", Last-Modified=Wed, 13 Nov 2013 16:31:58 GMT, Connection=close, Server=Apache/2.2.16 (Debian), Cache-Control=max-age=120, Date=Sun, 22 Dec 2013 01:39:24 GMT, Vary=Accept-Encoding, Content-Type=text/html;charset=utf-8, Accept-Ranges=bytes, ...}
ArchiveRecordHttpHeader | Header that contains the HTTP headers of the current record of the archive file. Example: ArchiveRecordHttpHeader={HttpHeader : [HttpResultCode: 200, HttpProtocolVersion: HTTP/1.1, HttpContentType: text/html; charset=UTF-8]}


To split a WARC file inside a Camel route

Here is an example of splitting a WARC file inside a Camel route, where each HTML or PDF record is used to create a new WebLab document annotated with values from the record headers:

        <!-- body is a WARC file -->

	<split>
	    <method bean="warcSplitter" method="splitMessage" />
						
	    <filter>
                <simple>${headers.ArchiveRecordHttpHeader.protocolStatusCode} range '200..300' and ${headers.ArchiveRecordPayloadHeader['Content-Type']} regex '(.*html.*)|(.*pdf.*)'</simple>

		<setHeader headerName="weblab:wlp:hasGatheringDate">
		    <simple resultType="java.util.Date">${headers.ArchiveRecordHeader['WARC-Date']}</simple>
		</setHeader>

		<setHeader headerName="weblab:wlp:hasOriginalFileSize">
		    <simple>${headers.ArchiveRecordPayloadHeader['Content-Length']}</simple>
		</setHeader>
		<setHeader headerName="weblab:dc:language">
		    <simple>${headers.ArchiveRecordPayloadHeader['Content-Language']}</simple>
		</setHeader>
							
		<!-- To produce a WebLab resource from the record content -->
		<to uri="weblab://create?type=Document&amp;outputMethod=xml" />

		<!-- body is an XML WebLab document -->
	    </filter>
	</split>
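
For readers less familiar with the simple expression used in the filter, the same condition can be restated in plain Java. This is only a sketch: WarcRecordFilter is a hypothetical class. It assumes the range bounds are inclusive and that the regex must match the whole Content-Type value.

```java
import java.util.regex.Pattern;

// Hypothetical restatement of the route's filter condition (not part of Camel-WebLab).
final class WarcRecordFilter {

    // Same pattern as in the route: any Content-Type mentioning html or pdf
    private static final Pattern HTML_OR_PDF = Pattern.compile("(.*html.*)|(.*pdf.*)");

    /** Keeps records with a status code in 200..300 (assumed inclusive) and an HTML or PDF payload. */
    static boolean accept(int httpStatus, String contentType) {
        boolean statusOk = httpStatus >= 200 && httpStatus <= 300;
        return statusOk && contentType != null && HTML_OR_PDF.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        System.out.println(accept(200, "text/html;charset=utf-8")); // true
        System.out.println(accept(404, "text/html"));               // false
        System.out.println(accept(200, "image/png"));               // false
    }
}
```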

WebLab Resource aggregation strategy

When a document is processed by two parallel branches that each enrich it, the documents resulting from the two sub-chains must be merged at the end to produce a single result document.

TODO: This part has to be completed.

Custom Date data format conversion

Sometimes header properties are Strings that represent a date in a given format. This is the case, for instance, for HTTP headers. These dates need to be parsed so that they can be written using the Annotator of the WebLab component.
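
As an illustration of the kind of parsing involved, the sketch below converts an HTTP date string into a java.util.Date using only the JDK. It is not the converter shipped with Camel-WebLab: HttpDateParser is a hypothetical class, and the input is the Last-Modified value from the ArchiveRecordPayloadHeader example above.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Date;

// Hypothetical helper, for illustration only (not the Camel-WebLab converter).
final class HttpDateParser {

    /** Parses an RFC 1123 date such as the value of an HTTP Last-Modified or Date header. */
    static Date parse(String httpDate) {
        ZonedDateTime parsed = ZonedDateTime.parse(httpDate, DateTimeFormatter.RFC_1123_DATE_TIME);
        return Date.from(parsed.toInstant());
    }

    public static void main(String[] args) {
        Date lastModified = parse("Wed, 13 Nov 2013 16:31:58 GMT");
        System.out.println(lastModified.toInstant()); // 2013-11-13T16:31:58Z
    }
}
```

A Date obtained this way can then be set as a header value, so that the producer uses the Date class to set the rdf datatype as described in the Message Headers section.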

TODO: This part has to be completed.