Difference between revisions of "Camel-WebLab2.0.2"

From WebLab Wiki
Revision as of 09:43, 2 April 2015

The Camel-WebLab component provides:

  • access to the WebLab Document factory, to create a WebLab Document from other Camel components or from messages sent by other components.
  • access to the OSGi registry, so that documents can be processed by WebLab services referenced as OSGi services.
  • a WARC splitter and an aggregation strategy to merge WebLab resources.
  • custom types that can be used to convert date metadata.

The Camel-WebLab component is used in the ESB.


WebLab Resource Factory

This component is only available as a Camel Producer. It allows the user to create a WebLab resource from other Camel components or from messages sent by other components. When possible, the body of the input message is used to set the content of the newly created resource. The producer accepts message bodies of type 'ByteArrayOutputStream', 'String', 'InputStream' and 'File' as valid content.


URI Format

weblab:create[?options]

or

weblab://create[?options]

You can append query options to the URI in the following format: ?option=value&option=value&...


URI Options

Producer

Name | Default Value | Description
type | null | REQUIRED. The WebLab type of the produced resource. The expected value is a compatible WebLab resource class (Document, Text, Image, Audio, Video, Resource, Annotation, PieceOfKnowledge). The WebLab type should match the content type of the input message. Note that the producer does not verify the content, so, for example, an image (byte array) can be set as the native content of a Text resource without error.
writeContent | false | Write the message body content as the unit's embedded content. By default, the producer writes the input exchange message body as native content: the body is written by the ContentManager and the hasNativeContent URI is added to the output WebLab resource as an annotation. If writeContent is set to true, the input exchange message body is added directly to the content part of the created WebLab MediaUnit.
outputMethod | null | By default, the producer returns the created WebLab resource as a Java object. When outputMethod is set to XML, the WebLab resource is returned as an XML document.
streaming | false | When the output method is set to XML, the producer by default marshals the created WebLab resource into a String. When streaming is set to true, the producer marshals the created WebLab resource into a stream.
filterNonXMLChars | true | When the type option is set to Text, the producer by default skips non-XML characters (see XMLChars) in the text content.
documentWrapped | false | When documentWrapped is set to true and the created resource is a Media Unit, it is wrapped in a WebLab document instead of being returned as a Media Unit. In all cases, the annotations describe the created Media Unit (not the wrapping document).
prefixMap | A map initialized with the prefixes wlm, wlp, rdf, rdfs, dc and dct, corresponding respectively to the model, processing, rdf, rdfs, dublinCore elements and dublinCore terms namespaces | When a custom property is used to annotate a resource, the corresponding prefix must be defined in a map and passed to the WebLab producer through this option.
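
As an illustration of how the options above combine into an endpoint URI, here is a minimal Java sketch. The class WebLabCreateUri and its methods are hypothetical helpers written for this page; they are not part of Camel-WebLab. It only assembles the ?option=value&option=value string and shows the &amp; escaping needed when the URI is placed in an XML attribute.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper (not part of Camel-WebLab): builds a weblab://create URI from options. */
final class WebLabCreateUri {

    private WebLabCreateUri() { }

    /** Joins the options as ?option=value&option=value, in insertion order. */
    static String build(Map<String, String> options) {
        StringJoiner query = new StringJoiner("&", "?", "");
        query.setEmptyValue(""); // no options: no '?' at all
        for (Map.Entry<String, String> option : options.entrySet()) {
            query.add(option.getKey() + "=" + option.getValue());
        }
        return "weblab://create" + query;
    }

    /** Escapes '&' so the URI can be pasted into an XML attribute such as <to uri="..."/>. */
    static String forXmlAttribute(String uri) {
        return uri.replace("&", "&amp;");
    }

    public static void main(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("type", "Text");          // required option
        options.put("outputMethod", "xml");   // return an XML document instead of a Java object
        options.put("streaming", "true");     // marshal into a stream instead of a String
        String uri = build(options);
        System.out.println(uri);
        System.out.println(forXmlAttribute(uri));
    }
}
```

The sketch does not validate option names or values; in particular it does not check that the required type option is present.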


Message Headers

Name | Description
weblab:prefix:property[@lang=isoCode] | Annotates the created WebLab resource with the property defined in the header. The complete property URI is defined by the prefix and property parts of the header name. The header value corresponds to the property value. If the value is language-dependent, the optional @lang part of the header name is used to set the annotation language attribute (ISO 639 codes). Default WebLab annotators are used to annotate the resource. Predefined prefixes are wlm, wlp, rdf, rdfs, dc and dct, corresponding respectively to the model, processing, rdf, rdfs, dublinCore elements and dublinCore terms namespaces. All other prefixes have to be defined and passed to the WebLab producer with the 'prefixMap' option. Note that the class of the header value object is used to set the RDF datatype of the produced RDF triples (for example, the Date class sets the RDF datatype to XMLSchema#dateTime).
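
The header naming scheme above can be sketched in Java as follows. WebLabHeaderNames is a hypothetical helper, not a class of the component. The namespace URIs are the ones the predefined prefixes point to; the assumption that the complete property URI is the namespace followed by the property name is ours and should be checked against the WebLab annotators.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of Camel-WebLab) illustrating the weblab:prefix:property[@lang=isoCode] scheme. */
final class WebLabHeaderNames {

    /** The six predefined prefixes and their namespaces. */
    static final Map<String, String> PREDEFINED_PREFIXES;
    static {
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("wlm", "http://weblab.ow2.org/core/1.2/ontology/model#");
        prefixes.put("wlp", "http://weblab.ow2.org/core/1.2/ontology/processing#");
        prefixes.put("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
        prefixes.put("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
        prefixes.put("dc", "http://purl.org/dc/elements/1.1/");
        prefixes.put("dct", "http://purl.org/dc/terms/");
        PREDEFINED_PREFIXES = Collections.unmodifiableMap(prefixes);
    }

    private WebLabHeaderNames() { }

    /** Builds the header name; lang may be null or empty when the value is not language-dependent. */
    static String headerName(String prefix, String property, String lang) {
        String name = "weblab:" + prefix + ":" + property;
        return (lang == null || lang.isEmpty()) ? name : name + "@lang=" + lang;
    }

    /** Assumption: the full property URI is the prefix namespace followed by the property name. */
    static String propertyUri(String prefix, String property, Map<String, String> prefixMap) {
        String namespace = prefixMap.get(prefix);
        if (namespace == null) {
            throw new IllegalArgumentException("Unknown prefix, add it to the prefixMap: " + prefix);
        }
        return namespace + property;
    }

    public static void main(String[] args) {
        System.out.println(headerName("dc", "title", "en"));
        System.out.println(propertyUri("dc", "title", PREDEFINED_PREFIXES));
    }
}
```

A prefix that is missing from the map is rejected here, which mirrors the rule that custom prefixes must be supplied through the 'prefixMap' option.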


Create a WebLab Text Resource

The WebLab producer will create a WebLab Text resource Java object as the exchange body:

      <!-- Set the body with text. This text will be used by the producer to create the resource content -->
      <setBody>
         <simple>This is the text content of my resource.</simple>
      </setBody>
       				
      <!-- To produce a Text with content embedded from body as a string -->
      <to uri="weblab://create?type=Text"/>
      
      <!-- Now the body is a WebLab Text java object -->


To create a XML WebLab Text resource inside a Camel route

The WebLab producer will create a WebLab Text resource XML representation as the exchange body:

      <!-- Set the body with text. This text will be used by the producer to create the resource content -->
      <setBody>
         <simple>This is the text content of my resource.</simple>
      </setBody>
       				
      <!-- To produce a Text with content embedded from body as a string -->
      <to uri="weblab://create?type=Text&amp;outputMethod=xml"/>
      
      <!-- Now the body is an XML WebLab Text -->


To create an annotated XML WebLab resource inside a Camel route

The WebLab producer will create an annotated WebLab Text resource XML representation as the exchange body. Here, the Text is annotated with a creation date (dct:created) and an English title (dc:title):

      <!-- Set the body with text. This text will be used by the producer to create the resource content -->
      <setBody>
         <simple>This is the text content of my resource.</simple>
      </setBody>

      <!-- Set properties used by the producer to annotate the created resource -->
      <setHeader headerName="weblab:dct:created">
         <simple resultType="java.util.Date">${date:now:yyyy-MM-dd'T'HH:mm:ss}</simple>
      </setHeader>
      <setHeader headerName="weblab:dc:title@lang=en">
         <simple>The title of my resource</simple>
      </setHeader>
 				
      <!-- To produce a Text with content embedded from body and annotations from headers -->
      <to uri="weblab://create?type=Text&amp;outputMethod=xml"/>
      
      <!-- Now the body is an XML WebLab Text -->


To create an annotated WebLab resource inside a Camel route with custom properties

The WebLab producer will create an annotated WebLab resource with content from the exchange body and annotations from headers. Here, the Document is annotated with an English title (dc:title) and a French value for the custom property myOntoPrefix:mySpecialProperty.

Define a prefix map bean outside of the Camel context (in the Spring context):

      <util:map id="myPrefixMap">
	<!-- standard ontologies prefix (if needed) -->
	<entry key="dc" value="http://purl.org/dc/elements/1.1/"/>
		
	<!-- custom ontologies prefix -->
	<entry key="myOntoPrefix" value="http://mySpecialOntology#"/>
      </util:map>

Call the producer inside a route by referencing the defined prefix map:

      <!-- Set the body with text. This text will be used by the producer to create the resource native content -->
      <setBody>
         <simple>This is the text content of my resource.</simple>
      </setBody>

      <!-- Set properties used by the producer to annotate the created resource -->
      <setHeader headerName="weblab:dc:title@lang=en">
         <constant>The title of my resource</constant>
      </setHeader>
      <setHeader headerName="weblab:myOntoPrefix:mySpecialProperty@lang=fr">
	 <constant>ma valeur</constant>
      </setHeader>
 				
      <!-- To produce an annotated Document with native content from body and annotations from headers -->
      <to uri="weblab://create?type=Document&amp;prefixMap=#myPrefixMap" />


Bean support

Camel-weblab may use WebLab Annotations to create or enrich a document.

WebLab Service Registry

This component is only available as a Camel Producer. It allows the user to call a WebLab service defined as an OSGi service reference. It therefore supports distributed services: you can add and remove WebLab services at any time and reuse services that are already deployed in your own new routes.

URI Format

weblab:service-interface:service-name[?options]

or

weblab://service-interface:service-name[?options]

Where service-interface represents one of the supported WebLab Service Interfaces:

  • analyser
  • indexer
  • resourceSaver

and where service-name corresponds to the OSGi service name property of the WebLab service to request.

You can append query options to the URI in the following format: ?option=value&option=value&...

URI Options

Producer

Name | Default Value | Description
filter | null | The OSGi filter used to search for the service in the OSGi registry. By default the filter is (name=service-name). The filter must be URL-encoded; see how to select a WebLab service with an OSGi filter for more details about its usage.
type | null | By default, the producer searches for service-name in the OSGi service registry. If type=direct, service-name is not looked up in the OSGi registry, the filter option is ignored, and service-name is called as a Camel endpoint. See how to call a WebLab service directly for more details about its usage.

Register and call a WebLab service as an OSGi service

WebLab services must be registered as OSGi service references in order to be available from the WebLab endpoint.

To register a WebLab service my-service as an OSGi service reference with the service name service-my-service:

	<!-- CXF Endpoint referring to an analyser service deployed on a Web Application server at http://localhost:8181/my-service/analyser -->	
	<cxf:cxfEndpoint id="my-service"
		address="http://localhost:8181/my-service/analyser"
		serviceClass="org.ow2.weblab.core.services.Analyser">
		<cxf:properties>
			<entry key="dataFormat" value="PAYLOAD" />
			<entry key="allowStreaming" value="true" />
		</cxf:properties>
	</cxf:cxfEndpoint>

	<!-- The OSGi service referencing the previous CXF Endpoint -->
	<osgi:service id="osgi-my-service" ref="my-service"  interface="org.apache.camel.Endpoint" >
		<osgi:service-properties>
			<!-- We will use this name when we refer to the service in Camel routes -->
   			<entry key="name" value="service-my-service"/>
		</osgi:service-properties>
	</osgi:service>

To call the previous WebLab service from any Camel route, you MUST refer to its service name (service-my-service):

	<!-- a timer will call the route only once -->
	<from uri="timer:callOnlyOnce?repeatCount=1" />

	<!-- we create an empty document -->
	<to uri="weblab://create?type=Document" />

	<!-- we send it to the OSGi service whose name property is 'service-my-service' -->
	<to uri="weblab:analyser:service-my-service" />

More details and examples are given in the tutorial on how to create a WebLab processing chain

Use OSGi filters to find and to call a WebLab service

The option filter allows you to define filters instead of using the service name property.

The following example selects a service whose classification property is defined as named-entity-extraction:

	<from uri="direct:start" />
	<to uri="weblab://create?type=Document" />
	<to uri="weblab:analyser?filter=(classification=named-entity-extraction)" />

You can use the OSGi filter syntax to define more complex filters, for example to search for a service by both its classification and its name. In that case, the filter value must be URL-encoded in the URI (here, & is written as %26):

	<from uri="direct:start" />
	<to uri="weblab://create?type=Document" />
	<to uri="weblab:analyser?filter=(%26(name=service-gate)(classification=named-entity-extraction))" />
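
The encoding step can be sketched in Java as below. OsgiFilterUri is a hypothetical helper, not part of the component. It applies only the minimal escaping needed to reproduce the URI above (the percent sign and the ampersand); whether other reserved characters also need escaping depends on how the endpoint decodes its URI, which is not covered here.

```java
/** Hypothetical helper (not part of Camel-WebLab): encodes an OSGi filter for use in a weblab URI. */
final class OsgiFilterUri {

    private OsgiFilterUri() { }

    /** Minimal escaping: '%' first (so existing escapes are not corrupted), then '&'. */
    static String encodeFilter(String filter) {
        return filter.replace("%", "%25").replace("&", "%26");
    }

    /** Builds a weblab URI that selects an analyser service with the given OSGi filter. */
    static String analyserUri(String filter) {
        return "weblab:analyser?filter=" + encodeFilter(filter);
    }

    public static void main(String[] args) {
        String filter = "(&(name=service-gate)(classification=named-entity-extraction))";
        System.out.println(analyserUri(filter));
    }
}
```

Encoding the ampersand also avoids having to write it as &amp; inside the XML uri attribute.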

Request a WebLab service directly (without an OSGi service name or filter)

You may want to call a WebLab service without sharing it as an OSGi service, for example: a new service for integration test purposes.

The option type=direct allows you to call a WebLab service without first registering it as an OSGi service reference. When using type=direct, the service-name must be a URI describing the Camel endpoint of the service (for example: cxf:bean:my-cxf-endpoint).

For example, to use a Gazetteer service exposed on a Tomcat server, define a CXF endpoint and refer to it using a Camel URI (cxf:bean:gazetteer):

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:camel="http://camel.apache.org/schema/spring"
	xmlns:cxf="http://camel.apache.org/schema/cxf" xmlns:osgi="http://www.springframework.org/schema/osgi"
	xsi:schemaLocation="http://www.springframework.org/schema/beans 
         http://www.springframework.org/schema/beans/spring-beans-3.0.xsd 
         http://camel.apache.org/schema/spring 
         http://camel.apache.org/schema/spring/camel-spring.xsd
 	 http://www.springframework.org/schema/osgi  
       	 http://www.springframework.org/schema/osgi/spring-osgi.xsd
         http://camel.apache.org/schema/dataFormats 
         http://camel.apache.org/schema/spring/camel-spring.xsd
         http://cxf.apache.org/jaxws
         http://cxf.apache.org/schemas/jaxws.xsd
         http://camel.apache.org/schema/cxf 
         http://camel.apache.org/schema/cxf/camel-cxf.xsd">

	<!-- Define the CXF Endpoint of a Gazetteer service -->
	<cxf:cxfEndpoint id="gazetteer"
		address="http://localhost:8181/simple-gazetteer/simple-gazetteer"
		serviceClass="org.ow2.weblab.core.services.Analyser">
		<cxf:properties>
			<entry key="dataFormat" value="PAYLOAD" />
			<entry key="allowStreaming" value="true" />
		</cxf:properties>
	</cxf:cxfEndpoint>

	<!-- Create a Camel Context -->
	<camelContext id="DirectWebLabServiceCallCamelContext"  xmlns="http://camel.apache.org/schema/spring">

		<!-- With a Route -->
		<route id="DirectWebLabServiceCallRoute" streamCache="true" autoStartup="true">
			
			<!-- call it only once -->
			<from uri="timer:callOnlyOnce?repeatCount=1" />

			<!-- create an empty WebLab Document --> 
			<to uri="weblab://create?type=Document" />

			<!-- Call the WebLab endpoint with option type=direct referencing the Camel Endpoint URI: cxf:bean:gazetteer -->
			<to uri="weblab:analyser:cxf:bean:gazetteer?type=direct" />

			<!-- Final message -->
			<log message="Analyser Service simple-gazetteer processed the document"/>
		</route>

	</camelContext>

</beans>

See Also

Warcs Splitter

This component is only available as a Camel Splitter (split method). It splits WARC files into separate records, each emitted as a Camel exchange message. Each message body corresponds to a record content, and the message headers are filled with all available archive and record headers.

The split method has to be declared as a Spring bean inside a Camel context.

Splitter bean declaration

	<bean id="warcSplitter" class="org.ow2.weblab.engine.camel.WarcSplitter">
		<property name="copyWarcInfoHeaders" value="true"/>
	</bean>

Splitter bean properties

Name | Default Value | Description
copyWarcInfoHeaders | false | Enable this option to copy the archive header information to the exchange message header 'ArchiveInfoPayloadHeader' (see below).

Message Headers

Headers added to each message:

Name | Description
ArchiveInfoPayloadHeader | Optional header that contains the WARC headers of the archive file. Here is an example of headers that could be described in this header: ArchiveInfoPayloadHeader={robots=obey, software=Heritrix/3.1.0 http://crawler.archive.org, http-header-user-agent=Mozilla/5.0 (compatible; heritrix/3.1.0 +http://mozilla.org), conformsTo=http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf, description=Basic crawl starting with useful defaults, isPartOf=basic, format=WARC File Format 1.0, ...}. This header is available only if the 'copyWarcInfoHeaders' property is set to 'true'.
ArchiveRecordHeader | Header that contains the WARC headers of the current record of the archive file. Here is an example of headers that could be described in this header: ArchiveRecordHeader={WARC-Profile=http://netpreserve.org/warc/1.0/revisit/identical-payload-digest, WARC-Type=revisit, WARC-Truncated=length, WARC-Date=2013-12-22T00:15:13Z, Content-Length=397, WARC-Record-ID=<urn:uuid:481dcbd9-80c2-4545-b20c-fb851850c6d5>, WARC-Payload-Digest=sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ, Content-Type=application/http; msgtype=response, ...}
ArchiveRecordPayloadHeader | Header that contains the WARC headers of the current record payload of the archive file. Here is an example of headers that could be described in this header: ArchiveRecordPayloadHeader={Age=69, ETag="345fd30f-409a-4eb11810acfd5", Last-Modified=Wed, 13 Nov 2013 16:31:58 GMT, Connection=close, Server=Apache/2.2.16 (Debian), Cache-Control=max-age=120, Date=Sun, 22 Dec 2013 01:39:24 GMT, Vary=Accept-Encoding, Content-Type=text/html;charset=utf-8, Accept-Ranges=bytes, ...}
ArchiveRecordHttpHeader | Header that contains the HTTP headers of the current record of the archive file. Here is an example of headers that could be described in this header: ArchiveRecordHttpHeader={HttpHeader : [HttpResultCode: 200, HttpProtocolVersion: HTTP/1.1, HttpContentType: text/html; charset=UTF-8]}


For more information about the header fields you can:

To split a warc file inside a camel route

Here is an example of splitting a WARC file inside a Camel route, where each HTML or PDF record is used to create a new WebLab document annotated with values from the record headers:

        <!-- ... body is a WARC file -->

	<split>
	    <method bean="warcSplitter" method="splitMessage" />
						
	    <filter>
                <simple>${headers.ArchiveRecordHttpHeader.protocolStatusCode} range '200..300' and ${headers.ArchiveRecordPayloadHeader['Content-Type']} regex '(.*html.*)|(.*pdf.*)'</simple>

		<setHeader headerName="weblab:wlp:hasGatheringDate">
		    <simple resultType="java.util.Date">${headers.ArchiveRecordHeader['WARC-Date']}</simple>
		</setHeader>

		<setHeader headerName="weblab:wlp:hasOriginalFileSize">
		    <simple>${headers.ArchiveRecordPayloadHeader['Content-Length']}</simple>
		</setHeader>
		<setHeader headerName="weblab:dc:language">
		    <simple>${headers.ArchiveRecordPayloadHeader['Content-Language']}</simple>
		</setHeader>
							
		<!-- To produce a WebLab resource from the record content -->
		<to uri="weblab://create?type=Document&amp;outputMethod=xml" />

		<!-- ... body is an XML WebLab document -->
	    </filter>
	</split>
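
For readers less familiar with the Simple language, the filter condition in the route above can be read as the following Java predicate. This is only an explanatory sketch with a hypothetical class name, WarcRecordFilter; it assumes that the Simple range operator includes both bounds and that the regex operator must match the whole value, and it treats a missing Content-Type as a non-match.

```java
import java.util.regex.Pattern;

/** Hypothetical Java reading of the <filter> condition used in the route (not part of Camel-WebLab). */
final class WarcRecordFilter {

    /** Same expression as in the route: the content type mentions html or pdf anywhere. */
    private static final Pattern HTML_OR_PDF = Pattern.compile("(.*html.*)|(.*pdf.*)");

    private WarcRecordFilter() { }

    /** Keeps records with a status code in 200..300 (bounds included) and an HTML or PDF content type. */
    static boolean accept(int statusCode, String contentType) {
        boolean statusInRange = statusCode >= 200 && statusCode <= 300;
        return statusInRange
                && contentType != null
                && HTML_OR_PDF.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        System.out.println(accept(200, "text/html;charset=utf-8"));
        System.out.println(accept(404, "text/html"));
        System.out.println(accept(200, "image/png"));
    }
}
```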

WebLab Resource aggregation strategy

When a document is processed by two parallel branches that each enrich it, the documents resulting from the two sub-chains must be merged at the end in order to produce a single result document.

TODO: This part has to be completed.

Custom Date data format conversion

Header properties are sometimes Strings that represent a date in a given format, as is the case for HTTP headers. These dates must be parsed before they can be written using the Annotator of the WebLab component.
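
As a rough idea of the parsing involved, the following Java sketch converts an HTTP date header value into a java.util.Date using only the JDK. It is not the converter shipped with the component; the class name HttpHeaderDates is hypothetical, and only the RFC 1123 date format (the one used by headers such as Last-Modified) is handled.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Date;

/** Hypothetical helper (not the component's converter): parses HTTP date headers into java.util.Date. */
final class HttpHeaderDates {

    private HttpHeaderDates() { }

    /**
     * Parses an RFC 1123 date such as "Wed, 13 Nov 2013 16:31:58 GMT".
     * Throws java.time.format.DateTimeParseException if the value has another format.
     */
    static Date parseHttpDate(String value) {
        ZonedDateTime parsed = ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME);
        return Date.from(parsed.toInstant());
    }

    public static void main(String[] args) {
        Date lastModified = parseHttpDate("Wed, 13 Nov 2013 16:31:58 GMT");
        // Printed as an ISO-8601 instant, independent of the local time zone
        System.out.println(lastModified.toInstant());
    }
}
```

Returning a java.util.Date matters because the class of the header value decides the RDF datatype of the annotation: a Date is written as XMLSchema#dateTime.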

TODO: This part has to be completed.