Tika-normaliser/1.9.1

Tika Normalisation service
Details
Service Interfaces: Analyser
Exchange model: WebLab 1.2.5
Licence: LGPL 2.1
Supported OS: Windows/Linux/MacOs
Integrated COTS: Tika
Binary: tika-normaliser-1.9.1.war
Sources: tika-normaliser-1.9.1-sources.jar
Javadoc: tika-normaliser-1.9.1-javadoc.jar
SVN: tika-normaliser
Maven Artifact

<groupId>org.ow2.weblab.webservices</groupId>

<artifactId>tika-normaliser</artifactId>

<version>1.9.1</version>
Release Note


This service wraps Tika as an Analyser service in the WebLab platform. Tika is used to create WebLab resources by extracting the textual part of various kinds of files, such as HTML pages, PDF files, Microsoft Office files and plain text files, amongst others.

Configuration

Two main kinds of configuration are available for this service.

First, two files located in WEB-INF/classes under the web application folder can be used to configure the WebLab ContentManager:

  • contentManager.properties: configures how the Tika Normalisation service accesses the file referenced by the hasNativeContent annotation. This file defines the ContentManager implementation to be used.
  • fileContentManager.properties: since contentManager.properties states that the FileContentManager should be used, this file configures that implementation by stating where files are to be found.

Then, the most important configuration is done through cxf-servlet.xml, using dependency injection. The various properties are described in the comments of the XML snippet below.

	<bean name="theConf" id="theConf" class="org.ow2.weblab.service.normaliser.tika.TikaConfiguration" scope="prototype">
		<!-- The path to Tika XML configuration file. Null value means use the default Tika configuration. -->
		<property name="pathToXmlConfigurationFile">
			<null />
		</property>
		<!-- Whether or not to annotate the WebLab document with metadata extracted by Tika. Tika properties are converted into WebLab properties. -->
		<property name="addMetadata" value="true" />
		<!-- Whether or not to annotate the WebLab document with metadata extracted by Tika that cannot be mapped to a WebLab ontology. -->
		<property name="addUnmappedProperties" value="false" />
		<!-- Whether or not to annotate the WebLab document with the language guessed by Tika. -->
		<property name="annotateDocumentWithLang" value="false" />
		<!-- The default language to be used if Tika is not able to guess it. Null value means no annotation. -->
		<property name="defaultLang">
			<null />
		</property>
		<!-- Whether or not to generate an HTML output, saved using the content manager. -->
		<property name="generateHtml" value="true" />
		<!-- Whether a property guessed by Tika should be added when the Document already contains the same property. -->
		<property name="overrideMetadata" value="false" />
		<!-- If the native content is a temporary one (it is not the file content manager), whether to remove it. -->
		<property name="removeTempContent" value="false" />
		<!-- The URI of the service used for producedBy annotations. If none, nothing will be added. -->
		<property name="serviceUri" value="http://weblab.ow2.org/service/normaliser/tika" />
		<!-- The property to be used to annotate unmapped properties (if addUnmappedProperties). The name of the property in Tika is cleaned and added at the end of this base URI. -->
		<property name="unmappedPropertiesBaseUri" value="http://weblab.ow2.org/property/tika#" />
		<!-- The prefix to be used in the RDFHelper to have a beautiful RDF. -->
		<property name="unmappedPropertiesPrefix" value="tika" />
		<!-- The class used to generate the WebLab document given the XHTML events in Tika. -->
		<property name="webLabHandlerDecoratorClass" value="org.ow2.weblab.service.normaliser.tika.handlers.SimpleTextContentHandler" />
		<!-- The class used to write Tika Metadata into Weblab ontology. -->
		<property name="metadataWriterClass" value="org.ow2.weblab.service.normaliser.tika.metadatawriter.DefaultMetaDataWriter" />
	</bean>
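
As an illustration of how these properties combine, the fragment below is a minimal sketch of a variant of theConf that turns on language annotation and keeps unmapped Tika metadata. It only uses property names from the snippet above. The value "en" for defaultLang is an assumption about the expected format (a two-letter language code), and the properties not shown are assumed to stay as in the full snippet.

```xml
<bean name="theConf" id="theConf" class="org.ow2.weblab.service.normaliser.tika.TikaConfiguration" scope="prototype">
	<!-- Annotate the WebLab document with the language guessed by Tika. -->
	<property name="annotateDocumentWithLang" value="true" />
	<!-- Fallback used when Tika cannot guess the language. "en" is an assumed value format, not taken from the service documentation. -->
	<property name="defaultLang" value="en" />
	<!-- Keep Tika metadata that cannot be mapped to a WebLab ontology... -->
	<property name="addUnmappedProperties" value="true" />
	<!-- ...and publish it under this base URI and RDF prefix (same values as the default snippet). -->
	<property name="unmappedPropertiesBaseUri" value="http://weblab.ow2.org/property/tika#" />
	<property name="unmappedPropertiesPrefix" value="tika" />
	<!-- Remaining properties omitted for brevity: keep them as in the full snippet. -->
</bean>
```

Since addUnmappedProperties is conditioned on metadata annotation in the comments above, addMetadata presumably needs to stay at "true" for the unmapped properties to appear.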


UsageContext effects

As this service is stateless, the UsageContext has no effect.

Examples of SOAP Input/Output

<soapenv:Envelope xmlns:ns8="http://weblab.ow2.org/core/1.2/model#" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:anal="http://weblab.ow2.org/core/1.2/services/analyser">
   <soapenv:Header/>
   <soapenv:Body>
      <anal:processArgs xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns6="http://weblab.ow2.org/core/1.2/model#" xmlns:ns7="http://weblab.ow2.org/core/1.2/services/analyser">
         <resource xsi:type="ns6:Document" uri="weblab://fileRepository/1326441871065/res_940" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <annotation uri="weblab://fileRepository/1326441871065/res_940#a0">
               <data>
                  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
                     <rdf:Description rdf:about="weblab://fileRepository/1326441871065/res_940" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
                        <wp:hasNativeContent rdf:resource="http://perdu.com"/>
                        <wp:hasOriginalFileName>index.html</wp:hasOriginalFileName>
                        <wp:hasOriginalFileSize>163</wp:hasOriginalFileSize>
                        <dcterms:extent>163 bytes</dcterms:extent>
                        <dcterms:modified>2010-03-02T19:52:21+0100</dcterms:modified>
                        <wp:hasGatheringDate>2011-10-15T00:11:00+0200</wp:hasGatheringDate>
                        <dc:source>http://perdu.com</dc:source>
                        <dc:title>Vous Etes Perdu</dc:title>
                     </rdf:Description>
                  </rdf:RDF>
               </data>
            </annotation>
         </resource>
      </anal:processArgs>
   </soapenv:Body>
</soapenv:Envelope>

Depending on its configuration, the Tika Normaliser service downloads the file specified by the hasNativeContent annotation and normalises it as follows:

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
   <soap:Body>
      <ns4:processReturn xmlns:ns4="http://weblab.ow2.org/core/1.2/services/analyser" xmlns:wl="http://weblab.ow2.org/core/1.2/model#">
         <resource xsi:type="wl:Document" uri="weblab://fileRepository/1326441871065/res_940" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <annotation uri="weblab://fileRepository/1326441871065/res_940#a0">
               <data>
                  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
                     <rdf:Description rdf:about="weblab://fileRepository/1326441871065/res_940" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
                        <wp:hasNativeContent rdf:resource="http://perdu.com"/>
                        <wp:hasOriginalFileName>index.html</wp:hasOriginalFileName>
                        <wp:hasOriginalFileSize>163</wp:hasOriginalFileSize>
                        <dcterms:extent>163 bytes</dcterms:extent>
                        <dcterms:modified>2010-03-02T19:52:21+0100</dcterms:modified>
                        <wp:hasGatheringDate>2011-10-15T00:11:00+0200</wp:hasGatheringDate>
                        <dc:source>http://perdu.com</dc:source>
                        <dc:title>Vous Etes Perdu</dc:title>
                        <wp:hasNormalisedContent rdf:resource="file:/home/fred/Dev/apache-tomcat-6.0.20/bin/data/content/weblab.7127726849319234860.content"/>
                     </rdf:Description>
                  </rdf:RDF>
               </data>
            </annotation>
            <annotation uri="weblab://fileRepository/1326441871065/res_940#a4">
               <data>
                  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:wlp="http://weblab.ow2.org/core/1.2/ontology/processing#">
                     <rdf:Description rdf:about="weblab://fileRepository/1326441871065/res_940#a4">
                        <dct:created>2012-03-29T15:20:59.625+02:00</dct:created>
                        <wlp:isProducedBy rdf:resource="http://weblab.ow2.org/service/normaliser/tika"/>
                     </rdf:Description>
                     <rdf:Description rdf:about="weblab://fileRepository/1326441871065/res_940">
                        <dc:title>Vous Etes Perdu ?</dc:title>
                     </rdf:Description>
                  </rdf:RDF>
               </data>
            </annotation>
            <mediaUnit xsi:type="wl:Text" uri="weblab://fileRepository/1326441871065/res_940#3">
               <content>Perdu sur l'Internet ? Pas de panique, on va vous aider *</content>
            </mediaUnit>
         </resource>
      </ns4:processReturn>
   </soap:Body>
</soap:Envelope>

Known Limitations

  • WEBLAB-897 - OverrideMetadata configuration flag is unused
  • WEBLAB-462 - It would be better if structure information were annotated on the resulting resource
  • WEBLAB-1020 - By default an HTML version of the original content is generated, which is useless in most cases
  • WEBLAB-1010 - Current implementation relies on version 1.1 of Tika while 1.4 has already been released
  • WEBLAB-263 - The mapping of tika internal metadata model to ontology based WebLab metadata model should be improved
  • WEBLAB-1022 - Tika does not properly produce text from metadata inclusion inside office documents
  • WEBLAB-1019 - Tika does not keep new lines from text/plain
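
Regarding WEBLAB-1020, the generateHtml property described in the Configuration section can be switched off in cxf-servlet.xml. The fragment below is a minimal sketch, assuming that nothing downstream relies on the generated HTML and that every other property of theConf keeps the value shown in the Configuration section. Its effect on the annotations returned by the service is not documented here and should be checked on a test document.

```xml
<bean name="theConf" id="theConf" class="org.ow2.weblab.service.normaliser.tika.TikaConfiguration" scope="prototype">
	<!-- Do not generate (nor save through the content manager) an HTML version of the original content. -->
	<property name="generateHtml" value="false" />
	<!-- Remaining properties omitted for brevity: keep them as in the Configuration section. -->
</bean>
```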


Dependencies

List of all dependencies of this service:

org.ow2.weblab.webservices:tika-normaliser:war:1.9.1
+- org.ow2.weblab.core:model:jar:1.2.5:compile
+- org.ow2.weblab.core:extended:jar:1.2.5:compile
|  \- commons-io:commons-io:jar:2.3:compile
+- org.ow2.weblab.components:content-manager:jar:2.0.0:compile
+- org.ow2.weblab.core.helpers:rdf-helper-jena:jar:1.4.0:compile
|  \- org.apache.jena:jena-core:jar:2.7.2:compile
|     +- org.apache.jena:jena-iri:jar:0.9.2:compile
|     +- xerces:xercesImpl:jar:2.9.0:compile
|     +- org.slf4j:slf4j-api:jar:1.6.4:compile
|     +- org.slf4j:slf4j-log4j12:jar:1.6.4:compile
|     \- log4j:log4j:jar:1.2.16:compile
+- org.apache.tika:tika-core:jar:1.1:compile
+- org.apache.tika:tika-parsers:jar:1.1:compile
|  +- org.gagravarr:vorbis-java-tika:jar:0.1:compile
|  |  \- org.gagravarr:vorbis-java-core:jar:tests:0.1:test,provided
|  +- edu.ucar:netcdf:jar:4.2-min:compile
|  +- org.apache.james:apache-mime4j-core:jar:0.7:compile
|  +- org.apache.james:apache-mime4j-dom:jar:0.7:compile
|  +- org.apache.commons:commons-compress:jar:1.4.1:compile
|  |  \- org.tukaani:xz:jar:1.0:compile
|  +- commons-codec:commons-codec:jar:1.7:compile
|  +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
|  |  +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
|  |  \- org.apache.pdfbox:jempbox:jar:1.6.0:compile
|  +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
|  +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
|  +- org.apache.poi:poi:jar:3.8-beta5:compile
|  +- org.apache.poi:poi-scratchpad:jar:3.8-beta5:compile
|  +- org.apache.poi:poi-ooxml:jar:3.8-beta5:compile
|  |  +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta5:compile
|  |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
|  |  \- dom4j:dom4j:jar:1.6.1:compile
|  +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
|  +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
|  +- asm:asm:jar:3.1:compile
|  +- com.googlecode.mp4parser:isoparser:jar:1.0-beta-5:compile
|  |  \- net.sf.scannotation:scannotation:jar:1.0.2:compile
|  |     \- javassist:javassist:jar:3.6.0.GA:compile
|  +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
|  +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
|  +- rome:rome:jar:0.9:compile
|  |  \- jdom:jdom:jar:1.0:compile
|  \- org.gagravarr:vorbis-java-core:jar:0.1:compile
+- com.ibm.icu:icu4j:jar:3.8:runtime
+- javax.mail:mail:jar:1.4.5:runtime
|  \- javax.activation:activation:jar:1.1:runtime
+- org.ow2.weblab.core:annotator:jar:1.2.5:compile
|  \- joda-time:joda-time:jar:2.1:compile
+- org.apache.cxf:cxf-rt-frontend-jaxws:jar:2.6.3:compile
|  +- xml-resolver:xml-resolver:jar:1.2:compile
|  +- org.apache.cxf:cxf-api:jar:2.6.3:compile
|  |  +- org.codehaus.woodstox:woodstox-core-asl:jar:4.1.4:runtime
|  |  |  \- org.codehaus.woodstox:stax2-api:jar:3.1.1:runtime
|  |  +- org.apache.ws.xmlschema:xmlschema-core:jar:2.0.3:compile
|  |  +- org.apache.geronimo.specs:geronimo-javamail_1.4_spec:jar:1.7.1:compile
|  |  \- wsdl4j:wsdl4j:jar:1.6.2:compile
|  +- org.apache.cxf:cxf-rt-core:jar:2.6.3:compile
|  |  \- com.sun.xml.bind:jaxb-impl:jar:2.2.5:compile
|  +- org.apache.cxf:cxf-rt-bindings-soap:jar:2.6.3:compile
|  |  \- org.apache.cxf:cxf-rt-databinding-jaxb:jar:2.6.3:compile
|  +- org.apache.cxf:cxf-rt-bindings-xml:jar:2.6.3:compile
|  +- org.apache.cxf:cxf-rt-frontend-simple:jar:2.6.3:compile
|  \- org.apache.cxf:cxf-rt-ws-addr:jar:2.6.3:compile
|     \- org.apache.cxf:cxf-rt-ws-policy:jar:2.6.3:compile
|        \- org.apache.neethi:neethi:jar:3.0.2:compile
+- org.apache.cxf:cxf-rt-transports-http:jar:2.6.3:compile
+- org.apache.cxf:cxf-rt-management:jar:2.6.3:compile
+- org.springframework:spring-web:jar:3.0.7.RELEASE:compile
|  +- aopalliance:aopalliance:jar:1.0:compile
|  +- org.springframework:spring-beans:jar:3.0.7.RELEASE:compile
|  +- org.springframework:spring-context:jar:3.0.7.RELEASE:compile
|  |  +- org.springframework:spring-aop:jar:3.0.7.RELEASE:compile
|  |  +- org.springframework:spring-expression:jar:3.0.7.RELEASE:compile
|  |  \- org.springframework:spring-asm:jar:3.0.7.RELEASE:compile
|  \- org.springframework:spring-core:jar:3.0.7.RELEASE:compile
+- xalan:xalan:jar:2.7.1:compile
|  \- xalan:serializer:jar:2.7.1:compile
+- xml-apis:xml-apis:jar:1.3.04:compile
+- commons-logging:commons-logging:jar:1.1.1:compile
+- junit:junit:jar:4.10:test
|  \- org.hamcrest:hamcrest-core:jar:1.1:test
\- javax.servlet:servlet-api:jar:2.5:provided