Tika-normaliser/1.8
From WebLab Wiki
This service integrates Tika into the WebLab platform. Tika is used to create WebLab resources by extracting the textual content from various kinds of files, such as HTML pages, PDF files, Microsoft Office files and plain text files, among others.
| Details | |
|---|---|
| Service Interfaces | Analyser |
| Exchange model | WebLab 1.2.2 |
| Versions | |
| Licence | LGPL 2.1 |
| Supported OS | Windows/Linux/Mac OS |
| Integrated COTS | Tika |
| Binary | tika-normaliser-1.8.war |
| Sources | tika-normaliser-1.8-sources.jar |
| Javadoc | tika-normaliser-1.8-javadoc.jar |
| SVN | tika-normaliser |
| Maven Artifact | <groupId>org.ow2.weblab.webservices</groupId> <artifactId>tika-normaliser</artifactId> <version>1.8</version> |
| Release Note | |
Configuration
Configuration is done through three files located in the web application folder under WEB-INF/classes:
- log4j.properties: configures logging
- contentManager.properties: configures how the Tika Normaliser service accesses the file referenced by the hasNativeContent annotation
- fileContentManager.properties: configures the default contentManager implementation, which tries to retrieve the native content through file, FTP, HTTP and HTTPS URLs
UsageContext effects
As this service is stateless, the UsageContext has no effect.
Examples of SOAP Input/Output
<soapenv:Envelope xmlns:ns8="http://weblab.ow2.org/core/1.2/model#" xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:anal="http://weblab.ow2.org/core/1.2/services/analyser">
  <soapenv:Header/>
  <soapenv:Body>
    <anal:processArgs xmlns:ns2="http://weblab.ow2.org/core/1.2/services/resourcecontainer" xmlns:ns4="http://weblab.ow2.org/core/1.2/services/configurable" xmlns:ns3="http://weblab.ow2.org/core/1.2/services/trainable" xmlns:ns9="http://weblab.ow2.org/core/1.2/services/exception" xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:ns12="http://weblab.ow2.org/core/1.2/services/queuemanager" xmlns:ns5="http://weblab.ow2.org/core/1.2/services/reportprovider" xmlns:ns6="http://weblab.ow2.org/core/1.2/model#" xmlns:ns7="http://weblab.ow2.org/core/1.2/services/analyser" xmlns:ns10="http://weblab.ow2.org/core/1.2/services/searcher" xmlns:ns8="http://weblab.ow2.org/core/1.2/services/sourcereader" xmlns:ns11="http://weblab.ow2.org/core/1.2/services/indexer">
      <resource xsi:type="ns6:Document" uri="weblab://fileRepository/1326441871065/res_940" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <annotation uri="weblab://fileRepository/1326441871065/res_940#a0">
          <data xmlns:ns2="http://weblab.ow2.org/core/1.2/services/sourcereader" xmlns:ns4="http://weblab.ow2.org/core/1.2/services/queuemanager" xmlns:ns3="http://weblab.ow2.org/core/1.2/services/configurable">
            <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
              <rdf:Description rdf:about="weblab://fileRepository/1326441871065/res_940" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
                <wp:hasNativeContent rdf:resource="http://perdu.com"/>
                <wp:hasOriginalFileName>index.html</wp:hasOriginalFileName>
                <wp:hasOriginalFileSize>163</wp:hasOriginalFileSize>
                <dcterms:extent>163 bytes</dcterms:extent>
                <dcterms:modified>2010-03-02T19:52:21+0100</dcterms:modified>
                <wp:hasGatheringDate>2011-10-15T00:11:00+0200</wp:hasGatheringDate>
                <dc:source>http://perdu.com</dc:source>
                <dc:title>Vous Etes Perdu</dc:title>
              </rdf:Description>
            </rdf:RDF>
          </data>
        </annotation>
      </resource>
    </anal:processArgs>
  </soapenv:Body>
</soapenv:Envelope>
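A request of this shape can be assembled programmatically. The following is a minimal sketch in Python, not the WebLab client API: the function names `build_process_args` and `post_envelope` are hypothetical, the endpoint URL is left as a parameter because this page does not state it, and whether the service accepts an annotation that carries only hasNativeContent is an assumption.

```python
import urllib.request
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Reduced version of the request above: only the namespaces that are
# actually used, and only the hasNativeContent statement (assumption:
# the other metadata is optional for the service).
TEMPLATE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:anal="http://weblab.ow2.org/core/1.2/services/analyser"
    xmlns:wl="http://weblab.ow2.org/core/1.2/model#"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <soapenv:Header/>
  <soapenv:Body>
    <anal:processArgs>
      <resource xsi:type="wl:Document" uri={uri}>
        <annotation uri={annotation_uri}>
          <data>
            <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
              <rdf:Description rdf:about={uri}>
                <wp:hasNativeContent rdf:resource={native}/>
              </rdf:Description>
            </rdf:RDF>
          </data>
        </annotation>
      </resource>
    </anal:processArgs>
  </soapenv:Body>
</soapenv:Envelope>"""


def build_process_args(resource_uri, native_content_url):
    """Return a processArgs envelope as a string (hypothetical helper)."""
    # quoteattr escapes &, < and > and adds the surrounding quotes.
    return TEMPLATE.format(
        uri=quoteattr(resource_uri),
        annotation_uri=quoteattr(resource_uri + "#a0"),
        native=quoteattr(native_content_url),
    )


def post_envelope(endpoint_url, envelope):
    """POST the envelope as SOAP 1.1; endpoint_url depends on the deployment."""
    request = urllib.request.Request(
        endpoint_url,
        data=envelope.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


envelope = build_process_args(
    "weblab://fileRepository/1326441871065/res_940", "http://perdu.com"
)
ET.fromstring(envelope)  # raises ParseError if the envelope is not well-formed
```

Building the envelope from a template keeps the `xsi:type="wl:Document"` prefix declaration explicit, which a generic XML builder would not declare on its own.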
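Depending on its configuration, the Tika Normaliser service downloads the file referenced by the hasNativeContent annotation and normalises it. The returned resource gains a hasNormalisedContent statement, a new annotation stating that it was produced by the Tika normaliser, and a Text media unit containing the extracted text: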
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ns4:processReturn xmlns:ns2="http://weblab.ow2.org/core/1.2/services/exception" xmlns:ns3="http://weblab.ow2.org/core/1.2/services/trainable" xmlns:ns4="http://weblab.ow2.org/core/1.2/services/analyser" xmlns:ns5="http://weblab.ow2.org/core/1.2/services/resourcecontainer" xmlns:ns6="http://weblab.ow2.org/core/1.2/services/queuemanager" xmlns:ns7="http://weblab.ow2.org/core/1.2/services/searcher" xmlns:ns8="http://weblab.ow2.org/core/1.2/services/configurable" xmlns:wl="http://weblab.ow2.org/core/1.2/model#" xmlns:ns10="http://weblab.ow2.org/core/1.2/services/reportprovider" xmlns:ns11="http://weblab.ow2.org/core/1.2/services/sourcereader" xmlns:ns12="http://weblab.ow2.org/core/1.2/services/indexer">
      <resource xsi:type="wl:Document" uri="weblab://fileRepository/1326441871065/res_940" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <annotation uri="weblab://fileRepository/1326441871065/res_940#a0">
          <data xmlns:ns10="http://weblab.ow2.org/core/1.2/services/searcher" xmlns:ns11="http://weblab.ow2.org/core/1.2/services/indexer" xmlns:ns12="http://weblab.ow2.org/core/1.2/services/queuemanager" xmlns:ns2="http://weblab.ow2.org/core/1.2/services/sourcereader" xmlns:ns3="http://weblab.ow2.org/core/1.2/services/configurable" xmlns:ns4="http://weblab.ow2.org/core/1.2/services/queuemanager" xmlns:ns5="http://weblab.ow2.org/core/1.2/services/reportprovider" xmlns:ns6="http://weblab.ow2.org/core/1.2/model#" xmlns:ns7="http://weblab.ow2.org/core/1.2/services/analyser" xmlns:ns8="http://weblab.ow2.org/core/1.2/services/sourcereader" xmlns:ns9="http://weblab.ow2.org/core/1.2/services/exception">
            <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
              <rdf:Description rdf:about="weblab://fileRepository/1326441871065/res_940" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
                <wp:hasNativeContent rdf:resource="http://perdu.com"/>
                <wp:hasOriginalFileName>index.html</wp:hasOriginalFileName>
                <wp:hasOriginalFileSize>163</wp:hasOriginalFileSize>
                <dcterms:extent>163 bytes</dcterms:extent>
                <dcterms:modified>2010-03-02T19:52:21+0100</dcterms:modified>
                <wp:hasGatheringDate>2011-10-15T00:11:00+0200</wp:hasGatheringDate>
                <dc:source>http://perdu.com</dc:source>
                <dc:title>Vous Etes Perdu</dc:title>
                <wp:hasNormalisedContent rdf:resource="file:/home/fred/Dev/apache-tomcat-6.0.20/bin/data/content/weblab.7127726849319234860.content"/>
              </rdf:Description>
            </rdf:RDF>
          </data>
        </annotation>
        <annotation uri="weblab://fileRepository/1326441871065/res_940#a4">
          <data>
            <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dct="http://purl.org/dc/terms/" xmlns:wlp="http://weblab.ow2.org/core/1.2/ontology/processing#" xmlns:wlr="http://weblab.ow2.org/core/1.2/ontology/retrieval#">
              <rdf:Description rdf:about="weblab://fileRepository/1326441871065/res_940#a4">
                <dct:created>2012-03-29T15:20:59.625+02:00</dct:created>
                <wlp:isProducedBy rdf:resource="http://weblab.ow2.org/service/normaliser/tika"/>
              </rdf:Description>
              <rdf:Description rdf:about="weblab://fileRepository/1326441871065/res_940">
                <dc:title>Vous Etes Perdu ?</dc:title>
              </rdf:Description>
            </rdf:RDF>
          </data>
        </annotation>
        <mediaUnit xsi:type="wl:Text" uri="weblab://fileRepository/1326441871065/res_940#3">
          <content>Perdu sur l'Internet ? Pas de panique, on va vous aider *</content>
        </mediaUnit>
      </resource>
    </ns4:processReturn>
  </soap:Body>
</soap:Envelope>
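To illustrate how a client might read such a response, here is a small Python sketch that pulls out the extracted text and the hasNormalisedContent location. The function name `extract_normalised` is hypothetical, and `SAMPLE` is a trimmed copy of the response above, reduced to the elements the sketch reads; it is not an additional output of the service.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
WP = "http://weblab.ow2.org/core/1.2/ontology/processing#"
XSI = "http://www.w3.org/2001/XMLSchema-instance"

# Trimmed copy of the response shown above.
SAMPLE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <ns4:processReturn xmlns:ns4="http://weblab.ow2.org/core/1.2/services/analyser"
        xmlns:wl="http://weblab.ow2.org/core/1.2/model#">
      <resource xsi:type="wl:Document" uri="weblab://fileRepository/1326441871065/res_940"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <annotation uri="weblab://fileRepository/1326441871065/res_940#a0">
          <data>
            <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
              <rdf:Description rdf:about="weblab://fileRepository/1326441871065/res_940">
                <wp:hasNormalisedContent rdf:resource="file:/home/fred/Dev/apache-tomcat-6.0.20/bin/data/content/weblab.7127726849319234860.content"/>
              </rdf:Description>
            </rdf:RDF>
          </data>
        </annotation>
        <mediaUnit xsi:type="wl:Text" uri="weblab://fileRepository/1326441871065/res_940#3">
          <content>Perdu sur l'Internet ? Pas de panique, on va vous aider *</content>
        </mediaUnit>
      </resource>
    </ns4:processReturn>
  </soap:Body>
</soap:Envelope>"""


def extract_normalised(response_xml):
    """Return (texts, normalised_uris) found in a processReturn envelope."""
    root = ET.fromstring(response_xml)
    texts = []
    # resource, mediaUnit and content carry no namespace in the response.
    for unit in root.iter("mediaUnit"):
        # xsi:type is a prefixed name such as "wl:Text"; the prefix may vary,
        # so only the local part is compared.
        type_name = unit.get("{%s}type" % XSI, "")
        if type_name.split(":")[-1] != "Text":
            continue
        content = unit.find("content")
        if content is not None and content.text:
            texts.append(content.text)
    normalised = [
        element.get("{%s}resource" % RDF)
        for element in root.iter("{%s}hasNormalisedContent" % WP)
    ]
    return texts, normalised


texts, normalised = extract_normalised(SAMPLE)
print(texts)
print(normalised)
```

Matching on namespace URIs rather than on prefixes matters here, since the same ontology is bound to `wp` in one annotation and `wlp` in another.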
Known Limitations
Dependencies
List of all dependencies of this service:
org.ow2.weblab.webservices:tika-normaliser:war:1.8
+- org.ow2.weblab.core:model:jar:1.2.2:compile
+- org.ow2.weblab.core:extended:jar:1.2.2:compile
+- org.ow2.weblab.components:content-manager:jar:1.8.4:compile
|  \- commons-io:commons-io:jar:2.0.1:compile
+- org.ow2.weblab.core.helpers:rdf-helper-jena:jar:1.3.2:compile
|  \- com.hp.hpl.jena:jena:jar:2.6.4:compile
|     +- com.hp.hpl.jena:iri:jar:0.8:compile
|     +- xerces:xercesImpl:jar:2.7.1:compile
|     +- org.slf4j:slf4j-api:jar:1.5.8:compile
|     +- org.slf4j:slf4j-log4j12:jar:1.5.8:runtime
|     \- log4j:log4j:jar:1.2.13:runtime
+- org.apache.tika:tika-core:jar:1.0:compile
+- org.apache.tika:tika-parsers:jar:1.0:compile
|  +- edu.ucar:netcdf:jar:4.2-min:compile
|  +- org.apache.james:apache-mime4j-core:jar:0.7:compile
|  +- org.apache.james:apache-mime4j-dom:jar:0.7:compile
|  +- org.apache.commons:commons-compress:jar:1.2:compile (version managed from 1.3)
|  +- commons-codec:commons-codec:jar:1.5:compile
|  +- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
|  |  +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
|  |  \- org.apache.pdfbox:jempbox:jar:1.6.0:compile
|  +- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
|  +- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
|  +- org.apache.poi:poi:jar:3.8-beta4:compile
|  +- org.apache.poi:poi-scratchpad:jar:3.8-beta4:compile
|  +- org.apache.poi:poi-ooxml:jar:3.8-beta4:compile
|  |  +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta4:compile
|  |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
|  |  \- dom4j:dom4j:jar:1.6.1:compile
|  +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
|  +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
|  +- asm:asm:jar:3.1:compile
|  +- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
|  +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
|  \- rome:rome:jar:0.9:compile
|     \- jdom:jdom:jar:1.0:compile
+- com.ibm.icu:icu4j:jar:3.8:runtime (scope not updated to compile)
+- javax.mail:mail:jar:1.4.1:runtime
|  \- javax.activation:activation:jar:1.1:runtime
+- org.ow2.weblab.core:annotator:jar:1.2.4:compile
|  \- joda-time:joda-time:jar:1.6.2:compile
+- org.apache.cxf:cxf-rt-frontend-jaxws:jar:2.4.0:compile
|  +- xml-resolver:xml-resolver:jar:1.2:compile
|  +- org.apache.cxf:cxf-api:jar:2.4.0:compile
|  |  +- org.apache.cxf:cxf-common-utilities:jar:2.4.0:compile
|  |  +- org.apache.ws.xmlschema:xmlschema-core:jar:2.0:compile
|  |  \- org.apache.neethi:neethi:jar:3.0.0:compile
|  |     +- wsdl4j:wsdl4j:jar:1.6.2:compile
|  |     \- org.codehaus.woodstox:woodstox-core-asl:jar:4.1.1:compile
|  |        \- org.codehaus.woodstox:stax2-api:jar:3.0.2:compile
|  +- org.apache.cxf:cxf-rt-core:jar:2.4.0:compile
|  |  +- com.sun.xml.bind:jaxb-impl:jar:2.1.13:compile
|  |  \- org.apache.geronimo.specs:geronimo-javamail_1.4_spec:jar:1.7.1:compile
|  +- org.apache.cxf:cxf-rt-bindings-soap:jar:2.4.0:compile
|  |  +- org.apache.cxf:cxf-tools-common:jar:2.4.0:compile
|  |  \- org.apache.cxf:cxf-rt-databinding-jaxb:jar:2.4.0:compile
|  +- org.apache.cxf:cxf-rt-bindings-xml:jar:2.4.0:compile
|  +- org.apache.cxf:cxf-rt-frontend-simple:jar:2.4.0:compile
|  \- org.apache.cxf:cxf-rt-ws-addr:jar:2.4.0:compile
+- org.apache.cxf:cxf-rt-transports-http:jar:2.4.0:compile
|  +- org.apache.cxf:cxf-rt-transports-common:jar:2.4.0:compile
|  \- org.springframework:spring-web:jar:3.0.5.RELEASE:compile
|     +- aopalliance:aopalliance:jar:1.0:compile
|     +- org.springframework:spring-beans:jar:3.0.5.RELEASE:compile
|     +- org.springframework:spring-context:jar:3.0.5.RELEASE:compile
|     |  +- org.springframework:spring-aop:jar:3.0.5.RELEASE:compile
|     |  +- org.springframework:spring-expression:jar:3.0.5.RELEASE:compile
|     |  \- org.springframework:spring-asm:jar:3.0.5.RELEASE:compile
|     \- org.springframework:spring-core:jar:3.0.5.RELEASE:compile
+- xalan:xalan:jar:2.7.1:compile
|  \- xalan:serializer:jar:2.7.1:compile
|     \- xml-apis:xml-apis:jar:1.3.04:compile
+- commons-logging:commons-logging:jar:1.1.1:compile
+- junit:junit:jar:4.8.2:test
\- javax.servlet:servlet-api:jar:2.4:provided

