Ngramj-language-extraction/1.2.2

Ngramj Language Extraction component
Details
Service Interfaces: Analyser
Exchange model: WebLab 1.2.2
Licence: LGPL 2.1
Supported OS: Windows/Linux/MacOs
Integrated COTS: NGramJ
Binary: ngramj-language-extraction-1.2.2.war
Sources: ngramj-language-extraction-1.2.2-sources.jar
Javadoc: ngramj-language-extraction-1.2.2-javadoc.jar
SVN: ngramj-language-extraction
Maven Artifact

<groupId>org.ow2.weblab.webservices</groupId>
<artifactId>ngramj-language-extraction</artifactId>
<version>1.2.2</version>
Release Note


This service can be used to identify the language of a Text.

It's a wrapper of the NGramJ project: http://ngramj.sourceforge.net/. It uses the CNGram system, which relies on n-grams of characters to determine the language of a character sequence, instead of n-grams of bytes in raw text files.
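
To make the notion of character n-grams concrete, here is a minimal Java sketch that counts them in a string. It is an illustration only and not NGramJ's actual code: the class name CharNGramSketch, the method countNGrams and the maxN parameter are hypothetical, and a "character" here means a Java char (UTF-16 code unit).

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: counts character n-grams; this is NOT NGramJ's implementation. */
final class CharNGramSketch {

    /** Returns how often each n-gram of 1..maxN characters occurs in the text. */
    static Map<String, Integer> countNGrams(String text, int maxN) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int n = 1; n <= maxN; n++) {
            // Slide a window of n characters over the text.
            for (int i = 0; i + n <= text.length(); i++) {
                counts.merge(text.substring(i, i + n), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=2, ab=2, b=2, ba=1}
        System.out.println(countNGrams("abab", 2));
    }
}
```

Working on characters means that a letter such as "é" counts as one unit, whereas a byte-based approach would see it as two bytes in UTF-8.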

For each input text, the algorithm returns one score per previously learnt language profile (profiles are defined in .ngp files). Each score is a double between 0 (the text is definitely not written in this language) and 1 (the text is definitely written in this language). The scores sum to 1.

Our wrapper annotates every Text section of an input Document (or the Text if the input is Text) using the Dublin Core property DC:language. It fails if the input is something else (e.g. a video). For each Text it uses CNGram to determine which language profiles are the best candidate annotations.

Configuration

It can be configured through constructor arguments, using the CXF/Spring bean loading (IoC) pattern. The class has 5 constructors; each one delegates to the next, more complete one, supplying default values for the parameters it does not take. It contains the following constructors:

  • public LanguageExtraction()
  • public LanguageExtraction(int maxNbValues, boolean addTopLevelAnnot, boolean addMediaUnitLevelAnnot, String profilesFolderPath)
  • public LanguageExtraction(int maxNbValues, boolean addTopLevelAnnot, boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject)
  • public LanguageExtraction(int maxNbValues, boolean addTopLevelAnnot, boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject, String unknownLanguageCode)
  • public LanguageExtraction(int maxNbValues, boolean addTopLevelAnnot, boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject, String unknownLanguageCode, double minSingleValue, double minMultipleValue)

The parameters that can be set are the following:

  • maxNbValues: It's a positive integer value. The list of languages annotated on a given Text will not contain more entries than this value.
  • addTopLevelAnnot: It's a boolean value. It defines whether or not to annotate the whole document with the language extracted from the concatenation of every Text content.
  • addMediaUnitLevelAnnot: It's a boolean value. It defines whether or not to annotate each Text section with the language guessed.
  • profilesFolderPath: It's a String that represents a folder path. This folder contains the .ngp files that will be loaded instead of the 28 default CNGram languages.
  • isProducedByObject: It's a String value that should be a valid URI. It defines the URI to be used as the object of every isProducedBy statement on annotations created by the service.
  • unknownLanguageCode: It's the String value that will be annotated when no language can be clearly identified. When null, nothing is annotated.
  • minSingleValue: It's a double value between 0 and 1. If the best language score is greater than this value, it will be the only one annotated on a given Text.
  • minMultipleValue: It's a double value between 0 and 1. Every language that scores greater than this value will be annotated on a given Text.
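
The way maxNbValues, minSingleValue, minMultipleValue and unknownLanguageCode interact can be sketched in Java as follows. This is one plausible reading of the parameter descriptions above, not the service's actual code: the class LanguageSelectionSketch and its select method are hypothetical, and the precedence between the two thresholds is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the selection rules; the real service may combine them differently. */
final class LanguageSelectionSketch {

    /** Picks the language codes to annotate from a map of language code to score. */
    static List<String> select(Map<String, Double> scores, int maxNbValues,
            double minSingleValue, double minMultipleValue, String unknownLanguageCode) {
        // Rank the candidate languages from best to worst score.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        List<String> selected = new ArrayList<>();
        if (!ranked.isEmpty() && ranked.get(0).getValue() > minSingleValue) {
            // The best language is clear enough to be the only annotation.
            selected.add(ranked.get(0).getKey());
        } else {
            // Otherwise keep every language above minMultipleValue, up to maxNbValues.
            for (Map.Entry<String, Double> entry : ranked) {
                if (selected.size() >= maxNbValues || entry.getValue() <= minMultipleValue) {
                    break;
                }
                selected.add(entry.getKey());
            }
        }
        // Assumption: the unknown code is used only when nothing was selected and it is not null.
        if (selected.isEmpty() && unknownLanguageCode != null) {
            selected.add(unknownLanguageCode);
        }
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = Map.of("en", 0.5, "fr", 0.3, "de", 0.2);
        // prints [en, fr]
        System.out.println(select(scores, 2, 0.75, 0.15, "und"));
    }
}
```

With these rules, a text scoring 0.9 for "en" would get the single annotation "en", whatever the value of maxNbValues.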


These 8 properties are optional. Default values are:

  • maxNbValues: 1
  • addTopLevelAnnot: false
  • addMediaUnitLevelAnnot: true
  • profilesFolderPath: not set; in this case the default CNGram constructor is used, which loads the profiles shipped in the CNGram jar file. These 28 profiles are named using the ISO 639-1 two-letter language codes, so the resulting DC:language annotations will be in this format. If you want to use another format, you have to use a custom profiles folder (containing .ngp files).
  • isProducedByObject: null; in this case, no isProducedBy annotation will be created.
  • unknownLanguageCode: 'und' is used, since it is the code for undetermined in ISO 639-2.
  • minSingleValue: 0.75
  • minMultipleValue: 0.15
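
The constructor chaining and the default values can be pictured with the following Java sketch. It is a stand-in written for illustration, not the source of the real LanguageExtraction class: the class name LanguageExtractionConfigSketch is hypothetical, addMediaUnitLevelAnnot is taken to default to true, and using null for profilesFolderPath to mean "use the profiles bundled with CNGram" is an assumption.

```java
/** Hypothetical stand-in that mirrors the five constructors and their defaults. */
final class LanguageExtractionConfigSketch {
    final int maxNbValues;
    final boolean addTopLevelAnnot;
    final boolean addMediaUnitLevelAnnot;
    final String profilesFolderPath;   // assumption: null means "use the bundled CNGram profiles"
    final String isProducedByObject;   // null: no isProducedBy annotation is created
    final String unknownLanguageCode;
    final double minSingleValue;
    final double minMultipleValue;

    LanguageExtractionConfigSketch() {
        this(1, false, true, null);
    }

    LanguageExtractionConfigSketch(int maxNbValues, boolean addTopLevelAnnot,
            boolean addMediaUnitLevelAnnot, String profilesFolderPath) {
        this(maxNbValues, addTopLevelAnnot, addMediaUnitLevelAnnot, profilesFolderPath, null);
    }

    LanguageExtractionConfigSketch(int maxNbValues, boolean addTopLevelAnnot,
            boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject) {
        this(maxNbValues, addTopLevelAnnot, addMediaUnitLevelAnnot, profilesFolderPath,
                isProducedByObject, "und");
    }

    LanguageExtractionConfigSketch(int maxNbValues, boolean addTopLevelAnnot,
            boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject,
            String unknownLanguageCode) {
        this(maxNbValues, addTopLevelAnnot, addMediaUnitLevelAnnot, profilesFolderPath,
                isProducedByObject, unknownLanguageCode, 0.75, 0.15);
    }

    // The most complete constructor is the only one that assigns the fields.
    LanguageExtractionConfigSketch(int maxNbValues, boolean addTopLevelAnnot,
            boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject,
            String unknownLanguageCode, double minSingleValue, double minMultipleValue) {
        this.maxNbValues = maxNbValues;
        this.addTopLevelAnnot = addTopLevelAnnot;
        this.addMediaUnitLevelAnnot = addMediaUnitLevelAnnot;
        this.profilesFolderPath = profilesFolderPath;
        this.isProducedByObject = isProducedByObject;
        this.unknownLanguageCode = unknownLanguageCode;
        this.minSingleValue = minSingleValue;
        this.minMultipleValue = minMultipleValue;
    }

    public static void main(String[] args) {
        LanguageExtractionConfigSketch defaults = new LanguageExtractionConfigSketch();
        System.out.println(defaults.unknownLanguageCode + " " + defaults.minSingleValue);
    }
}
```

Keeping all assignments in the most complete constructor means each default value is written in exactly one place.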

UsageContext effects

UsageContext is not used in this service.

Examples of SOAP Input/Output

Analyser:process

Input

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:anal="http://weblab.ow2.org/core/1.2/services/analyser">
   <soapenv:Header/>
   <soapenv:Body>
      <anal:processArgs>
         <resource xsi:type="model:Document" uri="weblab://SmallEnglishTest/1" xmlns:model="http://weblab.ow2.org/core/1.2/model#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <mediaUnit xsi:type="model:Text" uri="weblab://SmallEnglishTest/1#0">
               <content>WebLab: An integration infrastructure to ease the development of multimedia processing applications
Patrick GIROUX, Stephan BRUNESSAUX, Sylvie BRUNESSAUX, Jérémie DOUCY, Gérard DUPONT, Bruno GRILHERES, Yann MOMBRUN, Arnaud SAVAL
Information Processing Control and Cognition (IPCC)
EADS Defence and Security Systems
Parc d'Affaire des Portes
27106 Val de Reuil

      
								

http://weblab-project.org

  

ipcc@weblab-project.org


   

{patrick.giroux, stephan.brunessaux, sylvie.brunessaux, jeremie.doucy, gerard.dupont, bruno.grilheres, yann.mombrun, arnaud.saval}@eads.com



Abstract:
In this paper, we introduce the EADS' WebLab platform (http://weblab-project.org) that aims at providing an integration infrastructure for multimedia information processing components. In the following, we explain the motivations that have led to the realisation of this project within EADS and the requirements that have led our choices. After a quick review of existing information processing platforms, we present the chosen service oriented architecture, and the three layers of the WebLab project (infrastructure, services and applications).
Then, we detail the chosen exchange model and normalised services interfaces that enable semantic interoperability between information processing components. We present the technical choices made to guarantee technical interoperability between the components by the use of an Enterprise Service Bus (ESB).
Moreover, we present the orchestration and portal mechanisms that we have added to the WebLab to enable architects to quickly build multimedia processing applications. In the following, we illustrate the integration process by describing three applications that have been developed on top of this architecture on three R&amp;D projects (Vitalas, WebContent and eWok-Hub). Finally, we propose some perspectives such as the realisation of an information processing services directory, or a toolkit following MDA (Model Driven Architecture) approach to ease the integration process.



Keywords:
Integration infrastructure, Service Oriented Architecture, Semantics, Multimedia Information Processing Platform.</content>
            </mediaUnit>
         </resource>
      </anal:processArgs>
   </soapenv:Body>
</soapenv:Envelope>

Output

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
   <soap:Body>
      <ns6:processReturn xmlns:ns2="http://weblab.ow2.org/core/1.2/services/sourcereader" xmlns:ns3="http://weblab.ow2.org/core/1.2/services/reportprovider" xmlns:ns4="http://weblab.ow2.org/core/1.2/services/configurable" xmlns:wl="http://weblab.ow2.org/core/1.2/model#" xmlns:ns6="http://weblab.ow2.org/core/1.2/services/analyser" xmlns:ns7="http://weblab.ow2.org/core/1.2/services/exception" xmlns:ns8="http://weblab.ow2.org/core/1.2/services/indexer" xmlns:ns9="http://weblab.ow2.org/core/1.2/services/trainable" xmlns:ns10="http://weblab.ow2.org/core/1.2/services/resourcecontainer" xmlns:ns11="http://weblab.ow2.org/core/1.2/services/queuemanager" xmlns:ns12="http://weblab.ow2.org/core/1.2/services/searcher">
         <resource xsi:type="wl:Document" uri="weblab://SmallEnglishTest/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <annotation uri="weblab://SmallEnglishTest/1#a2">
               <data>
                  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
                     <rdf:Description rdf:about="weblab://SmallEnglishTest/1">
                        <dc:language>en</dc:language>
                     </rdf:Description>
                     <rdf:Description rdf:about="weblab://SmallEnglishTest/1#a2" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
                        <dcterms:created>2012-03-06T16:21:11+0100</dcterms:created>
                        <wp:isProducedBy rdf:resource="http://weblab.ow2.org/services#LanguageExtraction"/>
                     </rdf:Description>
                  </rdf:RDF>
               </data>
            </annotation>
            <mediaUnit xsi:type="wl:Text" uri="weblab://SmallEnglishTest/1#0">
               <annotation uri="weblab://SmallEnglishTest/1#0-a0">
                  <data>
                     <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
                        <rdf:Description rdf:about="weblab://SmallEnglishTest/1#0">
                           <dc:language>en</dc:language>
                        </rdf:Description>
                        <rdf:Description rdf:about="weblab://SmallEnglishTest/1#0-a0" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
                           <dcterms:created>2012-03-06T16:21:11+0100</dcterms:created>
                           <wp:isProducedBy rdf:resource="http://weblab.ow2.org/services#LanguageExtraction"/>
                        </rdf:Description>
                     </rdf:RDF>
                  </data>
               </annotation>
               <content>WebLab: An integration infrastructure to ease the development of multimedia processing applications
Patrick GIROUX, Stephan BRUNESSAUX, Sylvie BRUNESSAUX, Jérémie DOUCY, Gérard DUPONT, Bruno GRILHERES, Yann MOMBRUN, Arnaud SAVAL
Information Processing Control and Cognition (IPCC)
EADS Defence and Security Systems
Parc d'Affaire des Portes
27106 Val de Reuil

      
								

http://weblab-project.org

  

ipcc@weblab-project.org


   

{patrick.giroux, stephan.brunessaux, sylvie.brunessaux, jeremie.doucy, gerard.dupont, bruno.grilheres, yann.mombrun, arnaud.saval}@eads.com



Abstract:
In this paper, we introduce the EADS' WebLab platform (http://weblab-project.org) that aims at providing an integration infrastructure for multimedia information processing components. In the following, we explain the motivations that have led to the realisation of this project within EADS and the requirements that have led our choices. After a quick review of existing information processing platforms, we present the chosen service oriented architecture, and the three layers of the WebLab project (infrastructure, services and applications).
Then, we detail the chosen exchange model and normalised services interfaces that enable semantic interoperability between information processing components. We present the technical choices made to guarantee technical interoperability between the components by the use of an Enterprise Service Bus (ESB).
Moreover, we present the orchestration and portal mechanisms that we have added to the WebLab to enable architects to quickly build multimedia processing applications. In the following, we illustrate the integration process by describing three applications that have been developed on top of this architecture on three R&amp;D projects (Vitalas, WebContent and eWok-Hub). Finally, we propose some perspectives such as the realisation of an information processing services directory, or a toolkit following MDA (Model Driven Architecture) approach to ease the integration process.



Keywords:
Integration infrastructure, Service Oriented Architecture, Semantics, Multimedia Information Processing Platform.</content>
            </mediaUnit>
         </resource>
      </ns6:processReturn>
   </soap:Body>
</soap:Envelope>
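
A client that receives such a response may want to read the dc:language values back. The following Java sketch does so with the JDK's DOM parser only; it is not part of the service, and the class DcLanguageReaderSketch and its readLanguages method are hypothetical names. The XML in main is a shortened fragment modelled on the output above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical client-side helper: collects dc:language values from an XML string. */
final class DcLanguageReaderSketch {
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns the text of every dc:language element, in document order. */
    static List<String> readLanguages(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed to match elements by namespace URI
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagNameNS(DC_NS, "language");
            List<String> languages = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                languages.add(nodes.item(i).getTextContent().trim());
            }
            return languages;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the XML input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\""
                + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
                + "<rdf:Description rdf:about=\"weblab://SmallEnglishTest/1#0\">"
                + "<dc:language>en</dc:language>"
                + "</rdf:Description></rdf:RDF>";
        // prints [en]
        System.out.println(readLanguages(xml));
    }
}
```

Applied to the full response above, this approach would return one value for the document-level annotation and one for the Text media unit.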

Known Limitations

Dependencies

List of all dependencies of this service:

org.ow2.weblab.webservices:ngramj-language-extraction:war:1.2.2
+- de.spieleck.app.ngramj:cngram:jar:1.0-0.060327:compile
+- org.ow2.weblab.core.helpers:rdf-helper-jena:jar:1.3.2:test
|  \- com.hp.hpl.jena:jena:jar:2.6.4:test
|     +- com.hp.hpl.jena:iri:jar:0.8:test
|     +- com.ibm.icu:icu4j:jar:3.4.4:test
|     +- xerces:xercesImpl:jar:2.7.1:test
|     +- org.slf4j:slf4j-api:jar:1.5.8:test
|     +- org.slf4j:slf4j-log4j12:jar:1.5.8:test
|     \- log4j:log4j:jar:1.2.13:test
+- org.ow2.weblab.core:model:jar:1.2.2:compile
+- org.ow2.weblab.core:extended:jar:1.2.2:compile
+- org.ow2.weblab.core:annotator:jar:1.2.4:compile
|  \- joda-time:joda-time:jar:1.6.2:compile
+- org.apache.cxf:cxf-rt-frontend-jaxws:jar:2.4.0:compile
|  +- xml-resolver:xml-resolver:jar:1.2:compile
|  +- asm:asm:jar:3.3:compile
|  +- org.apache.cxf:cxf-api:jar:2.4.0:compile
|  |  +- org.apache.cxf:cxf-common-utilities:jar:2.4.0:compile
|  |  +- org.apache.ws.xmlschema:xmlschema-core:jar:2.0:compile
|  |  \- org.apache.neethi:neethi:jar:3.0.0:compile
|  |     +- wsdl4j:wsdl4j:jar:1.6.2:compile
|  |     \- org.codehaus.woodstox:woodstox-core-asl:jar:4.1.1:compile
|  |        \- org.codehaus.woodstox:stax2-api:jar:3.0.2:compile
|  +- org.apache.cxf:cxf-rt-core:jar:2.4.0:compile
|  |  +- com.sun.xml.bind:jaxb-impl:jar:2.1.13:compile
|  |  \- org.apache.geronimo.specs:geronimo-javamail_1.4_spec:jar:1.7.1:compile
|  +- org.apache.cxf:cxf-rt-bindings-soap:jar:2.4.0:compile
|  |  +- org.apache.cxf:cxf-tools-common:jar:2.4.0:compile
|  |  \- org.apache.cxf:cxf-rt-databinding-jaxb:jar:2.4.0:compile
|  +- org.apache.cxf:cxf-rt-bindings-xml:jar:2.4.0:compile
|  +- org.apache.cxf:cxf-rt-frontend-simple:jar:2.4.0:compile
|  \- org.apache.cxf:cxf-rt-ws-addr:jar:2.4.0:compile
+- org.apache.cxf:cxf-rt-transports-http:jar:2.4.0:compile
|  +- org.apache.cxf:cxf-rt-transports-common:jar:2.4.0:compile
|  \- org.springframework:spring-web:jar:3.0.5.RELEASE:compile
|     +- aopalliance:aopalliance:jar:1.0:compile
|     +- org.springframework:spring-beans:jar:3.0.5.RELEASE:compile
|     +- org.springframework:spring-context:jar:3.0.5.RELEASE:compile
|     |  +- org.springframework:spring-aop:jar:3.0.5.RELEASE:compile
|     |  +- org.springframework:spring-expression:jar:3.0.5.RELEASE:compile
|     |  \- org.springframework:spring-asm:jar:3.0.5.RELEASE:compile
|     \- org.springframework:spring-core:jar:3.0.5.RELEASE:compile
+- xalan:xalan:jar:2.7.1:compile
|  \- xalan:serializer:jar:2.7.1:compile
|     \- xml-apis:xml-apis:jar:1.3.04:compile
+- commons-logging:commons-logging:jar:1.1.1:compile
+- junit:junit:jar:4.8.2:test
\- javax.servlet:servlet-api:jar:2.4:provided