Ngramj-language-extraction/1.4.0

Ngramj Language Extraction component
Details
Service Interfaces: Analyser
Exchange model: WebLab 1.2.5
Licence: LGPL 2.1
Supported OS: Windows/Linux/MacOs
Integrated COTS: NGramJ
Binary: ngramj-language-extraction-1.4.0.war
Sources: ngramj-language-extraction-1.4.0-sources.jar
Javadoc: ngramj-language-extraction-1.4.0-javadoc.jar
SVN: ngramj-language-extraction
Maven Artifact:
<groupId>org.ow2.weblab.webservices</groupId>
<artifactId>ngramj-language-extraction</artifactId>
<version>1.4.0</version>
Release Note



This service identifies the language of a Text.

It is a wrapper around the NGramJ project (http://ngramj.sourceforge.net/). It uses the CNGram system, which can process character strings rather than only raw text files.

For each input text, the algorithm returns a score for every language profile previously learned (.ngp files). Each score is a double between 0 and 1: 1 means that the text is certainly written in that language, 0 means that it is not. The scores sum to 1.
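To make the score table concrete, here is a minimal Java sketch. It does not call CNGram: the class name ScoreTableSketch and the score values are made up for illustration, and only the two properties stated above (scores summing to 1, highest score as best candidate) are shown.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: the score table below is made up, it is not CNGram output. */
final class ScoreTableSketch {

    /** Sum of all scores; per the description above it should be (close to) 1. */
    static double sum(Map<String, Double> scores) {
        double total = 0.0;
        for (double score : scores.values()) {
            total += score;
        }
        return total;
    }

    /** Language code with the highest score, or null for an empty table. */
    static String best(Map<String, Double> scores) {
        String bestCode = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : scores.entrySet()) {
            if (entry.getValue() > bestScore) {
                bestScore = entry.getValue();
                bestCode = entry.getKey();
            }
        }
        return bestCode;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("en", 0.82); // hypothetical values, one entry per language profile
        scores.put("fr", 0.11);
        scores.put("de", 0.07);
        System.out.println("best=" + best(scores) + ", sum=" + sum(scores));
    }
}
```

Because the scores are doubles, the sum should be compared to 1 with a small tolerance rather than with ==.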

The wrapper annotates every Text section of an input Document (or the Text itself if the input is a Text). It fails if the input is anything else. On each Text, it uses CNGram to determine which language profiles are the best candidates, and annotates them using the DC:language property.

Configuration

It can be configured using the CXF/Spring bean-loading IoC pattern, through constructorArgs. The class has 5 constructors; each simpler constructor calls the next, more complete one, supplying a default value for every parameter it does not define. The constructors are:

  • public LanguageExtraction()
  • public LanguageExtraction(int maxNbValues, boolean addTopLevelAnnot, boolean addMediaUnitLevelAnnot, String profilesFolderPath)
  • public LanguageExtraction(int maxNbValues, boolean addTopLevelAnnot, boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject)
  • public LanguageExtraction(int maxNbValues, boolean addTopLevelAnnot, boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject, String unknownLanguageCode)
  • public LanguageExtraction(int maxNbValues, boolean addTopLevelAnnot, boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject, String unknownLanguageCode, double minSingleValue, double minMultipleValue)
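The chaining between these constructors can be sketched as follows. LanguageExtractionConfigSketch is a hypothetical stand-in that only mirrors the signatures listed above; it is not the service class. The default values are those documented on this page, and using null for "no custom profiles folder" is an assumption.

```java
/** Hypothetical stand-in mirroring the listed signatures; not the service class itself. */
final class LanguageExtractionConfigSketch {
    final int maxNbValues;
    final boolean addTopLevelAnnot;
    final boolean addMediaUnitLevelAnnot;
    final String profilesFolderPath;
    final String isProducedByObject;
    final String unknownLanguageCode;
    final double minSingleValue;
    final double minMultipleValue;

    LanguageExtractionConfigSketch() {
        // Documented defaults: 1 value, no top-level annotation, annotation on each Text.
        // null for the folder is an assumption meaning "use the profiles bundled with CNGram".
        this(1, false, true, null);
    }

    LanguageExtractionConfigSketch(int maxNbValues, boolean addTopLevelAnnot,
            boolean addMediaUnitLevelAnnot, String profilesFolderPath) {
        // null: no isProducedBy statement is created.
        this(maxNbValues, addTopLevelAnnot, addMediaUnitLevelAnnot, profilesFolderPath, null);
    }

    LanguageExtractionConfigSketch(int maxNbValues, boolean addTopLevelAnnot,
            boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject) {
        this(maxNbValues, addTopLevelAnnot, addMediaUnitLevelAnnot, profilesFolderPath,
                isProducedByObject, "und");
    }

    LanguageExtractionConfigSketch(int maxNbValues, boolean addTopLevelAnnot,
            boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject,
            String unknownLanguageCode) {
        this(maxNbValues, addTopLevelAnnot, addMediaUnitLevelAnnot, profilesFolderPath,
                isProducedByObject, unknownLanguageCode, 0.75, 0.15);
    }

    LanguageExtractionConfigSketch(int maxNbValues, boolean addTopLevelAnnot,
            boolean addMediaUnitLevelAnnot, String profilesFolderPath, String isProducedByObject,
            String unknownLanguageCode, double minSingleValue, double minMultipleValue) {
        // Only the most complete constructor assigns fields; all others delegate to it.
        this.maxNbValues = maxNbValues;
        this.addTopLevelAnnot = addTopLevelAnnot;
        this.addMediaUnitLevelAnnot = addMediaUnitLevelAnnot;
        this.profilesFolderPath = profilesFolderPath;
        this.isProducedByObject = isProducedByObject;
        this.unknownLanguageCode = unknownLanguageCode;
        this.minSingleValue = minSingleValue;
        this.minMultipleValue = minMultipleValue;
    }
}
```

With this layout each default is written in exactly one place, so a bean declared with 4 constructor arguments still gets the documented values for the remaining 4.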

The parameters that can be set are the following:

  • maxNbValues: a positive integer. The number of languages annotated on a given Text cannot exceed this value.
  • addTopLevelAnnot: a boolean. It defines whether to annotate the whole document with the language extracted from the concatenation of every Text content.
  • addMediaUnitLevelAnnot: a boolean. It defines whether to annotate each Text section with the language guessed for it.
  • profilesFolderPath: a String representing a folder path. This folder contains the .ngp files that will be loaded instead of the 28 default CNGram languages.
  • isProducedByObject: a String that should be a valid URI. It is used as the object of every isProducedBy statement on the annotations created by the service.
  • unknownLanguageCode: the String value annotated when no language can be clearly identified. When null, nothing is annotated.
  • minSingleValue: a double between 0 and 1. If the best language score is greater than this value, that language is the only one annotated on a given Text.
  • minMultipleValue: a double between 0 and 1. Every language whose score is greater than this value is annotated on a given Text.
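One possible reading of how maxNbValues, minSingleValue, minMultipleValue and unknownLanguageCode combine is sketched below. The class LanguageSelectionSketch and the order in which the rules are applied are assumptions made for illustration; the page does not specify the exact algorithm.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative selection logic; the rule ordering is an assumption, not the service code. */
final class LanguageSelectionSketch {

    static List<String> select(Map<String, Double> scores, int maxNbValues,
            double minSingleValue, double minMultipleValue, String unknownLanguageCode) {
        // Rank candidates from the highest score to the lowest.
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(scores.entrySet());
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        List<String> selected = new ArrayList<>();
        // Rule 1: a best score above minSingleValue wins alone.
        if (!ranked.isEmpty() && ranked.get(0).getValue() > minSingleValue) {
            selected.add(ranked.get(0).getKey());
            return selected;
        }
        // Rule 2: otherwise keep every score above minMultipleValue, up to maxNbValues.
        for (Map.Entry<String, Double> entry : ranked) {
            if (selected.size() >= maxNbValues) {
                break;
            }
            if (entry.getValue() > minMultipleValue) {
                selected.add(entry.getKey());
            }
        }
        // Rule 3: nothing clear enough; fall back to the unknown code unless it is null.
        if (selected.isEmpty() && unknownLanguageCode != null) {
            selected.add(unknownLanguageCode);
        }
        return selected;
    }
}
```

Under this reading, a maxNbValues of 1 means at most one language is ever annotated per Text, whatever the two thresholds are.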


All 8 parameters are optional. The default values are:

  • maxNbValues: 1
  • addTopLevelAnnot: false
  • addMediaUnitLevelAnnot: true
  • profilesFolderPath: by default, the default constructor for CNGram profiles is used, which loads the default profiles bundled in the CNGram jar file. These 28 profiles are named using ISO 639-1 two-letter language codes, so the resulting DC:language annotations will be in this format. If you want another format, you have to use a custom profiles folder (containing .ngp files).
  • isProducedByObject: null; in this case, no isProducedBy annotation will be created.
  • unknownLanguageCode: 'und', since it is the code for undetermined in ISO 639-2 and ISO 639-3.
  • minSingleValue: 0.75
  • minMultipleValue: 0.15
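When preparing a custom profiles folder, it can help to check which profile names it contains. The sketch below lists the .ngp files of a folder and strips the extension. ProfileFolderSketch is a hypothetical helper, and the idea that the profile file name (without .ngp) is the value that ends up in DC:language is an assumption drawn from the remark above about the default profiles.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: lists profile names found in a custom profiles folder. */
final class ProfileFolderSketch {

    private static final String EXTENSION = ".ngp";

    /** Returns the sorted names of the .ngp files in the folder, without the extension. */
    static List<String> profileNames(Path folder) throws IOException {
        List<String> names = new ArrayList<>();
        // The glob only matches direct children of the folder whose name ends with .ngp.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(folder, "*" + EXTENSION)) {
            for (Path profile : stream) {
                String fileName = profile.getFileName().toString();
                names.add(fileName.substring(0, fileName.length() - EXTENSION.length()));
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

For instance, a folder holding en.ngp and fra.ngp would yield the names en and fra, which mixes two code formats and is probably not what a consumer of the annotations expects.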

UsageContext effects

UsageContext is not used in this service.

Examples of SOAP Input/Output

Analyser:process

Input

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:anal="http://weblab.ow2.org/core/1.2/services/analyser">
   <soapenv:Header/>
   <soapenv:Body>
      <anal:processArgs>
         <resource xsi:type="model:Document" uri="weblab://SmallEnglishTest/1" xmlns:model="http://weblab.ow2.org/core/1.2/model#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <mediaUnit xsi:type="model:Text" uri="weblab://SmallEnglishTest/1#0">
               <content>WebLab: An integration infrastructure to ease the development of multimedia processing applications
Patrick GIROUX, Stephan BRUNESSAUX, Sylvie BRUNESSAUX, Jérémie DOUCY, Gérard DUPONT, Bruno GRILHERES, Yann MOMBRUN, Arnaud SAVAL
Information Processing Control and Cognition (IPCC)
EADS Defence and Security Systems
Parc d'Affaire des Portes
27106 Val de Reuil
 
 
 
 
http://weblab-project.org
 
 
 
ipcc@weblab-project.org
 
 
 
 
{patrick.giroux, stephan.brunessaux, sylvie.brunessaux, jeremie.doucy, gerard.dupont, bruno.grilheres, yann.mombrun, arnaud.saval}@eads.com
 
 
 
Abstract:
In this paper, we introduce the EADS' WebLab platform (http://weblab-project.org) that aims at providing an integration infrastructure for multimedia information processing components. In the following, we explain the motivations that have led to the realisation of this project within EADS and the requirements that have led our choices. After a quick review of existing information processing platforms, we present the chosen service oriented architecture, and the three layers of the WebLab project (infrastructure, services and applications).
Then, we detail the chosen exchange model and normalised services interfaces that enable semantic interoperability between information processing components. We present the technical choices made to guarantee technical interoperability between the components by the use of an Enterprise Service Bus (ESB).
Moreover, we present the orchestration and portal mechanisms that we have added to the WebLab to enable architects to quickly build multimedia processing applications. In the following, we illustrate the integration process by describing three applications that have been developed on top of this architecture on three R&amp;D projects (Vitalas, WebContent and eWok-Hub). Finally, we propose some perspectives such as the realisation of an information processing services directory, or a toolkit following MDA (Model Driven Architecture) approach to ease the integration process.
 
 
 
Keywords:
Integration infrastructure, Service Oriented Architecture, Semantics, Multimedia Information Processing Platform.</content>
            </mediaUnit>
         </resource>
      </anal:processArgs>
   </soapenv:Body>
</soapenv:Envelope>

Output

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
   <soap:Body>
      <anal:processReturn xmlns:wl="http://weblab.ow2.org/core/1.2/model#" xmlns:anal="http://weblab.ow2.org/core/1.2/services/analyser">
         <resource xsi:type="wl:Document" uri="weblab://SmallEnglishTest/1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <annotation uri="weblab://SmallEnglishTest/1#a2">
               <data>
                  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
                     <rdf:Description rdf:about="weblab://SmallEnglishTest/1">
                        <dc:language>en</dc:language>
                     </rdf:Description>
                     <rdf:Description rdf:about="weblab://SmallEnglishTest/1#a2" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
                        <dcterms:created>2012-03-06T16:21:11+0100</dcterms:created>
                        <wp:isProducedBy rdf:resource="http://weblab.ow2.org/services#LanguageExtraction"/>
                     </rdf:Description>
                  </rdf:RDF>
               </data>
            </annotation>
            <mediaUnit xsi:type="wl:Text" uri="weblab://SmallEnglishTest/1#0">
               <annotation uri="weblab://SmallEnglishTest/1#0-a0">
                  <data>
                     <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/">
                        <rdf:Description rdf:about="weblab://SmallEnglishTest/1#0">
                           <dc:language>en</dc:language>
                        </rdf:Description>
                        <rdf:Description rdf:about="weblab://SmallEnglishTest/1#0-a0" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
                           <dcterms:created>2012-03-06T16:21:11+0100</dcterms:created>
                           <wp:isProducedBy rdf:resource="http://weblab.ow2.org/services#LanguageExtraction"/>
                        </rdf:Description>
                     </rdf:RDF>
                  </data>
               </annotation>
               <content>WebLab: An integration infrastructure to ease the development of multimedia processing applications
Patrick GIROUX, Stephan BRUNESSAUX, Sylvie BRUNESSAUX, Jérémie DOUCY, Gérard DUPONT, Bruno GRILHERES, Yann MOMBRUN, Arnaud SAVAL
Information Processing Control and Cognition (IPCC)
EADS Defence and Security Systems
Parc d'Affaire des Portes
27106 Val de Reuil
 
 
 
 
http://weblab-project.org
 
 
 
ipcc@weblab-project.org
 
 
 
 
{patrick.giroux, stephan.brunessaux, sylvie.brunessaux, jeremie.doucy, gerard.dupont, bruno.grilheres, yann.mombrun, arnaud.saval}@eads.com
 
 
 
Abstract:
In this paper, we introduce the EADS' WebLab platform (http://weblab-project.org) that aims at providing an integration infrastructure for multimedia information processing components. In the following, we explain the motivations that have led to the realisation of this project within EADS and the requirements that have led our choices. After a quick review of existing information processing platforms, we present the chosen service oriented architecture, and the three layers of the WebLab project (infrastructure, services and applications).
Then, we detail the chosen exchange model and normalised services interfaces that enable semantic interoperability between information processing components. We present the technical choices made to guarantee technical interoperability between the components by the use of an Enterprise Service Bus (ESB).
Moreover, we present the orchestration and portal mechanisms that we have added to the WebLab to enable architects to quickly build multimedia processing applications. In the following, we illustrate the integration process by describing three applications that have been developed on top of this architecture on three R&amp;D projects (Vitalas, WebContent and eWok-Hub). Finally, we propose some perspectives such as the realisation of an information processing services directory, or a toolkit following MDA (Model Driven Architecture) approach to ease the integration process.
 
 
 
Keywords:
Integration infrastructure, Service Oriented Architecture, Semantics, Multimedia Information Processing Platform.</content>
            </mediaUnit>
         </resource>
      </anal:processReturn>
   </soap:Body>
</soap:Envelope>
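
A client receiving such a response may want to read the annotated languages back. The following is a minimal sketch using only the JDK DOM parser; the class LanguageAnnotationReaderSketch and its output format ("subject=language") are choices made for this illustration, not part of the WebLab API. It relies on the RDF and Dublin Core namespaces visible in the output above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical client-side helper: extracts dc:language statements from a response. */
final class LanguageAnnotationReaderSketch {

    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String DC_NS = "http://purl.org/dc/elements/1.1/";

    /** Returns one "subject-uri=language" entry per dc:language element, in document order. */
    static List<String> readLanguages(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required to look elements up by namespace URI
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document document = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        List<String> result = new ArrayList<>();
        NodeList languages = document.getElementsByTagNameNS(DC_NS, "language");
        for (int i = 0; i < languages.getLength(); i++) {
            Element language = (Element) languages.item(i);
            // The enclosing rdf:Description carries the annotated resource in rdf:about.
            Element description = (Element) language.getParentNode();
            String subject = description.getAttributeNS(RDF_NS, "about");
            result.add(subject + "=" + language.getTextContent().trim());
        }
        return result;
    }
}
```

Applied to the output above, this reads one statement for the Document (weblab://SmallEnglishTest/1) and one for its Text section (weblab://SmallEnglishTest/1#0), both with the value en.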

Known Limitations

N/A


Dependencies

org.ow2.weblab.webservices:ngramj-language-extraction:war:1.4.0
+- de.spieleck.app.ngramj:cngram:jar:1.0-0.060327:compile
+- log4j:log4j:jar:1.2.17:runtime
+- org.ow2.weblab.core:model:jar:1.2.5:compile
+- org.ow2.weblab.core:extended:jar:1.2.5:compile
|  \- commons-io:commons-io:jar:2.4:compile
+- org.ow2.weblab.core:annotator:jar:1.2.6:compile
|  \- joda-time:joda-time:jar:2.3:compile
+- org.apache.cxf:cxf-rt-frontend-jaxws:jar:2.6.11:compile
|  +- xml-resolver:xml-resolver:jar:1.2:compile
|  +- asm:asm:jar:3.3.1:compile
|  +- org.apache.cxf:cxf-api:jar:2.6.11:compile
|  |  +- org.codehaus.woodstox:woodstox-core-asl:jar:4.2.0:compile
|  |  |  \- org.codehaus.woodstox:stax2-api:jar:3.1.1:compile
|  |  +- org.apache.ws.xmlschema:xmlschema-core:jar:2.0.3:compile
|  |  +- org.apache.geronimo.specs:geronimo-javamail_1.4_spec:jar:1.7.1:compile
|  |  \- wsdl4j:wsdl4j:jar:1.6.3:compile
|  +- org.apache.cxf:cxf-rt-core:jar:2.6.11:compile
|  |  \- com.sun.xml.bind:jaxb-impl:jar:2.2.5.1:compile
|  +- org.apache.cxf:cxf-rt-bindings-soap:jar:2.6.11:compile
|  |  \- org.apache.cxf:cxf-rt-databinding-jaxb:jar:2.6.11:compile
|  +- org.apache.cxf:cxf-rt-bindings-xml:jar:2.6.11:compile
|  +- org.apache.cxf:cxf-rt-frontend-simple:jar:2.6.11:compile
|  \- org.apache.cxf:cxf-rt-ws-addr:jar:2.6.11:compile
|     \- org.apache.cxf:cxf-rt-ws-policy:jar:2.6.11:compile
|        \- org.apache.neethi:neethi:jar:3.0.2:compile
+- org.apache.cxf:cxf-rt-transports-http:jar:2.6.11:compile
+- org.apache.cxf:cxf-rt-management:jar:2.6.11:compile
+- org.springframework:spring-web:jar:3.0.7.RELEASE:compile
|  +- aopalliance:aopalliance:jar:1.0:compile
|  +- org.springframework:spring-beans:jar:3.0.7.RELEASE:compile
|  +- org.springframework:spring-context:jar:3.0.7.RELEASE:compile
|  |  +- org.springframework:spring-aop:jar:3.0.7.RELEASE:compile
|  |  +- org.springframework:spring-expression:jar:3.0.7.RELEASE:compile
|  |  \- org.springframework:spring-asm:jar:3.0.7.RELEASE:compile
|  \- org.springframework:spring-core:jar:3.0.7.RELEASE:compile
+- xalan:xalan:jar:2.7.1:compile
|  \- xalan:serializer:jar:2.7.1:compile
+- xml-apis:xml-apis:jar:1.3.04:compile
+- commons-logging:commons-logging:jar:1.1.3:compile
+- junit:junit:jar:4.11:test
|  \- org.hamcrest:hamcrest-core:jar:1.3:test
\- javax.servlet:servlet-api:jar:2.5:provided