Folder-listener/1.1

Folder Listener
Details
Service Interfaces: QueueManager
Exchange model: WebLab 1.2.2
Licence: LGPL 2.1
Supported OS: Windows/Linux/Mac OS
Integrated COTS: Commons-IO
Binary: folder-listener-1.1.war
Sources: folder-listener-1.1-sources.jar
Javadoc: folder-listener-1.1-javadoc.jar
SVN: folder-listener
Maven Artifact:
<groupId>org.ow2.weblab.webservices</groupId>
<artifactId>folder-listener</artifactId>
<version>1.1</version>
Release Note


This service browses a directory and creates WebLab Resources that refer to the files it finds. It can also listen to a directory and automatically update the list of files to process.

The service actually contains three distinct implementations of this functionality, depending on the kind of file to be read:

  • DefaultFolderListener reads any kind of file and generates one simple empty Document per file.
  • WarcListener reads Web ARChive (WARC) files. It generates one simple empty Document per entry in each WARC found.
  • WebLabResourceCrawler reads WebLab XML resource files. It returns the unmarshalled resource for each XML file.
TODO: This part has to be completed. Add output samples of WarcListener and WebLabResourceCrawler.

Configuration

The configuration is done in the cxfBeanFile.xml file. It depends on the implementation you want to use, although many parameters are common to all of them.

DefaultFolderListener configuration

	<jaxws:endpoint id="folder-listener" implementor="#fileListener" address="/folder-listener" />
 
	<bean id="fileListener" class="org.ow2.weblab.webservices.listener.DefaultFolderListener">
		<constructor-arg ref="file-config" />
	</bean>
 
	<bean id="file-config" class="org.ow2.weblab.webservices.listener.bean.FolderListenerConfigBean">
		<property name="fileUrlPattern" value=".*" />
		<property name="updateInterval" value="1000" />
		<property name="delete" value="false" />
		<property name="recursive" value="true" />
		<property name="folders">
			<list>
				<bean class="java.io.File">
					<constructor-arg>
						<value>toIndex</value>
					</constructor-arg>
				</bean>
			</list>
		</property>
		<property name="preInitialisedUsageContext">
			<list />
		</property>
		<property name="failForNonInitialisedUsageContext" value="false" />
		<property name="defaultUsageContext" value="" />
		<property name="oneShotCrawler" value="false" />
	</bean>

The default folder listener supports the following parameters:

  • fileUrlPattern: a regexp that file names must match to be accepted. By default all files are allowed: .*.
  • updateInterval: the time in ms between two checks for new files in a directory. By default: 1000 ms.
  • folders: the list of folders to listen to. By default: the toIndex folder.
  • delete: whether or not to delete files after gathering. By default: false.
  • recursive: whether to descend into subdirectories. By default: true.
  • preInitialisedUsageContext: list of usageContext values that can be initialised before start (to save time at the first call). By default: none.
  • failForNonInitialisedUsageContext: whether to fail on an unknown usageContext. By default: false.
  • defaultUsageContext: the usageContext to use when the one in the request is empty. By default: "".
  • oneShotCrawler: whether to search for files once at initialisation and freeze the list of files to gather (no listening mode). By default: false.
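
To illustrate how folders, recursive and fileUrlPattern could interact, here is a minimal Java sketch. It is not the service's code: the class FolderScanSketch and its scan method are hypothetical, it relies on java.nio rather than Commons-IO, and it assumes that the pattern has to match the whole file URL.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of folder-listener.
class FolderScanSketch {

    // Returns the regular files under 'folder' whose URL fully matches 'fileUrlPattern'
    // (full-match semantics are an assumption of this sketch).
    static List<Path> scan(Path folder, String fileUrlPattern, boolean recursive) {
        Pattern pattern = Pattern.compile(fileUrlPattern);
        int depth = recursive ? Integer.MAX_VALUE : 1; // depth 1 = direct children only
        try (Stream<Path> paths = Files.walk(folder, depth)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> pattern.matcher(p.toUri().toString()).matches())
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path folder = Paths.get(args.length > 0 ? args[0] : "toIndex");
        if (Files.isDirectory(folder)) {
            // Same values as the sample configuration: all files, recursive.
            scan(folder, ".*", true).forEach(System.out::println);
        }
    }
}
```

Calling scan with recursive set to false only looks at the direct children of the folder, which mirrors the description of the recursive parameter above.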


WarcListener configuration

	<jaxws:endpoint id="warc-listener" implementor="#warcListener" address="/warc-listener" />
 
	<bean id="warcListener" class="org.ow2.weblab.webservices.listener.WarcListener">
		<constructor-arg ref="warc-config" />
	</bean>
 
	<bean id="warc-config" class="org.ow2.weblab.webservices.listener.bean.WarcConfigBean">
		<property name="fileUrlPattern" value="(.*\.warc)|(.*\.warc\.gz)" />
		<property name="updateInterval" value="1000" />
		<property name="delete" value="false" />
		<property name="recursive" value="true" />
		<property name="folders">
			<list>
				<bean class="java.io.File">
					<constructor-arg>
						<value>warcs</value>
					</constructor-arg>
				</bean>
			</list>
		</property>
		<property name="preInitialisedUsageContext">
			<list />
		</property>
		<property name="failForNonInitialisedUsageContext" value="false" />
		<property name="defaultUsageContext" value="" />
		<property name="oneShotCrawler" value="false" />
 
		<property name="exposedAsPrefix" value="http://localhost:8080/wayback/" />
		<property name="acceptedContentTypePattern" value="(.*html.*)|(.*postscript.*)|(.*octet-stream.*)|(.*excel.*)|(.*powerpoint.*)|(.*word.*)|(.*pdf.*)|(.*opendocument.*)|(.*star.*)" />
		<property name="acceptedMessagePattern" value="(.*response.*)" />
		<property name="acceptedHeaders" value=".*HTTP\S*\s*2\d{2}.*" />
	</bean>

The WARC listener supports the following parameters:

  • fileUrlPattern: a regexp that file names must match to be accepted. By default only WARC files are allowed: (.*\.warc)|(.*\.warc\.gz) (i.e. WARCs and gzipped WARCs).
  • updateInterval: the time in ms between two checks for new files in a directory. By default: 1000 ms.
  • folders: the list of folders to listen to. By default: the warcs folder.
  • delete: whether or not to delete files after gathering. By default: false.
  • recursive: whether to descend into subdirectories. By default: true.
  • preInitialisedUsageContext: list of usageContext values that can be initialised before start (to save time at the first call). By default: none.
  • failForNonInitialisedUsageContext: whether to fail on an unknown usageContext. By default: false.
  • defaultUsageContext: the usageContext to use when the one in the request is empty. By default: "".
  • oneShotCrawler: whether to search for files once at initialisation and freeze the list of files to gather (no listening mode). By default: false.
  • exposedAsPrefix: the first part of the URL to be used as isExposedAs URI (most of the time it is the location of the Wayback machine webapp). By default: http://localhost:8080/wayback/.
  • acceptedContentTypePattern: the regexp that the content type of a WARC entry must match for the entry to be crawled. By default: (.*html.*)|(.*postscript.*)|(.*octet-stream.*)|(.*excel.*)|(.*powerpoint.*)|(.*word.*)|(.*pdf.*)|(.*opendocument.*)|(.*star.*).
  • acceptedMessagePattern: the regexp that the message type must match for the entry to be crawled. By default: (.*response.*).
  • acceptedHeaders: the regexp that the HTTP header of the response must match for the entry to be crawled. By default: .*HTTP\S*\s*2\d{2}.* (i.e. HTTP responses with a 2xx status).
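
The three WARC-specific patterns can be tried out in isolation. The following Java sketch applies the default values to a few sample strings. The class WarcFilterSketch and its accept method are hypothetical, and the sketch assumes that each value must match its pattern entirely; how the service itself applies the patterns is not documented here.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the WarcListener default filters; not the service's code.
class WarcFilterSketch {

    static final Pattern CONTENT_TYPE = Pattern.compile(
        "(.*html.*)|(.*postscript.*)|(.*octet-stream.*)|(.*excel.*)|(.*powerpoint.*)"
        + "|(.*word.*)|(.*pdf.*)|(.*opendocument.*)|(.*star.*)");
    static final Pattern MESSAGE_TYPE = Pattern.compile("(.*response.*)");
    static final Pattern HEADERS = Pattern.compile(".*HTTP\\S*\\s*2\\d{2}.*");

    // Assumption: an entry is kept only if all three values fully match their pattern.
    static boolean accept(String messageType, String statusLine, String contentType) {
        return MESSAGE_TYPE.matcher(messageType).matches()
            && HEADERS.matcher(statusLine).matches()
            && CONTENT_TYPE.matcher(contentType).matches();
    }

    public static void main(String[] args) {
        // A successful HTML response is accepted.
        System.out.println(accept("response", "HTTP/1.1 200 OK", "text/html; charset=UTF-8")); // true
        // A 404 response is rejected by the header pattern.
        System.out.println(accept("response", "HTTP/1.1 404 Not Found", "text/html")); // false
        // A request record is rejected by the message pattern.
        System.out.println(accept("request", "HTTP/1.1 200 OK", "text/html")); // false
        // An image is rejected by the content type pattern.
        System.out.println(accept("response", "HTTP/1.1 200 OK", "image/png")); // false
    }
}
```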

WebLabResourceCrawler configuration

	<jaxws:endpoint id="resource-listener" implementor="#resourceListener" address="/resource-listener" />
 
	<bean id="resourceListener" class="org.ow2.weblab.webservices.listener.WebLabResourceCrawler">
		<constructor-arg ref="resource-config" />
	</bean>
 
	<bean id="resource-config" class="org.ow2.weblab.webservices.listener.bean.FolderListenerConfigBean">
		<property name="fileUrlPattern" value=".*.xml" />
		<property name="updateInterval" value="1000" />
		<property name="delete" value="false" />
		<property name="recursive" value="true" />
		<property name="folders">
			<list>
				<bean class="java.io.File">
					<constructor-arg>
						<value>toIndex</value>
					</constructor-arg>
				</bean>
			</list>
		</property>
		<property name="preInitialisedUsageContext">
			<list />
		</property>
		<property name="failForNonInitialisedUsageContext" value="true" />
		<property name="defaultUsageContext" value="" />
		<property name="oneShotCrawler" value="false" />
	</bean>

The WebLab resource listener supports the following parameters:

  • fileUrlPattern: a regexp that file names must match to be accepted. By default only XML files are allowed: .*.xml.
  • updateInterval: the time in ms between two checks for new files in a directory. By default: 1000 ms.
  • folders: the list of folders to listen to. By default: the toIndex folder.
  • delete: whether or not to delete files after gathering. By default: false.
  • recursive: whether to descend into subdirectories. By default: true.
  • preInitialisedUsageContext: list of usageContext values that can be initialised before start (to save time at the first call). By default: none.
  • failForNonInitialisedUsageContext: whether to fail on an unknown usageContext. The configuration above sets it to true.
  • defaultUsageContext: the usageContext to use when the one in the request is empty. By default: "".
  • oneShotCrawler: whether to search for files once at initialisation and freeze the list of files to gather (no listening mode). By default: false.
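
The default pattern .*.xml leaves its second dot unescaped, so that dot matches any character and not only a literal ".". The Java sketch below compares it with the stricter .*\.xml. The class XmlPatternSketch is hypothetical and assumes full-match semantics; whether the difference matters in practice depends on the file names present in the folder.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not the service's code.
class XmlPatternSketch {

    // Full-match test of a file name against a regexp (assumed semantics).
    static boolean matches(String regex, String fileName) {
        return Pattern.matches(regex, fileName);
    }

    public static void main(String[] args) {
        System.out.println(matches(".*.xml", "resource.xml"));   // true
        System.out.println(matches(".*.xml", "notes_xml"));      // true: '.' matched '_'
        System.out.println(matches(".*\\.xml", "resource.xml")); // true
        System.out.println(matches(".*\\.xml", "notes_xml"));    // false
    }
}
```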


UsageContext effects

The usage context distinguishes between separate queues. For example, if the folder to crawl contains 5 files, then after 5 calls with the same usageContext, every further call with that usageContext throws an EmptyQueueException. A call with a different usageContext starts from a fresh queue, so the component can return the same resources again.
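
The behaviour described above can be modelled with one queue per usageContext. The Java sketch below is only a model, not the service's implementation: the class UsageContextSketch is hypothetical, and the nested EmptyQueueException is a local stand-in for the WebLab exception of the same name.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of per-usageContext queues; not the service's code.
class UsageContextSketch {

    // Local stand-in for the WebLab EmptyQueueException.
    static class EmptyQueueException extends Exception {
        EmptyQueueException(String message) {
            super(message);
        }
    }

    private final List<String> files;
    private final Map<String, Deque<String>> queues = new HashMap<>();

    UsageContextSketch(List<String> files) {
        this.files = files;
    }

    // Each usageContext gets its own copy of the file list the first time it is seen.
    String nextResource(String usageContext) throws EmptyQueueException {
        Deque<String> queue = queues.computeIfAbsent(usageContext, k -> new ArrayDeque<>(files));
        String next = queue.poll();
        if (next == null) {
            throw new EmptyQueueException("No more files for usageContext '" + usageContext + "'");
        }
        return next;
    }

    public static void main(String[] args) {
        UsageContextSketch listener =
            new UsageContextSketch(Arrays.asList("f1", "f2", "f3", "f4", "f5"));
        try {
            for (int i = 0; i < 5; i++) {
                System.out.println("a -> " + listener.nextResource("a"));
            }
            // Another usageContext starts again from the first file.
            System.out.println("b -> " + listener.nextResource("b"));
            // The sixth call with "a" finds an empty queue.
            listener.nextResource("a");
        } catch (EmptyQueueException e) {
            System.out.println(e.getMessage());
        }
    }
}
```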

Examples of SOAP Input/Output

QueueManager:nextResource

Input

<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/" xmlns:que="http://weblab.ow2.org/core/1.2/services/queuemanager">
   <soapenv:Header/>
   <soapenv:Body>
      <que:nextResourceArgs/>
   </soapenv:Body>
</soapenv:Envelope>
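
For testing outside of a SOAP tool, the envelope above can be posted with the JDK HTTP client (Java 11 or later). This is a sketch under assumptions: the class NextResourceClientSketch is hypothetical, the endpoint URL depends on how the war is deployed, and the empty SOAPAction value is a guess that should be checked against the service WSDL.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client sketch; not shipped with folder-listener.
class NextResourceClientSketch {

    // Same envelope as the input sample, on one line.
    static final String ENVELOPE =
        "<soapenv:Envelope xmlns:soapenv=\"http://schemas.xmlsoap.org/soap/envelope/\""
        + " xmlns:que=\"http://weblab.ow2.org/core/1.2/services/queuemanager\">"
        + "<soapenv:Header/><soapenv:Body><que:nextResourceArgs/></soapenv:Body>"
        + "</soapenv:Envelope>";

    // Builds a SOAP 1.1 style POST request for the given endpoint.
    static HttpRequest buildRequest(String endpoint) {
        return HttpRequest.newBuilder(URI.create(endpoint))
            .header("Content-Type", "text/xml; charset=utf-8")
            .header("SOAPAction", "\"\"") // assumption: check the WSDL for the real value
            .POST(HttpRequest.BodyPublishers.ofString(ENVELOPE))
            .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed deployment URL: <context path>/folder-listener as in the endpoint address.
        String endpoint = args.length > 0
            ? args[0]
            : "http://localhost:8080/folder-listener/folder-listener";
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(buildRequest(endpoint), HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```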

Output from DefaultFolderListener

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
   <soap:Body>
      <ns9:nextResourceReturn xmlns:ns2="http://weblab.ow2.org/core/1.2/services/analyser" xmlns:ns3="http://weblab.ow2.org/core/1.2/services/searcher" xmlns:ns4="http://weblab.ow2.org/core/1.2/services/configurable" xmlns:ns5="http://weblab.ow2.org/core/1.2/services/resourcecontainer" xmlns:ns6="http://weblab.ow2.org/core/1.2/model#" xmlns:ns7="http://weblab.ow2.org/core/1.2/services/trainable" xmlns:ns8="http://weblab.ow2.org/core/1.2/services/exception" xmlns:ns9="http://weblab.ow2.org/core/1.2/services/queuemanager" xmlns:ns10="http://weblab.ow2.org/core/1.2/services/sourcereader" xmlns:ns11="http://weblab.ow2.org/core/1.2/services/reportprovider" xmlns:ns12="http://weblab.ow2.org/core/1.2/services/indexer">
         <resource xsi:type="ns6:Document" uri="weblab://folder-listener/file545844151145181" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <annotation uri="weblab://folder-listener/file545844151145181#a0">
               <data>
                  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:wp="http://weblab.ow2.org/core/1.2/ontology/processing#">
                     <rdf:Description rdf:about="weblab://folder-listener/file545844151145181" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/">
                        <wp:hasNativeContent rdf:resource="file:/home/weblab/data/content/weblab.2361463621386300376.content"/>
                        <wp:hasGatheringDate>2012-04-25T18:26:22+0200</wp:hasGatheringDate>
                        <wp:hasOriginalFileName>Letters.pdf</wp:hasOriginalFileName>
                        <wp:hasOriginalFileSize>505492</wp:hasOriginalFileSize>
                        <dc:source>/home/weblab/toIndex/Letters.pdf</dc:source>
                        <dcterms:extent>505492 bytes</dcterms:extent>
                        <dcterms:modified>2010-11-04T11:36:42+0100</dcterms:modified>
                     </rdf:Description>
                  </rdf:RDF>
               </data>
            </annotation>
         </resource>
      </ns9:nextResourceReturn>
   </soap:Body>
</soap:Envelope>
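
A client usually only needs a few of the annotated values from such a response. The Java sketch below reads one property of the processing ontology namespace with the JDK DOM parser. The class NextResourceResponseSketch and its method are hypothetical; in a real WebLab client the model and annotator libraries would normally be used instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper; not part of folder-listener.
class NextResourceResponseSketch {

    // Namespace bound to the wp prefix in the output sample.
    static final String WP = "http://weblab.ow2.org/core/1.2/ontology/processing#";

    // Returns the text of the first wp:<localName> element, or null if there is none.
    static String firstProcessingValue(String soapXml, String localName) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getElementsByTagNameNS
        Document doc = factory.newDocumentBuilder()
            .parse(new ByteArrayInputStream(soapXml.getBytes(StandardCharsets.UTF_8)));
        NodeList nodes = doc.getElementsByTagNameNS(WP, localName);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Shortened stand-in for the response; only the element of interest is kept.
        String sample = "<data xmlns:wp=\"" + WP + "\">"
            + "<wp:hasOriginalFileName>Letters.pdf</wp:hasOriginalFileName>"
            + "</data>";
        System.out.println(firstProcessingValue(sample, "hasOriginalFileName")); // Letters.pdf
    }
}
```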

Output From WarcListener

Output From WebLabResourceListener

Known Limitations

N/A

Dependencies

List of all dependencies of this service:

org.ow2.weblab.webservices:folder-listener:war:1.1
+- commons-io:commons-io:jar:2.0.1:compile
+- org.archive.heritrix:heritrix-modules:jar:3.1.0:compile
|  +- org.archive.heritrix:heritrix-commons:jar:3.1.0:compile
|  |  +- org.archive.overlays:archive-overlay-commons-httpclient:jar:3.1:compile
|  |  +- com.sleepycat:je:jar:4.1.6:compile
|  |  +- commons-lang:commons-lang:jar:2.6:compile
|  |  +- commons-net:commons-net:jar:2.0:compile
|  |  +- commons-codec:commons-codec:jar:1.5:compile (version managed from 1.3)
|  |  +- commons-collections:commons-collections:jar:3.1:compile
|  |  +- commons-cli:commons-cli:jar:1.2:compile (version managed from 1.1)
|  |  +- net.htmlparser.jericho:jericho-html:jar:2.6.1:compile
|  |  +- org.dnsjava:dnsjava:jar:2.0.3:compile
|  |  +- poi:poi:jar:2.5.1-final-20040804:compile
|  |  +- poi:poi-scratchpad:jar:2.5.1-final-20040804:compile
|  |  +- com.lowagie:itext:jar:1.3:compile
|  |  +- fastutil:fastutil:jar:5.0.7:compile
|  |  +- org.gnu.inet:libidn:jar:0.6.5:compile
|  |  +- net.java.dev.jets3t:jets3t:jar:0.5.0:compile
|  |  +- it.unimi.dsi:mg4j:jar:1.0.1:compile
|  |  +- com.anotherbigidea:javaswf:jar:CVS-SNAPSHOT-1:compile
|  |  +- org.springframework:spring-core:jar:3.0.5.RELEASE:compile
|  |  +- org.springframework:spring-beans:jar:3.0.5.RELEASE:compile
|  |  +- org.springframework:spring-context:jar:3.0.5.RELEASE:compile
|  |  |  \- org.springframework:spring-aop:jar:3.0.5.RELEASE:compile
|  |  +- org.springframework:spring-asm:jar:3.0.5.RELEASE:compile
|  |  +- org.springframework:spring-expression:jar:3.0.5.RELEASE:compile
|  |  +- org.json:json:jar:20090211:compile
|  |  +- com.esotericsoftware:kryo:jar:1.01:compile
|  |  +- com.esotericsoftware:reflectasm:jar:0.8:runtime
|  |  +- com.esotericsoftware:minlog:jar:1.2:runtime
|  |  +- com.google.guava:guava:jar:r08:compile
|  |  \- net.java.dev.jna:jna:jar:3.2.3:compile
|  +- org.beanshell:bsh:jar:2.0b5:compile
|  \- org.codehaus.groovy:groovy-all:jar:1.6.3:compile
|     +- org.apache.ant:ant:jar:1.7.1:compile
|     |  \- org.apache.ant:ant-launcher:jar:1.7.1:compile
|     \- jline:jline:jar:0.9.94:compile
+- org.ow2.weblab.core:annotator:jar:1.2.4:compile
|  \- joda-time:joda-time:jar:1.6.2:compile
+- org.ow2.weblab.components:content-manager:jar:1.9:compile
+- log4j:log4j:jar:1.2.16:compile
+- org.ow2.weblab.core:model:jar:1.2.2:compile
+- org.ow2.weblab.core:extended:jar:1.2.2:compile
+- org.apache.cxf:cxf-rt-frontend-jaxws:jar:2.4.0:compile
|  +- xml-resolver:xml-resolver:jar:1.2:compile
|  +- asm:asm:jar:3.3:compile
|  +- org.apache.cxf:cxf-api:jar:2.4.0:compile
|  |  +- org.apache.cxf:cxf-common-utilities:jar:2.4.0:compile
|  |  +- org.apache.ws.xmlschema:xmlschema-core:jar:2.0:compile
|  |  \- org.apache.neethi:neethi:jar:3.0.0:compile
|  |     +- wsdl4j:wsdl4j:jar:1.6.2:compile
|  |     \- org.codehaus.woodstox:woodstox-core-asl:jar:4.1.1:compile
|  |        \- org.codehaus.woodstox:stax2-api:jar:3.0.2:compile
|  +- org.apache.cxf:cxf-rt-core:jar:2.4.0:compile
|  |  +- com.sun.xml.bind:jaxb-impl:jar:2.1.13:compile
|  |  \- org.apache.geronimo.specs:geronimo-javamail_1.4_spec:jar:1.7.1:compile
|  +- org.apache.cxf:cxf-rt-bindings-soap:jar:2.4.0:compile
|  |  +- org.apache.cxf:cxf-tools-common:jar:2.4.0:compile
|  |  \- org.apache.cxf:cxf-rt-databinding-jaxb:jar:2.4.0:compile
|  +- org.apache.cxf:cxf-rt-bindings-xml:jar:2.4.0:compile
|  +- org.apache.cxf:cxf-rt-frontend-simple:jar:2.4.0:compile
|  \- org.apache.cxf:cxf-rt-ws-addr:jar:2.4.0:compile
+- org.apache.cxf:cxf-rt-transports-http:jar:2.4.0:compile
|  +- org.apache.cxf:cxf-rt-transports-common:jar:2.4.0:compile
|  \- org.springframework:spring-web:jar:3.0.5.RELEASE:compile
|     \- aopalliance:aopalliance:jar:1.0:compile
+- xalan:xalan:jar:2.7.1:compile
|  \- xalan:serializer:jar:2.7.1:compile
|     \- xml-apis:xml-apis:jar:1.3.04:compile
+- commons-logging:commons-logging:jar:1.1.1:compile
+- junit:junit:jar:4.8.2:test (scope not updated to compile)
\- javax.servlet:servlet-api:jar:2.4:provided