The bundle gathers a coherent set of services and portlets around the WebLab platform to demonstrate its capabilities in service integration and orchestration for unstructured document processing and retrieval. The bundle can crawl a local folder (toIndex) to analyze text-based documents, index them, and finally offer access to them through a portal. The processing capabilities are limited (only the default rules of the named-entity extraction engine are used), but the bundle demonstrates a complete processing chain and eases the integration and testing of new components, either in the processing chain or in the user interface.
This bundle is regularly released on the Download page and built nightly with the latest services and portlets. The WebLab Team uses the latest Bundle to make sure that new services, portlets, and components satisfy the integration and compatibility rules.
To summarise, the WebLab Bundle is a simple demo for document processing. It can serve as a starting point to learn and understand the platform while enabling the development of more complex applications. It can process a set of documents and make them searchable through their text content and extracted metadata (either simple properties of the files or information extracted from the content).
This bundle provides the following features:
- Desktop document processing
- WARC (Web ARChive files resulting from crawls) processing
- Metadata extraction
- Entity extraction (People, Organisation, Location)
- Text and Metadata search
- Annotated Document view
The WebLab Bundle allows you to process desktop files (Word, PDF, etc.) and WARC files.
What's in the bundle
The WebLab Bundle integrates several servers to support different components:
- an Apache Tomcat server to deploy application services
- a Liferay server to deploy user interface portlets
- a Camel EIP messaging layer running in Karaf, a lightweight OSGi container, to manage service chains. More information on this part of the platform is available in Bundle_2.0.0/ESB-Management
Included services to enable document processing:
- Tika-normaliser to convert native files into WebLab resources,
- Ngramj-language-extraction to detect language of text resources,
- Simple-gazetteer to extract named entities based on a simple dictionary (not used in the default processing chain),
- Gate-extraction to extract named entities using more advanced NLP,
- Solr-engine to index WebLab resources and search them,
- Simple-file-repository to store and retrieve the XML files of WebLab resources.
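To make the idea of dictionary-based entity extraction (the approach attributed to Simple-gazetteer above) more concrete, here is a minimal Python sketch. It is not the WebLab service's implementation or API: the `GAZETTEER` entries and the `extract_entities` function are hypothetical names introduced purely for illustration, and the entity types simply reuse the Person, Organisation, and Location categories listed in the features.

```python
import re

# Hypothetical dictionary (assumption): surface form -> entity type.
GAZETTEER = {
    "Paris": "Location",
    "Ada Lovelace": "Person",
    "United Nations": "Organisation",
}


def extract_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, surface form, type) for each dictionary match."""
    if not gazetteer:
        return []
    # Try longer entries first so multi-word names win over shorter ones.
    forms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, forms)) + r")\b")
    return [
        (m.start(), m.end(), m.group(), gazetteer[m.group()])
        for m in pattern.finditer(text)
    ]
```

Calling `extract_entities("Ada Lovelace visited Paris.")` yields one annotation per match, each carrying character offsets. A lookup like this only finds exact, case-sensitive dictionary entries, which is one reason a rule-based NLP engine can recognise more than a plain dictionary.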
Included portlets for user interface in web portal:
The following processing chain is executed in the Bundle:
These services and chains are defined and configured in a Camel Context deployed in Karaf.
What's NOT in the bundle
To ease its deployment and early tests as a "simple demo", not all components of WebLab are included in the bundle. Among the big "missing pieces" are:
- a web crawler. Various solutions exist, and crawling the web raises many legal difficulties (from private data management to plain copyright), so no crawler is included in the bundle. The bundle is, however, compatible with the WARC file format, which is the standard archive format for web crawls.
- advanced text processing such as translation since these are processing intensive.
- image processing services (same reason as above).
- knowledge management services and user interface.
- sharing and publication components.
To install the WebLab Bundle:
1. Download the latest Bundle.
2. Unzip the archive named WebLab-Bundle-Karaf-XXX.zip
Getting Started (Screencast)
This video demonstrates WebLab Bundle capabilities:
TODO: This part has to be completed.
Getting Started (step-by-step)
To start WebLab on Linux/Mac:
A typical use is:
- add documents (Word, PDF, ...) to the data/toIndex directory so that the system processes them automatically
- open your browser at http://localhost:8080
- log in to the demonstrator
- click on the Search tab to search for these documents
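The first two steps of this typical use can also be scripted. The following Python sketch copies files into the data/toIndex directory and checks whether something answers at http://localhost:8080. The function names and the `bundle_home` parameter are hypothetical; the sketch assumes `bundle_home` is the directory where the archive was unzipped, and it does not log in or run a search.

```python
import shutil
import urllib.error
import urllib.request
from pathlib import Path


def queue_documents(sources, bundle_home):
    """Copy existing files into <bundle_home>/data/toIndex; return the copies."""
    to_index = Path(bundle_home) / "data" / "toIndex"
    to_index.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, sources):
        if src.is_file():  # silently skip missing paths and directories
            copied.append(Path(shutil.copy2(src, to_index / src.name)))
    return copied


def portal_is_up(url="http://localhost:8080", timeout=5):
    """Return True if an HTTP server answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, even if with an error status
    except OSError:  # covers URLError, refused connections, and timeouts
        return False
```

For example, `queue_documents(["report.pdf"], "/path/to/bundle")` followed by `portal_is_up()` gives a quick check that the documents were dropped in place and that the portal is reachable before opening the browser; both paths here are placeholders.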
The data directory contains some sample resources that let you test the WebLab Bundle functionalities.
- the toIndex directory contains a set of desktop files (PDF publications about WebLab).
- the warcs directory contains a WARC file crawled from the weblab-project.org website.
The first set is meant to mimic general document processing (such as local files or a shared folder on an intranet), while the second is similar to web data acquisition.
To stop WebLab on Linux/Mac:
The WebLab Bundle is controlled through the WebLab Launcher. You can interact with the WebLab Launcher through the command line (weblab.sh under Linux/Mac and weblab.bat under Windows). Advanced features are documented here.
The most important WebLab configuration files are located in the conf directory. Configuration options are detailed in the Launcher documentation.
You can dive quickly into tutorials to build upon the bundle.