The WebLab Bundle gathers a coherent set of services and portlets around the WebLab platform to demonstrate its service integration and orchestration capabilities for unstructured document processing and retrieval. The bundle can crawl a local folder (toIndex), analyze text-based documents, index them, and offer access to them through a portal. Its processing capabilities are limited (only the default rules of the named-entity extraction engine are used), but it demonstrates a complete processing chain and eases the integration and testing of new components, either in the processing chain or in the user interface.
This bundle is regularly released on the Download page and built nightly with the latest services and portlets. The WebLab Team uses the latest Bundle to make sure that new services, portlets, and components satisfy the integration and compatibility rules.
WebLab Bundle integrates several platforms:
- an Apache Tomcat server to deploy application services
- a Liferay server to deploy user interface portlets
- Apache Camel, an EIP (Enterprise Integration Patterns) messaging framework running in Karaf, a lightweight OSGi container, to manage service chains
This bundle provides the following features:
- Desktop document processing
- WARC processing
- Metadata extraction
- Text and Metadata search
- Entity extraction (People, Organisation, Location)
- Annotated Document view
The WebLab Bundle allows you to process desktop files (Word, PDF, ...) and WARC files.
- simple-gazetteer (not used in the default processing chain)
The following processing chain is executed in the Bundle:
These services and chains are defined and configured in a Camel Context deployed in Karaf.
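To illustrate the idea of a processing chain, here is a minimal conceptual sketch in Python. It is not the WebLab API and not the Camel configuration shipped with the Bundle: the step functions, the `GAZETTEER` entries, and the document layout are all hypothetical stand-ins chosen for illustration. It only shows the general pattern of passing a document through an ordered list of services, each of which enriches it.

```python
from typing import Callable, Dict, List

# A document is modelled as a plain dictionary (hypothetical layout).
Document = Dict[str, object]
Step = Callable[[Document], Document]

# Toy gazetteer with made-up entries, standing in for a real entity extractor.
GAZETTEER = {"Paris": "Location", "Alice": "People"}


def extract_metadata(doc: Document) -> Document:
    """Toy metadata step: record the number of whitespace-separated words."""
    text = str(doc["text"])
    return {**doc, "metadata": {"word_count": len(text.split())}}


def extract_entities(doc: Document) -> Document:
    """Toy entity step: tag every gazetteer name that occurs in the text."""
    text = str(doc["text"])
    found = [(name, kind) for name, kind in GAZETTEER.items() if name in text]
    return {**doc, "entities": found}


def run_chain(doc: Document, steps: List[Step]) -> Document:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        doc = step(doc)
    return doc
```

In the Bundle itself, this ordering is expressed in the Camel Context rather than in application code, so a step can be added or replaced by changing the chain configuration.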
To install the WebLab Bundle:
1. Download the latest Bundle.
2. Unzip the archive named WebLab-Bundle-XXX.zip.
You're done!
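If you prefer to script the unzip step, the following is a minimal sketch using only the Python standard library. The function name `unpack_bundle` and its parameters are hypothetical; pass the path of the archive you actually downloaded, whatever version it carries.

```python
import zipfile
from pathlib import Path


def unpack_bundle(archive: Path, target_dir: Path) -> Path:
    """Extract a bundle archive into target_dir and return that directory."""
    # Fail early with a clear message if the download is missing or corrupted.
    if not zipfile.is_zipfile(archive):
        raise ValueError(f"{archive} is not a valid zip archive")
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as bundle:
        bundle.extractall(target_dir)
    return target_dir
```

Note that Python's `zipfile` module does not restore Unix permission bits, so on Linux/Mac any scripts extracted this way may need to be made executable afterwards. A regular `unzip` tool avoids this.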
The WebLab Bundle is controlled through the WebLab Launcher.
To start WebLab on Linux/Mac:
To start WebLab on Windows:
To stop WebLab on Linux/Mac:
To stop WebLab on Windows:
The WebLab Launcher has more advanced features, which are described in the Launcher documentation.
WebLab configuration files are located in the conf directory. Configuration options are detailed in the Launcher documentation.
The data directory provides some resources for testing the WebLab Bundle's functionality:
- the toIndex directory contains PDF files about WebLab
- the warcs directory contains a WARC file crawled from weblab-project.org
A typical use is:
- add documents (Word, PDF, ...) to the data/toIndex directory
- open your browser at http://localhost:8080
- click on the Search tab to search for your documents
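The first step above can be scripted. The sketch below copies documents from a folder of your choice into the toIndex directory, using only the Python standard library. The function name `stage_documents` and the `DOCUMENT_SUFFIXES` filter are assumptions made for illustration; the Bundle itself does not define them, and the set of file types it actually accepts may differ.

```python
import shutil
from pathlib import Path
from typing import List

# Hypothetical filter: adjust to the document types you want to index.
DOCUMENT_SUFFIXES = {".pdf", ".doc", ".docx"}


def stage_documents(source_dir: Path, to_index_dir: Path) -> List[Path]:
    """Copy matching documents from source_dir into to_index_dir.

    Returns the paths of the copies, in alphabetical order of file name.
    """
    to_index_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source_dir.iterdir()):
        # Skip sub-directories and files with other extensions.
        if path.is_file() and path.suffix.lower() in DOCUMENT_SUFFIXES:
            destination = to_index_dir / path.name
            shutil.copy2(path, destination)
            copied.append(destination)
    return copied
```

For example, `stage_documents(Path("my-docs"), Path("data/toIndex"))`, run from the Bundle directory, would copy the matching files from a `my-docs` folder (a placeholder name). The portal can then be opened with `webbrowser.open("http://localhost:8080")` or directly in your browser.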
Getting Started (Screencast)
This video demonstrates WebLab Bundle capabilities: