Bundle 2.0.0


The bundle gathers a coherent set of services and portlets around the WebLab platform to demonstrate its capabilities in service integration and orchestration for unstructured document processing and retrieval. The bundle can crawl a local folder (toIndex) to analyze text-based documents, index them, and finally offer access to them through a portal. The processing capabilities are limited (only the default rules of the named-entity extraction engine are used), but the bundle demonstrates a complete processing chain and eases the integration and testing of new components in either the processing chain or the user interface.

This bundle is regularly released on Download and built nightly with the latest services/portlets, see [1]. The WebLab Team uses the latest Bundle to make sure that new services, portlets, and components satisfy the integration rules and remain compatible.

Overview

To summarise, the WebLab Bundle is a simple demo for document processing. It can serve as a starting point to learn and understand the platform while enabling the development of more complex applications. It can process a set of documents and make them searchable through their text content and extracted metadata (either simple properties of the files or information extracted from the content).

This bundle provides the following features:

  • Desktop document processing
  • WARC (Web ARChive files resulting from crawls) processing
  • Metadata extraction
  • Entity extraction (People, Organisation, Location)
  • Text and Metadata search
  • Annotated Document view

The WebLab Bundle allows you to process desktop files (Word, PDF, etc.) and WARC files.

What's in the bundle

WebLab Bundle integrates several servers to support different components:

Included services to enable document processing:

Included portlets for user interface in web portal:

The following processing chain is executed in the Bundle:

[Figure: processing chain diagram (chaine.png)]

These services and chains are defined and configured in a Camel Context deployed in Karaf.

What's NOT in the bundle

To ease deployment and early tests as a "simple demo", not all components of WebLab are included in the bundle. Among the big "missing pieces" are:

  • a web crawler: various solutions exist, and crawling the web raises many legal difficulties (from private data management to plain copyright), so none is included in the bundle. The bundle is, however, compatible with the WARC file format, which is the standard archive format for web crawls.
  • advanced text processing such as translation, since such services are processing-intensive.
  • image processing services (same reason as above).
  • knowledge management services and user interface.
  • sharing and publication components.

Installation

1. Download the latest Bundle.
2. Unzip the archive named WebLab-Bundle-Karaf-XXX.zip

You're done!
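
The two installation steps can be sketched in shell as follows. This is a minimal sketch, not a script shipped with the bundle: the function name install_bundle and the target directory are hypothetical, the unzip tool is assumed to be installed, and XXX stands for whatever version your downloaded archive carries.

```shell
#!/bin/sh
# Hypothetical helper (not part of the bundle): unpack a downloaded
# bundle archive into a target directory.
# Usage: install_bundle ARCHIVE [TARGET_DIR]
install_bundle() {
  archive="$1"
  target="${2:-.}"   # default: unpack into the current directory

  # Refuse to continue if the archive was not downloaded first.
  if [ ! -f "$archive" ]; then
    echo "archive not found: $archive" >&2
    return 1
  fi

  # Create the target directory if needed, then unpack quietly into it.
  mkdir -p "$target" && unzip -q "$archive" -d "$target"
}

# Example call; XXX is the version part of the archive you downloaded,
# and "$HOME/weblab" is an arbitrary, assumed install location.
# install_bundle WebLab-Bundle-Karaf-XXX.zip "$HOME/weblab"
```

Checking for the archive before unpacking gives a clear error message when the download step was skipped or the file name was mistyped.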

Getting Started (Screencast)

This video demonstrates WebLab Bundle capabilities:

TODO: This part has to be completed.

Getting Started (step-by-step)

Starting WebLab

To start WebLab on Linux/Mac:

 weblab.sh start 
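
After starting, it can be handy to wait until the portal answers before opening a browser. The sketch below polls the portal address used later on this page (http://localhost:8080) with curl. The function wait_for_portal, its retry count, and its delay are hypothetical and not part of the WebLab Launcher; the sketch also assumes that curl is installed and that the start command returns while the servers are still starting.

```shell
#!/bin/sh
# Hypothetical helper (not part of the WebLab Launcher): poll a URL until
# it answers without an HTTP error, or give up after a number of attempts.
# Usage: wait_for_portal URL [ATTEMPTS] [DELAY_SECONDS]
wait_for_portal() {
  url="$1"
  attempts="${2:-30}"   # assumed default: 30 tries
  delay="${3:-2}"       # assumed default: 2 seconds between tries

  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP errors (status >= 400);
    # a refused connection also yields a non-zero exit status.
    if curl --silent --fail --output /dev/null "$url"; then
      echo "portal is up at $url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done

  echo "portal did not answer at $url" >&2
  return 1
}

# Example, run from the unpacked bundle directory:
# ./weblab.sh start && wait_for_portal http://localhost:8080
```

The attempt limit keeps the loop from hanging forever if a server fails to start; adjust it to the start-up time you observe on your machine.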

Using WebLab

A typical use is:

  1. add documents (Word, PDF, ...) to the data/toIndex directory; the system processes them automatically
  2. open your browser at http://localhost:8080
  3. log in to the demonstrator
  4. click on the Search tab to search for these documents
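
The first step above can be scripted. The following is a minimal sketch under stated assumptions: the function add_documents is hypothetical, and its first argument is assumed to be the directory where the bundle was unpacked, so that data/toIndex lies directly beneath it.

```shell
#!/bin/sh
# Hypothetical helper (not part of the bundle): copy documents into the
# folder that the bundle watches for new files to process.
# Usage: add_documents BUNDLE_DIR FILE...
add_documents() {
  bundle_dir="$1"
  shift
  inbox="$bundle_dir/data/toIndex"

  # The folder comes with the unpacked bundle; stop if it is missing.
  if [ ! -d "$inbox" ]; then
    echo "missing folder: $inbox" >&2
    return 1
  fi

  status=0
  for f in "$@"; do
    if cp -- "$f" "$inbox/"; then
      echo "queued: $(basename -- "$f")"
    else
      status=1   # remember the failure but keep copying the other files
    fi
  done
  return "$status"
}

# Example; both paths are assumptions for illustration:
# add_documents "$HOME/weblab" ~/Documents/report.pdf
```

Copying rather than moving leaves the originals untouched, which is convenient when the same files are reused across test runs.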

We provide some data resources allowing you to test the WebLab Bundle functionalities in the data directory.

  • the toIndex directory contains a set of desktop files (PDF publications about WebLab).
  • the warcs directory contains a WARC file crawled from the weblab-project.org website.

The first set is meant to mimic general document processing (such as local files or a shared folder on an intranet), while the second is similar to web data acquisition.

Stopping WebLab

To stop WebLab on Linux/Mac:

 weblab.sh stop 

Configuration

The WebLab Bundle is controlled through the WebLab Launcher. You can interact with the WebLab Launcher through the command line (weblab.sh under Linux/Mac and weblab.bat under Windows). Advanced features are documented here.

The most important WebLab configuration files are located in the conf directory. Configuration options are detailed in the Launcher documentation.

Next steps

You can learn about the WebLab architecture and specifications in the Developer Dashboard, or go through the documentation of the service bus deployed in the bundle.

You can dive quickly into tutorials to build upon the bundle.