Bundle 1.2.2

The bundle gathers a coherent set of services and portlets around the WebLab platform to demonstrate its capabilities in terms of service integration and orchestration for unstructured document processing and retrieval. The bundle crawls a local folder (toIndex), analyses the text-based documents it finds, indexes them and finally offers access to them through a portal. The processing capabilities are limited (only the default rules of the named-entity extraction engine are used), but the bundle provides a complete processing chain and eases the integration and testing of new components, either in the processing chain or in the user interface.

This bundle is regularly released on the Download page and built nightly with the latest services and portlets; see [1].

Content

This bundle presents an information retrieval system based on the complete WebLab architecture. It is mainly composed of the following WebLab services:

  • a homemade folder crawler able to listen to and crawl the content of a given folder: Folder Listener,
  • a normaliser that extracts the text content of various file types (ms-office, pdf, rtf, etc.) based on Apache Tika: Normaliser using Tika,
  • a named entities extraction service that detects known words in documents and annotates them, based on gazetteers: Simple Gazetteer,
  • an indexer that indexes the text content and makes it searchable, based on Apache SOLR: Solr Indexer.


In addition to these services, the bundle includes some technical services.

The demo also contains a WebLab chain that chains the previously mentioned services, and four WebLab portlets:

  • a launchCrawl portlet that launches and monitors the processing of documents with the chain,
  • a search portlet that launches queries on the SOLR searcher,
  • a result portlet that displays the results of the query,
  • an annotated document portlet that displays the document with the annotations added by the named entities extraction service.
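
To make the data flow of the chain easier to picture, here is a toy sketch in Python. It is not WebLab code and does not use the WebLab API: every function name, the sample gazetteer and the in-memory index are assumptions made for illustration. The real bundle delegates text extraction to Apache Tika and indexing to Solr, whereas this sketch only reads plain-text files and builds a dictionary.

```python
import re
from collections import defaultdict
from pathlib import Path

# Hypothetical gazetteer (surface form -> entity type), for illustration only.
GAZETTEER = {"weblab": "Project", "solr": "Software", "tika": "Software"}


def crawl(folder):
    """Yield every regular file under folder (stand-in for the Folder Listener)."""
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            yield path


def normalise(path):
    """Return the text of a file (the real normaliser relies on Apache Tika)."""
    return path.read_text(encoding="utf-8", errors="replace")


def annotate(text, gazetteer=GAZETTEER):
    """Return (start, end, entity type) for each gazetteer term found in text."""
    annotations = []
    for match in re.finditer(r"\w+", text):
        entity_type = gazetteer.get(match.group().lower())
        if entity_type is not None:
            annotations.append((match.start(), match.end(), entity_type))
    return annotations


def index(documents):
    """Build a word -> set of document names index (stand-in for the Solr Indexer)."""
    inverted = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            inverted[word].add(name)
    return inverted


def run_chain(folder):
    """Crawl, normalise, annotate and index every file found in folder."""
    texts, annotations = {}, {}
    for path in crawl(folder):
        text = normalise(path)
        texts[path.name] = text
        annotations[path.name] = annotate(text)
    return index(texts), annotations
```

Calling run_chain("toIndex") returns the index and the per-document annotations; looking up a lower-cased word in the index plays the role of the search portlet, and the annotation offsets are what an annotated document view would highlight.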

Prerequisite

  • JDK 1.6.25 or later must be installed to run the WebLab demo; JAVA_HOME must be declared and java must be available in your path.
  • Ports 8080, 8005, 8009 (for Liferay), 8181, 8105, 8109 (for Tomcat) and 8084, 7600, 7700, 7800, 7900 (for PEtALS) must be available on the computer that runs the WebLab.
  • The computer should have at least 3 GB of RAM to run the WebLab demo; 4 GB is recommended for more efficient processing. Remember that WebLab is a server application, not a desktop one.
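
These prerequisites can be checked before the first launch. The following Python sketch is not part of the bundle; the function names are hypothetical, and the check assumes that a port counts as available when a TCP socket can be bound to it on the loopback interface. It does not verify the JDK version or the amount of RAM.

```python
import os
import shutil
import socket

# Ports listed in the prerequisites, grouped by server.
REQUIRED_PORTS = {
    "Liferay": [8080, 8005, 8009],
    "Tomcat": [8181, 8105, 8109],
    "PEtALS": [8084, 7600, 7700, 7800, 7900],
}


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def check_prerequisites(environ=None):
    """Return a list of human-readable problems (empty if none were found)."""
    environ = os.environ if environ is None else environ
    problems = []
    if not environ.get("JAVA_HOME"):
        problems.append("JAVA_HOME is not set")
    if shutil.which("java", path=environ.get("PATH")) is None:
        problems.append("java is not on the PATH")
    for server, ports in REQUIRED_PORTS.items():
        for port in ports:
            if not port_is_free(port):
                problems.append(f"port {port} ({server}) is already in use")
    return problems
```

Printing each entry returned by check_prerequisites() before running the launch script gives a quick indication of what still needs fixing.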

Installation

Unzip the bundle archive; the installation is then complete.

The demo mainly uses the following ports: Liferay on 8080, Tomcat on 8181 and PEtALS on 8084. Other ports are used to handle shutdown/restart commands; however, since everything is installed on one machine, no specific network configuration is needed. Only port 8080 needs to be opened if you want to connect an external client to the machine that hosts the WebLab.

Launch

  1. Launch the script run.sh (Linux) or run.bat (Windows), depending on your OS (this may take several minutes).
  2. Open http://localhost:8080/ in your favorite browser.
  3. Log in with the email demo@weblab-project.org and the password demo.
  4. You can now launch an indexing run that will crawl and analyse the content of your "toIndex" folder.

By default, this folder contains sample content about the WebLab project. You can then search the acquired documents (for example, type weblab in the search field to find documents containing the word weblab).
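
Because startup can take several minutes, it may be convenient to wait until the portal answers before opening the browser. The Python sketch below is not part of the bundle; wait_for_portal is a hypothetical helper, and the timeout and polling interval are arbitrary assumptions. It treats any HTTP response, including an error status, as a sign that the server is up.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_portal(url="http://localhost:8080/", timeout=600.0, interval=5.0):
    """Poll url until it answers over HTTP; return False after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server replied, even if with an error status, so it is running.
            return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timed out or malformed reply: not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_for_portal() right after starting run.sh or run.bat returns True as soon as the portal responds, or False if nothing answers within the timeout.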

If you want to start a new indexing session from scratch, launch the script called reset.sh (Linux) or reset.bat (Windows). It clears the current index and repository and makes the WebLab ready for a new indexing session. You don't need to stop the application before calling a reset.

You can stop the application by running stop.sh (Linux) or stop.bat (Windows). Note that the servers may take several seconds to stop, depending on your hardware resources.

Going Further

You may want to go further by adding some Services or Portlets to this demonstration. To do so, you can use the WebLab Developer Dashboard to learn how to develop services, portlets and chains. You can also ask your questions on the WebLab mailing list: user@weblab-project.org

NB: The "toIndex" folder contains a file provided for testing purposes.