Dockerized INDRA World service

This folder contains files to run the INDRA World service through Docker containers. It also provides files to build the images locally in case customization is needed.

Running the integrated service

A docker-compose file defines how the service image and DB image are run. The docker-compose file refers to two images (indralab/indra_world and indralab/indra_world_db), both publicly available on Docker Hub. This means that they are automatically pulled when running docker-compose up unless they are already available locally.
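
For orientation, the following is a minimal sketch of what such a docker-compose file can look like; the actual file in this folder is authoritative, and the port mapping and startup ordering shown here are illustrative assumptions.

# Sketch only; see the docker-compose file in this folder for the real setup
version: "3"
services:
  # The DB container, reachable from the service container under the host name db
  db:
    image: indralab/indra_world_db
    env_file: indra_world_db.env
  # The INDRA World service container
  indra_world:
    image: indralab/indra_world
    env_file: indra_world.env
    # Port 8001 matches the dashboard URL mentioned below (assumed mapping)
    ports:
      - "8001:8001"
    depends_on:
      - db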

To launch the service, run

docker-compose up -d

where the optional -d flag runs the containers in the background.

Two files containing environment variables, one for each container, need to be created with the following names and content:

indra_world.env

INDRA_WM_SERVICE_DB=postgresql://postgres:mysecretpassword@db:5432
DART_WM_URL=<DART URL>
DART_WM_USERNAME=<DART username>
DART_WM_PASSWORD=<DART password>
AWS_ACCESS_KEY_ID=<AWS account key ID, necessary if assembled outputs need to be dumped to S3 for CauseMos>
AWS_SECRET_ACCESS_KEY=<AWS account secret key, necessary if assembled outputs need to be dumped to S3 for CauseMos>
AWS_REGION=us-east-1
INDRA_WORLD_ONTOLOGY_URL=<GitHub URL to the ontology being used; only necessary if DART is not used>
LOCAL_DEPLOYMENT=1

Above, LOCAL_DEPLOYMENT should only be set if the service is intended to be run on and accessed from localhost. This enables the assembly dashboard app at http://localhost:8001/dashboard, which can write assembled corpus output to the container’s disk (the output folder can either be mounted to correspond to a host folder, or files can be copied to the host using docker cp, as shown below).
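
For example, assuming the service container is named indra_world (check docker ps for the actual name) and that output is written under /sw/output (an illustrative path), files can be copied to the host as follows

# Find the actual name of the running service container
docker ps
# Copy the output folder from the container to the host
# (both the container name and the path are illustrative assumptions)
docker cp indra_world:/sw/output ./output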

indra_world_db.env

POSTGRES_PASSWORD=mysecretpassword
PGDATA=/var/lib/postgresql/pgdata

Note that if necessary, the default POSTGRES_PASSWORD=mysecretpassword setting can be changed using standard psql commands in the indra_world_db container and then committed to an image.
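
As an illustrative sketch (the container name indra_world_db and the new password are assumptions), this can be done as follows

# Change the postgres user's password in the running DB container
docker exec indra_world_db psql -U postgres -c "ALTER USER postgres WITH PASSWORD 'newpassword';"
# Commit the modified container as a new image
docker commit indra_world_db indralab/indra_world_db:custom

If the password is changed, the INDRA_WM_SERVICE_DB connection string in indra_world.env needs to be updated to match.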

Building the Docker images locally

As described above, the two necessary Docker images are available on Docker Hub, therefore the following steps are only necessary if local changes to the images (beyond what can be controlled through environment variables) are needed.

Building the INDRA World service image

To build the indra_world Docker image, run

docker build --tag indra_world:latest .
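
To use the locally built image with the provided docker-compose file, it can be tagged to match the image name the compose file refers to

docker tag indra_world:latest indralab/indra_world:latest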

Initializing the INDRA World DB image

To create the indra_world_db Docker image from scratch, run

./initialize_db_image.sh

Note that this requires the Python dependencies of INDRA World to be available in the local environment.

Using the public INDRA World API

The API is deployed and documented at wm.indra.bio.

Cloud-based CauseMos integration via S3

Access to the INDRA-assembled corpora requires credentials to the shared World Modelers S3 bucket “world-modelers”. Each INDRA-assembled corpus is available within this bucket, under the “indra_models” key base. Each corpus is identified by a string identifier.

The corpus index

The list of corpora can be obtained either using S3’s list objects function or by reading the index.csv file, which is maintained by INDRA. This index is a comma-separated values text file which contains one row for each corpus. Each row’s first element is a corpus identifier, and the second element is the UTC date-time at which the corpus was uploaded to S3. An example row in this file looks as follows

test1_newlines,2020-05-08-22-34-29

where test1_newlines is the corpus identifier and 2020-05-08-22-34-29 is the upload date-time.
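
As a minimal sketch, assuming locally configured AWS credentials with access to the bucket and that the index is stored under the indra_models/index.csv key (the exact key is an assumption), the index can be read using boto3 as follows

import boto3

# Connect to S3 using locally configured AWS credentials
s3 = boto3.client('s3')
# Fetch the corpus index; the exact key is an assumption
obj = s3.get_object(Bucket='world-modelers', Key='indra_models/index.csv')
rows = obj['Body'].read().decode('utf-8').splitlines()
# Each row is: <corpus identifier>,<UTC upload date-time>
for row in rows:
    if not row:
        continue
    corpus_id, upload_time = row.split(',')
    print(corpus_id, upload_time)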

Structure of each corpus

Within the world-modelers bucket, under the indra_models key base, files for each corpus are organized under a subkey equivalent to the corpus identifier; for instance, all the files for the test1_newlines corpus are under the indra_models/test1_newlines/ key base. The list of files for each corpus is as follows

  • statements.json: a JSON dump of assembled INDRA Statements. Each statement’s JSON representation is on a separate line in this file (see the loading sketch after this list). This is the main file that CauseMos needs to ingest for UI interaction.

  • metadata.json: a JSON file containing key-value pairs that describe the corpus. The standard keys in this file are as follows:

    • corpus_id: the ID of the corpus (redundant with the corresponding entry in the index).

    • description: a human-readable description of how the corpus was obtained.

    • display_name: a human-readable display name for the corpus.

    • readers: a list of the names of the reading systems from which statements were obtained in the corpus.

    • assembly: a dictionary identifying attributes of the assembly process with the following keys:

      • level: the level of resolution used to assemble the corpus (e.g., “location_and_time”).

      • grounding_threshold: the threshold (if any) that was used to filter statements by grounding score (e.g., 0.7).

    • num_statements: the number of assembled INDRA Statements in the corpus (i.e., the number of lines in statements.json).

    • num_documents: the number of documents that were read by readers to produce the statements that were assembled.

    • tenant: if DART is used, a corpus is typically associated with a tenant (i.e., a user or an institution); this field provides the tenant ID.

Note that any of these keys may be missing if the corresponding information is unavailable, for instance, in the case of older uploads.
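
Putting the above together, a minimal sketch of loading one corpus from S3 might look as follows; it assumes AWS credentials with access to the bucket and uses INDRA's stmts_from_json utility to deserialize the statements.

import json
import boto3
from indra.statements import stmts_from_json

corpus_id = 'test1_newlines'
base = 'indra_models/%s/' % corpus_id
s3 = boto3.client('s3')

# Load the metadata describing the corpus
obj = s3.get_object(Bucket='world-modelers', Key=base + 'metadata.json')
metadata = json.loads(obj['Body'].read().decode('utf-8'))

# Load the statements: each line is one statement's JSON representation
obj = s3.get_object(Bucket='world-modelers', Key=base + 'statements.json')
lines = obj['Body'].read().decode('utf-8').splitlines()
stmts = stmts_from_json([json.loads(line) for line in lines if line])

print('Loaded %d statements from "%s"' %
      (len(stmts), metadata.get('display_name', corpus_id)))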