# Dockerized INDRA World service
This folder contains files to run the INDRA World service through Docker containers. It also provides files to build them locally in case customizations are needed.
## Running the integrated service
A docker-compose file defines how the service image and DB image need to be
run. The docker-compose file refers to two images (`indralab/indra_world` and
`indralab/indra_world_db`), both available publicly on Dockerhub. This means
that they are automatically pulled when running

```sh
docker-compose up
```

unless they are already available locally. To launch the service, run

```sh
docker-compose up -d
```

where the optional `-d` flag runs the containers in the background.
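For orientation, the docker-compose file pairs the two images roughly as in the sketch below. This is an illustrative sketch, not the shipped file: the service name `db` matches the host name in the `INDRA_WM_SERVICE_DB` URL described below, while the port mapping is an assumption based on the dashboard being served at `localhost:8001`.

```yaml
version: "3"
services:
  db:
    image: indralab/indra_world_db
    env_file: indra_world_db.env
  indra_world:
    image: indralab/indra_world
    env_file: indra_world.env
    ports:
      - "8001:8001"   # assumed mapping; the dashboard appears at localhost:8001
    depends_on:
      - db
```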
Two files containing environment variables, one for each container, need to be
created with the following names and content:

`indra_world.env`

```
INDRA_WM_SERVICE_DB=postgresql://postgres:mysecretpassword@db:5432
DART_WM_URL=<DART URL>
DART_WM_USERNAME=<DART username>
DART_WM_PASSWORD=<DART password>
AWS_ACCESS_KEY_ID=<AWS account key ID, necessary if assembled outputs need to be dumped to S3 for CauseMos>
AWS_SECRET_ACCESS_KEY=<AWS account secret key, necessary if assembled outputs need to be dumped to S3 for CauseMos>
AWS_REGION=us-east-1
INDRA_WORLD_ONTOLOGY_URL=<GitHub URL to the ontology being used, only necessary if DART is not used>
LOCAL_DEPLOYMENT=1
```
Above, `LOCAL_DEPLOYMENT` should only be set if the service is intended to be
run on, and accessed from, localhost. It enables the assembly dashboard app at
http://localhost:8001/dashboard, which can write assembled corpus output to the
container's disk (the output location can either be mounted to correspond to a
host folder, or files can be copied to the host using `docker cp`).
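Either approach can be done with standard Docker commands; the paths and the container name below are hypothetical placeholders, not values defined by this repository.

```sh
# Option 1: mount a host folder over the container's output location
# (hypothetical paths) so dashboard output lands directly on the host
docker run -v /home/user/output:/output indralab/indra_world

# Option 2: copy files out of an already-running container after the fact;
# "dockerfiles_indra_world_1" is a typical docker-compose container name
docker cp dockerfiles_indra_world_1:/output/corpus.json .
```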
`indra_world_db.env`

```
POSTGRES_PASSWORD=mysecretpassword
PGDATA=/var/lib/postgresql/pgdata
```
Note that, if necessary, the default `POSTGRES_PASSWORD=mysecretpassword`
setting can be changed using standard `psql` commands in the `indra_world_db`
container and the result then committed to an image.
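A sketch of that process is shown below; the container name is a hypothetical docker-compose default and the new password is a placeholder.

```sh
# Change the password inside the running DB container using psql
docker exec -it dockerfiles_indra_world_db_1 \
    psql -U postgres -c "ALTER USER postgres WITH PASSWORD 'newpassword';"

# Save the container's current state as a new image
docker commit dockerfiles_indra_world_db_1 indra_world_db:latest
```

Remember to update `INDRA_WM_SERVICE_DB` in `indra_world.env` to match the new password.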
## Building the Docker images locally
As described above, the two necessary Docker images are available on Dockerhub; the following steps are therefore only needed if local changes to the images (beyond what can be controlled through environment variables) are required.
### Building the INDRA World service image
To build the `indra_world` Docker image, run

```sh
docker build --tag indra_world:latest .
```
### Initializing the INDRA World DB image
To create the `indra_world_db` Docker image from scratch, run

```sh
./initialize_db_image.sh
```
Note that this script requires the Python dependencies of INDRA World to be available in the local environment.
## Using the public INDRA World API
The API is deployed and documented at [wm.indra.bio](http://wm.indra.bio).
## Cloud-based CauseMos integration via S3
Access to the INDRA-assembled corpora requires credentials for the shared World Modelers S3 bucket `world-modelers`. Each INDRA-assembled corpus is available within this bucket under the `indra_models` key base, and each corpus is identified by a string identifier.
### The corpus index
The list of corpora can be obtained either by using S3's list-objects function or by reading the `index.csv` file maintained by INDRA. This index is a comma-separated values text file containing one row per corpus. The first element of each row is the corpus identifier, and the second is the UTC date-time at which the corpus was uploaded to S3. An example row in this file looks as follows:

```
test1_newlines,2020-05-08-22-34-29
```

where `test1_newlines` is the corpus identifier and `2020-05-08-22-34-29` is the upload date-time.
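To make the index format concrete, the sketch below parses rows of this shape into (identifier, upload time) pairs. The index content is inlined here for illustration; in practice it would be fetched from the S3 bucket first.

```python
import csv
from datetime import datetime
from io import StringIO

# Example index content in the format described above
index_text = "test1_newlines,2020-05-08-22-34-29\n"

corpora = []
for corpus_id, timestamp in csv.reader(StringIO(index_text)):
    # Upload times follow a YYYY-MM-DD-HH-MM-SS pattern (UTC)
    uploaded = datetime.strptime(timestamp, "%Y-%m-%d-%H-%M-%S")
    corpora.append((corpus_id, uploaded))

print(corpora[0][0])  # test1_newlines
```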
### Structure of each corpus
Within the `world-modelers` bucket, under the `indra_models` key base, the files for each corpus are organized under a subkey equal to the corpus identifier; for instance, all the files for the `test1_newlines` corpus are under the `indra_models/test1_newlines/` key base. The files for each corpus are as follows:
- `statements.json`: a JSON dump of assembled INDRA Statements. Each statement's JSON representation is on a separate line in this file. This is the main file that CauseMos needs to ingest for UI interaction.
- `metadata.json`: a JSON file containing key-value pairs that describe the corpus. The standard keys in this file are as follows:
  - `corpus_id`: the ID of the corpus (redundant with the corresponding entry in the index).
  - `description`: a human-readable description of how the corpus was obtained.
  - `display_name`: a human-readable display name for the corpus.
  - `readers`: a list of the names of the reading systems from which statements in the corpus were obtained.
  - `assembly`: a dictionary identifying attributes of the assembly process with the following keys:
    - `level`: the level of resolution used to assemble the corpus (e.g., "location_and_time").
    - `grounding_threshold`: the threshold (if any) that was used to filter statements by grounding score (e.g., 0.7).
  - `num_statements`: the number of assembled INDRA Statements in the corpus (i.e., in `statements.json`).
  - `num_documents`: the number of documents that were read by the readers to produce the statements that were assembled.
  - `tenant`: if DART is used, a corpus is typically associated with a tenant (i.e., a user or an institution); this field provides the tenant ID.
Note that any of these keys may be missing if the corresponding information is unavailable, for instance, in the case of old uploads.
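The JSON-lines layout of `statements.json` and the key-value layout of `metadata.json` can be handled as sketched below. The statement and metadata contents are illustrative stand-ins; in practice both files would be downloaded from S3 (e.g., with boto3) rather than built in memory.

```python
import json

# Illustrative stand-in for statements.json: one statement object per line
statements_jsonl = (
    '{"type": "Influence", "subj": {"name": "rainfall"}, "obj": {"name": "flooding"}}\n'
    '{"type": "Influence", "subj": {"name": "drought"}, "obj": {"name": "migration"}}\n'
)
statements = [json.loads(line) for line in statements_jsonl.splitlines() if line]

# Illustrative stand-in for metadata.json: a single JSON object
metadata = json.loads('{"corpus_id": "test1_newlines", "num_statements": 2}')

# Any of the documented metadata keys may be missing, so use .get() with defaults
print(len(statements), metadata.get("num_statements"))
print(metadata.get("tenant", "no tenant recorded"))
```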