Corpus manager (indra_world.service.corpus_manager)

This module runs one-off assembly on a set of DART records (i.e., reader outputs) to produce a ‘seed corpus’ that can be dumped to S3 and loaded into CauseMos.

class indra_world.service.corpus_manager.CorpusManager(db_url, dart_records, corpus_id, metadata, dart_client=None, tenant=None, ontology=None)

Corpus manager class for running assembly on a set of DART records.

assemble()

Run assembly on the prepared statements.

This function loads all the prepared statements associated with the corpus and then runs assembly on them.

dump_local(base_folder, causemos_compatible=True)

Dump the assembled corpus into local files.

dump_s3()

Dump the assembled corpus onto S3.

prepare(records_exist=False)

Run the preprocessing pipeline on statements.

This function adds the new corpus to the DB and adds the records to that corpus. It then processes the reader outputs for those records into statements, preprocesses the statements, and stores the resulting prepared statements in the DB.
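The method descriptions above imply an order of operations: prepare() stores prepared statements, assemble() loads and assembles them, and the dump methods write out the assembled corpus. The following is a minimal sketch of that sequence, not the library's own driver code. StubManager and build_seed_corpus are hypothetical names introduced only for illustration; the stub records calls instead of touching a database, so the sketch runs without indra_world installed. With a real CorpusManager instance, the same helper would be called in its place.

```python
class StubManager:
    """Hypothetical stand-in for CorpusManager that only records calls."""

    def __init__(self):
        self.calls = []

    def prepare(self, records_exist=False):
        self.calls.append(("prepare", records_exist))

    def assemble(self):
        self.calls.append(("assemble",))

    def dump_local(self, base_folder, causemos_compatible=True):
        self.calls.append(("dump_local", base_folder, causemos_compatible))


def build_seed_corpus(manager, base_folder, records_exist=False):
    """Run the documented steps in order: prepare, assemble, dump locally.

    `manager` is assumed to expose the prepare/assemble/dump_local
    methods with the signatures listed in this reference.
    """
    manager.prepare(records_exist=records_exist)
    manager.assemble()
    manager.dump_local(base_folder)


# Demonstrate the call order with the stub; the folder path is made up.
stub = StubManager()
build_seed_corpus(stub, "/tmp/seed_corpus")
print(stub.calls)
```

Swapping dump_local for dump_s3() at the last step would follow the same pattern, assuming S3 access is configured.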

indra_world.service.corpus_manager.download_corpus(corpus_id, fname)

Download a given corpus of assembled statements from S3.

Parameters:
  • corpus_id (str) – The ID of the corpus.

  • fname (str) – The file in which the downloaded corpus should be written.

Return type:

None

indra_world.service.corpus_manager.get_corpus_index()

Return the corpus index as a list of tuples with corpus IDs and dates.
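One plausible use of the index is to pick the most recent corpus before calling download_corpus. The sketch below works on a hand-written list rather than a real index. It assumes each tuple is ordered (corpus ID, date) and that dates are ISO 8601 strings; neither detail is confirmed by this reference. latest_corpus_id and the sample entries are hypothetical.

```python
from datetime import datetime


def latest_corpus_id(index):
    """Return the ID of the entry with the most recent date, or None.

    Assumes `index` is a list of (corpus_id, iso_date_string) tuples.
    """
    if not index:
        return None
    corpus_id, _ = max(
        index, key=lambda entry: datetime.fromisoformat(entry[1])
    )
    return corpus_id


# Made-up sample data standing in for the output of get_corpus_index().
sample_index = [
    ("corpus_a", "2021-03-01T12:00:00"),
    ("corpus_b", "2021-06-15T08:30:00"),
    ("corpus_c", "2021-01-20T17:45:00"),
]

latest = latest_corpus_id(sample_index)
print(latest)
```

The selected ID could then be passed as the corpus_id argument of download_corpus, together with a target file name.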