
DataHubGc

Support Status: Testing

DataHubGcSource is responsible for performing garbage collection tasks on DataHub.

This source performs the following tasks (see the example recipe after the list):

  1. Cleans up expired tokens.
  2. Truncates Elasticsearch indices based on configuration.
  3. Cleans up data processes and soft-deleted entities if configured.
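For orientation, the sketch below shows a minimal recipe that enables all three tasks. It is a sketch under assumptions, not a canonical recipe: it assumes the standard source/config recipe layout, the field names are taken from the Config Details table further down, and whether a sink section is also needed depends on your deployment.

```yaml
# Minimal datahub-gc recipe sketch (assumed layout; adjust for your deployment).
source:
  type: datahub-gc
  config:
    cleanup_expired_tokens: true        # task 1: remove expired tokens
    truncate_indices: true              # task 2: truncate eligible Elasticsearch indices
    truncate_index_older_than_days: 30
    dataprocess_cleanup:                # task 3: clean up data processes ...
      retention_days: 10
    soft_deleted_entities_cleanup:      # ... and soft-deleted entities
      retention_days: 10
```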

CLI-based Ingestion

Install the Plugin

The datahub-gc source works out of the box with acryl-datahub.

Config Details

Note that a . is used to denote nested fields in the YAML recipe.
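For example, a dotted field such as dataprocess_cleanup.retention_days in the table corresponds to a nested block in the recipe:

```yaml
source:
  type: datahub-gc
  config:
    dataprocess_cleanup:
      retention_days: 10   # documented as dataprocess_cleanup.retention_days
```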

| Field | Type | Description | Default |
|-------|------|-------------|---------|
| cleanup_expired_tokens | boolean | Whether to clean up expired tokens | True |
| dry_run | boolean | Whether to perform a dry run. Only supported for data process cleanup and soft-deleted entities cleanup. | False |
| truncate_index_older_than_days | integer | Indices older than this number of days will be truncated | 30 |
| truncate_indices | boolean | Whether to truncate Elasticsearch indices that can be safely truncated | True |
| truncation_sleep_between_seconds | integer | Seconds to sleep between truncation monitoring checks | 30 |
| truncation_watch_until | integer | Wait for index truncation until this number of documents remain | 10000 |
| dataprocess_cleanup | DataProcessCleanupConfig | Configuration for data process cleanup | |
| dataprocess_cleanup.batch_size | integer | The number of entities to fetch per batch from GraphQL | 500 |
| dataprocess_cleanup.delay | number | Delay between each batch | 0.25 |
| dataprocess_cleanup.delete_empty_data_flows | boolean | Whether to delete Data Flows without runs | True |
| dataprocess_cleanup.delete_empty_data_jobs | boolean | Whether to delete Data Jobs without runs | True |
| dataprocess_cleanup.hard_delete_entities | boolean | Whether to hard delete entities | False |
| dataprocess_cleanup.keep_last_n | integer | Number of latest aspects to keep | 5 |
| dataprocess_cleanup.max_workers | integer | The number of workers to use for deletion | 10 |
| dataprocess_cleanup.retention_days | integer | Number of days to retain metadata in DataHub | 10 |
| dataprocess_cleanup.aspects_to_clean | array(string) | List of aspect names to clean up | ['DataprocessInstance'] |
| soft_deleted_entities_cleanup | SoftDeletedEntitiesCleanupConfig | Configuration for soft-deleted entities cleanup | |
| soft_deleted_entities_cleanup.batch_size | integer | The number of entities to fetch per batch from GraphQL | 500 |
| soft_deleted_entities_cleanup.delay | number | Delay between each batch | 0.25 |
| soft_deleted_entities_cleanup.max_workers | integer | The number of workers to use for deletion | 10 |
| soft_deleted_entities_cleanup.platform | string | Platform to clean up | |
| soft_deleted_entities_cleanup.query | string | Query to filter entities | |
| soft_deleted_entities_cleanup.retention_days | integer | Number of days to retain metadata in DataHub | 10 |
| soft_deleted_entities_cleanup.env | string | Environment to clean up | |
| soft_deleted_entities_cleanup.entity_types | array(string) | List of entity types to clean up | |
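Putting the table together, a fuller recipe sketch might look like the following. The values shown are the documented defaults except where noted; dry_run applies only to the two cleanup sub-configurations, and the platform, env, and entity_types filter values are illustrative placeholders, not recommendations.

```yaml
source:
  type: datahub-gc
  config:
    dry_run: true                        # affects only the two cleanup blocks below
    cleanup_expired_tokens: true
    truncate_indices: true
    truncate_index_older_than_days: 30
    truncation_sleep_between_seconds: 30
    truncation_watch_until: 10000
    dataprocess_cleanup:
      retention_days: 10
      keep_last_n: 5
      delete_empty_data_flows: true
      delete_empty_data_jobs: true
      hard_delete_entities: false
      batch_size: 500
      delay: 0.25
      max_workers: 10
    soft_deleted_entities_cleanup:
      retention_days: 10
      batch_size: 500
      delay: 0.25
      max_workers: 10
      platform: snowflake                # illustrative placeholder; use your platform name
      env: PROD                          # illustrative placeholder environment
      entity_types:
        - dataset                        # illustrative placeholder entity type
```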

Code Coordinates

  • Class Name: datahub.ingestion.source.gc.datahub_gc.DataHubGcSource
  • Browse on GitHub

Questions

If you've got any questions on configuring ingestion for DataHubGc, feel free to ping us on our Slack.