Configuration

Configuration is done via YAML or JSON files or HTTP API resources. If no configuration file is passed, Logprep looks for the file /etc/logprep/pipeline.yml.

You can pass multiple configuration files as valid file paths or URLs.

Valid Run Examples

logprep run /different/path/file.yml
logprep run http://url-to-our-yaml-file-or-api
logprep run http://api/v1/pipeline http://api/v1/addition_processor_pipline /path/to/connector.yaml

Security Best Practice - Configuration - Combining multiple configuration files

Keep in mind that when multiple configuration files are used, Logprep rejects all of them if even one cannot be retrieved or is not valid. Ensure that every file can be loaded safely and, when using HTTP resources, that all endpoints are accessible.
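The all-or-nothing behavior described above can be illustrated with a minimal sketch. It is not Logprep's loader: the names load_all_or_nothing and ConfigLoadError are hypothetical, and only local JSON files are handled to keep the example self-contained.

```python
import json
from pathlib import Path


class ConfigLoadError(Exception):
    """Hypothetical error raised when any configuration source fails to load."""


def load_all_or_nothing(paths):
    """Return the parsed content of every file, or raise if a single one fails."""
    loaded = []
    for path in paths:
        try:
            text = Path(path).read_text(encoding="utf-8")
            loaded.append(json.loads(text))
        except (OSError, ValueError) as error:
            # One unreadable or invalid source invalidates the whole set,
            # so nothing that was loaded before it is returned.
            raise ConfigLoadError(f"cannot load {path}: {error}") from error
    return loaded
```

Because the function raises before returning anything, a caller never ends up with a partially loaded set of configurations.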

Configuration File Structure

Example of a complete configuration file

version: config-1.0
process_count: 2
restart_count: 5
timeout: 5
logger:
    level: INFO
input:
    kafka:
        type: confluentkafka_input
        topic: consumer
        offset_reset_policy: smallest
        kafka_config:
            bootstrap.servers: localhost:9092
            group.id: test
output:
    kafka:
        type: confluentkafka_output
        topic: producer
        error_topic: producer_error
        flush_timeout: 30
        send_timeout: 2
        kafka_config:
            bootstrap.servers: localhost:9092
pipeline:
- labelername:
    type: labeler
    schema: quickstart/exampledata/rules/labeler/schema.json
    include_parent_labels: true
    specific_rules:
        - quickstart/exampledata/rules/labeler/specific
    generic_rules:
        - quickstart/exampledata/rules/labeler/generic

- dissectorname:
    type: dissector
    specific_rules:
        - quickstart/exampledata/rules/dissector/specific/
    generic_rules:
        - quickstart/exampledata/rules/dissector/generic/

- dropper:
    type: dropper
    specific_rules:
        - quickstart/exampledata/rules/dropper/specific
    generic_rules:
        - quickstart/exampledata/rules/dropper/generic
        - filter: "test_dropper"
          dropper:
            drop:
            - drop_me
          description: "..."

- pre_detector:
    type: pre_detector
    specific_rules:
        - quickstart/exampledata/rules/pre_detector/specific
    generic_rules:
        - quickstart/exampledata/rules/pre_detector/generic
    outputs:
        - opensearch: sre
    tree_config: quickstart/exampledata/rules/pre_detector/tree_config.json
    alert_ip_list_path: quickstart/exampledata/rules/pre_detector/alert_ips.yml

- amides:
    type: amides
    specific_rules:
        - quickstart/exampledata/rules/amides/specific
    generic_rules:
        - quickstart/exampledata/rules/amides/generic
    models_path: quickstart/exampledata/models/model.zip
    num_rule_attributions: 10
    max_cache_entries: 1000000
    decision_threshold: 0.32

- pseudonymizer:
    type: pseudonymizer
    pubkey_analyst: quickstart/exampledata/rules/pseudonymizer/example_analyst_pub.pem
    pubkey_depseudo: quickstart/exampledata/rules/pseudonymizer/example_depseudo_pub.pem
    regex_mapping: quickstart/exampledata/rules/pseudonymizer/regex_mapping.yml
    hash_salt: a_secret_tasty_ingredient
    outputs:
        - opensearch: pseudonyms
    specific_rules:
        - quickstart/exampledata/rules/pseudonymizer/specific/
    generic_rules:
        - quickstart/exampledata/rules/pseudonymizer/generic/
    max_cached_pseudonyms: 1000000

- calculator:
    type: calculator
    specific_rules:
        - filter: "test_label: execute"
          calculator:
            target_field: "calculation"
            calc: "1 + 1"
    generic_rules: []

The options under input, output and pipeline are passed to factories in Logprep. They contain settings for each separate processor and connector. Details for configuring connectors are described in Output and Input and for processors in Processors.

Environment variables can be used anywhere in configuration and rule files. They have to be written in uppercase and prefixed with LOGPREP_, GITHUB_, PYTEST_ or CI_. Lowercase variables are ignored. The variable name LOGPREP_LIST is forbidden because it is already used internally.
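The naming rules above can be expressed as a small filter. This is a sketch under assumptions, not Logprep's code: the function name usable_variables is hypothetical, and the sketch simply skips the forbidden name, while Logprep's actual reaction to it is not described here.

```python
import os

# Prefixes and the forbidden name are taken from the paragraph above.
ALLOWED_PREFIXES = ("LOGPREP_", "GITHUB_", "PYTEST_", "CI_")
FORBIDDEN_NAMES = {"LOGPREP_LIST"}


def usable_variables(environ=None):
    """Return only the variables that satisfy the documented naming rules."""
    environ = os.environ if environ is None else environ
    return {
        name: value
        for name, value in environ.items()
        if name.startswith(ALLOWED_PREFIXES)  # must carry an allowed prefix
        and name.isupper()  # names with lowercase letters are ignored
        and name not in FORBIDDEN_NAMES  # reserved for internal use
    }
```

Passing a plain dictionary instead of os.environ makes it easy to check which names would be picked up before deploying a configuration.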

Security Best Practice - Configuration Environment Variables

Because every configuration option can be replaced by an environment variable, it is recommended to use environment variables for sensitive information such as usernames, passwords, secrets or hash salts. Examples are the key for the HMAC calculation (see input > preprocessing) or the user/secret for the elastic-/opensearch connectors.

The following config file becomes valid once the given environment variables are set:

pipeline.yml config file with environment variables

version: $LOGPREP_VERSION
process_count: $LOGPREP_PROCESS_COUNT
timeout: 0.1
logger:
    level: $LOGPREP_LOG_LEVEL
$LOGPREP_PIPELINE
$LOGPREP_INPUT
$LOGPREP_OUTPUT

setting the bash environment variables

export LOGPREP_VERSION="1"
export LOGPREP_PROCESS_COUNT="1"
export LOGPREP_LOG_LEVEL="DEBUG"
export LOGPREP_PIPELINE="
pipeline:
    - labelername:
        type: labeler
        schema: quickstart/exampledata/rules/labeler/schema.json
        include_parent_labels: true
        specific_rules:
            - quickstart/exampledata/rules/labeler/specific
        generic_rules:
            - quickstart/exampledata/rules/labeler/generic"
export LOGPREP_OUTPUT="
output:
    kafka:
        type: confluentkafka_output
        topic: producer
        error_topic: producer_error
        flush_timeout: 30
        send_timeout: 2
        kafka_config:
            bootstrap.servers: localhost:9092"
export LOGPREP_INPUT="
input:
    kafka:
        type: confluentkafka_input
        topic: consumer
        offset_reset_policy: smallest
        kafka_config:
            bootstrap.servers: localhost:9092
            group.id: test"
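How such placeholders could be expanded can be sketched with Python's string.Template. That Logprep works this way internally is an assumption; the helper name render and the shortened config text are illustrative only.

```python
from string import Template

# Shortened version of the pipeline.yml shown above.
CONFIG_TEXT = """\
version: $LOGPREP_VERSION
process_count: $LOGPREP_PROCESS_COUNT
logger:
    level: $LOGPREP_LOG_LEVEL
"""


def render(text, variables):
    """Replace $NAME placeholders; unknown placeholders are left untouched."""
    return Template(text).safe_substitute(variables)


rendered = render(
    CONFIG_TEXT,
    {
        "LOGPREP_VERSION": "1",
        "LOGPREP_PROCESS_COUNT": "1",
        "LOGPREP_LOG_LEVEL": "DEBUG",
    },
)
```

safe_substitute is used instead of substitute so that a missing variable leaves the placeholder visible in the result rather than raising an exception, which makes an unset variable easy to spot.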

class logprep.util.configuration.Configuration

The configuration class.

version: str: Optionally, a version can be set in the configuration file. It can be printed via logprep run --version config/pipeline.yml. This has no effect on the execution of Logprep and is merely used for documentation purposes. Defaults to unset.

config_refresh_interval: int | None: Configures the interval in seconds at which Logprep tries to reload the configuration. If not configured, Logprep won’t reload the configuration automatically. If configured, the configuration is only reloaded when the configuration version changes. If HTTP errors occur on configuration reload, config_refresh_interval is set to a quarter of its current value until a minimum of 5 seconds is reached. Defaults to None, which means that the configuration will not be refreshed.
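The reduction on HTTP errors can be sketched as follows. Integer division and the helper names are assumptions made for illustration; how Logprep rounds the quarter is not stated above.

```python
MIN_REFRESH_INTERVAL = 5  # seconds, the documented lower bound


def reduced_interval(current):
    """Quarter the interval after a failed reload, never dropping below the minimum."""
    return max(MIN_REFRESH_INTERVAL, current // 4)


def backoff_sequence(start):
    """List the intervals used while reloads keep failing."""
    sequence = [start]
    while sequence[-1] > MIN_REFRESH_INTERVAL:
        sequence.append(reduced_interval(sequence[-1]))
    return sequence


print(backoff_sequence(300))  # [300, 75, 18, 5]
```

Starting from 300 seconds, three consecutive failures are enough to reach the 5 second minimum under these assumptions.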

Security Best Practice - Configuration Refresh Interval

The refresh interval for the configuration shouldn’t be set too high in production environments. It is suggested to not set a value higher than 300 (5 min). That way configuration updates are propagated fairly quickly instead of once a day.

It should also be noted that a new configuration file will be read as long as it is a valid config. There is no further check to ensure credibility.

If a new configuration cannot be retrieved and the config_refresh_interval has already been reduced automatically to 5 seconds, Logprep retries the reload very frequently, which can lead to blocking behavior or a significant reduction in performance. For that reason, ensure that the configuration endpoint is always available.

process_count: int: Number of logprep processes to start. Defaults to 1.

timeout: float: Logprep tries to react to signals (like those sent by CTRL+C) within the given time. The time taken for some processing steps is not always predictable, so it is not possible to guarantee that this time will be adhered to. Logprep reacts quickly for small values (< 1.0), but this requires more processing power. This can be useful for testing and debugging. Larger values (like 5.0) slow the reaction time down but require less processing power, which makes them preferable for continuous operation. Defaults to 5.0.

logger: LoggerConfig

Logger configuration.

class LoggerConfig

The logger config class used in Configuration. The schema for this class is derived from the python logging module: https://docs.python.org/3/library/logging.config.html#dictionary-schema-details

LoggerConfig.level: str: The log level of the root logger. Defaults to INFO.

Security Best Practice - Logprep Log-Level

The log level of the root logger should be set to INFO or higher in production environments to avoid exposing sensitive information in the logs.

LoggerConfig.format: str

The format of the log message as supported by the LogprepFormatter. Defaults to "%(asctime)-15s %(name)-10s %(levelname)-8s: %(message)s".

class LogprepFormatter

A custom formatter for logprep logging with additional attributes.

The formatter can be initialized with a format string that makes use of the LogRecord attributes. For example, the default value mentioned above relies on the fact that the user’s message and arguments are pre-formatted into a LogRecord’s message attribute. The available attributes are listed in the Python documentation. Additionally, the formatter provides the following Logprep specific attributes:

attribute	description
%(hostname)	(Logprep specific) The hostname of the machine where the log was emitted
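A formatter that exposes such an extra attribute can be sketched with the standard logging module. This is an illustration under assumptions, not the LogprepFormatter source; the class name HostnameFormatter is hypothetical.

```python
import logging
import socket


class HostnameFormatter(logging.Formatter):
    """Formatter sketch that makes %(hostname)s available in format strings."""

    def format(self, record):
        # Attach the hostname to the record before the normal formatting runs.
        record.hostname = socket.gethostname()
        return super().format(record)


formatter = HostnameFormatter(
    "%(asctime)-15s %(hostname)-5s %(name)-10s %(levelname)-8s: %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
```

Setting the attribute inside format keeps the handler setup unchanged: any handler that uses this formatter can reference the hostname without a separate logging filter.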

LoggerConfig.datefmt: str: The date format of the log message. Defaults to "%Y-%m-%d %H:%M:%S".

LoggerConfig.loggers: dict

The loggers loglevel configuration. Defaults to:

root	INFO
filelock	ERROR
urllib3.connectionpool	ERROR
elasticsearch	ERROR
opensearch	ERROR
uvicorn	INFO
uvicorn.access	INFO
uvicorn.error	INFO

You can alter the log level of individual loggers by adding them to the loggers mapping, as in the example below. Logprep opts out of hierarchical loggers, so it is possible to set the general log level for all loggers in the root logger to INFO and then set the log level of a specific logger such as Runner to DEBUG to get DEBUG messages only from the Runner instance.

If you want to silence other loggers like py.warnings you can set the log level to ERROR here.

Example of a custom logger configuration

logger:
    level: ERROR
    format: "%(asctime)-15s %(hostname)-5s %(name)-10s %(levelname)-8s: %(message)s"
    datefmt: "%Y-%m-%d %H:%M:%S"
    loggers:
        "py.warnings": {"level": "ERROR"}
        "Runner": {"level": "DEBUG"}
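For comparison, the loggers mapping from the example can be applied with the standard logging module alone. This is a minimal sketch with a hypothetical helper name; note that standard library loggers remain hierarchical, unlike the Logprep behavior described above.

```python
import logging


def apply_logger_levels(loggers):
    """Set the level of each named logger from a mapping like the one above."""
    for name, options in loggers.items():
        # setLevel accepts level names such as "DEBUG" as well as numeric levels.
        logging.getLogger(name).setLevel(options["level"])


apply_logger_levels(
    {
        "py.warnings": {"level": "ERROR"},
        "Runner": {"level": "DEBUG"},
    }
)
```

After this call only the two named loggers have an explicit level; every other logger keeps inheriting its effective level from the root logger.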

input: dict: Input connector configuration. Defaults to {}. For detailed configurations see Input.

output: dict: Output connector configuration. Defaults to {}. For detailed configurations see Output.

pipeline: list[dict]: Pipeline configuration. Defaults to []. See Processors for a detailed overview on how to configure a pipeline.

metrics: MetricsConfig

Metrics configuration. Defaults to {"enabled": False, "port": 8000, "uvicorn_config": {}}.

The key uvicorn_config can be configured with any uvicorn config parameters. For further information see the uvicorn documentation.

Security Best Practice - Metrics Configuration

In addition to the settings below, it is recommended to configure SSL on the metrics server endpoint.

metrics:
  enabled: true
  port: 9000
  uvicorn_config:
    access_log: true
    server_header: false
    date_header: false
    workers: 1

profile_pipelines: bool: Start the profiler to profile the pipeline. Defaults to False.