Frequent Subgraph Mining

After generating multiple process graphs with the Process Mining module, the resulting graphs can be processed using Frequent Subgraph Mining to extract patterns. To generate the frequent sub-graphs (the patterns), the following code can be used.

Collaboration Process Instance Subgraph Mining (CPD / COPISUM)

This module extracts frequent collaboration patterns from graphs created by the Collaboration Process Instance Miner.

Algorithm Overview

The algorithm (called COPISUM - Collaboration Process Instance Subgraph Mining) works as follows:

  1. Candidate Generation: Identifies all connected subgraphs with activity nodes between min_node and max_node count

  2. Pattern Matching: Compares candidates across different traces using one of several matching algorithms

  3. Frequency Counting: Counts exact and relaxed matches to determine support

  4. Result Extraction: Returns subgraphs that meet the minimum support thresholds

Key Features

  • Relaxed Pattern Matching: Patterns can match even when they are not identical, based on a similarity threshold. This allows finding similar collaboration patterns that may have minor differences in node labels.

  • Object Node Preservation: Object nodes (e.g., resources, documents) and their relationships to activities are preserved in the discovered patterns, enabling analysis of the contextual context in which activities occur.

  • Flexible Matching Algorithms: Multiple algorithms available:

    • GraphEditDistance: Exact or approximate matching based on edit distance

    • CosineGraphSimilarity: Distribution-based similarity matching

    • VF2Isomorphism: Strict isomorphism detection

  • Activity-Centric: Focuses on activity node connectivity while treating object nodes as contextual attributes

Pattern Matching Algorithms

Algorithm

Description

Threshold Range

GraphEditDistance

Graph Edit Distance (exact node/edge matching)

0 = Exact, >0 for approximate

GraphkitLearnGraphEditDistance

GLEAD algorithm for larger graphs

0 = Exact, >0 for approximate

CosineGraphSimilarity

Cosine similarity on vertex/edge distributions

0-1 (1 = exact match)

VF2Isomorphism

VF2 isomorphism detection

Exact only

Parameters

The execute_copisum function accepts the following parameters:

Parameter

Type

Default

Description

graph_collection

GraphCollection

The baseline graph collection

min_node

int

3

Minimum number of activity nodes

max_node

int

4

Maximum number of activity nodes

object_attributes

Set[str]

None

Set of object attribute types to include

activity_node_type

str

"activity"

Type name for activity nodes

node_type_metadata_attribute

str

"v_type"

Metadata attribute for vertex types

matching_threshold

float/int

0

Threshold for pattern matching

support_exact

int

2

Minimum support for exact matches

support_relaxed

int

2

Minimum support for relaxed matches

subgraph_mining

SubgraphMiningAlgorithm

RustImpl

Algorithm implementation

pattern_matching_algorithm

PatternMatchingAlgorithm

CosineGraphSimilarity

Matching algorithm

Example Usage

Basic usage with default settings:

from collaboration_detection.datastructures import GraphCollection
from collaboration_detection.subgraph_mining import copisum

# Load or mine graphs
graph_collection = GraphCollection().load("graphs.db")

# Execute COPISUM
patterns = copisum(
    graph_collection=graph_collection,
    min_node=3,
    max_node=5,
    support_exact=2,
    support_relaxed=2
)

# Patterns are now in the graph_collection
for graph_id, graph in patterns.graphs.items():
    print(f"Pattern {graph_id}: support={graph.metadata.get('support', 0)}")

Advanced usage with custom configuration:

from collaboration_detection.datastructures import GraphCollection
from collaboration_detection.subgraph_mining import copisum
from collaboration_detection.subgraph_mining.collaboration_process_instance_subgraph_mining.algorithm import (
    SubgraphMiningAlgorithm,
    PatternMatchingAlgorithm,
)

graph_collection = GraphCollection().load("graphs.db")

patterns = copisum(
    graph_collection=graph_collection,
    min_node=2,
    max_node=6,
    object_attributes={"org:resource", "spm:sdid"},
    activity_node_type="activity",
    node_type_metadata_attribute="v_type",
    matching_threshold=0.9,
    support_exact=3,
    support_relaxed=2,
    subgraph_mining=SubgraphMiningAlgorithm.RustImpl,
    pattern_matching_algorithm=PatternMatchingAlgorithm.CosineGraphSimilarity
)

Using Graph Edit Distance for exact matching:

patterns = copisum(
    graph_collection=graph_collection,
    min_node=3,
    max_node=5,
    matching_threshold=0,  # 0 = exact match for GED
    pattern_matching_algorithm=PatternMatchingAlgorithm.GraphEditDistance,
    support_exact=2
)

Rust Implementation

The default implementation uses a Rust binary (cpd) for efficient subgraph mining. The binary is included for Linux, macOS (Intel/ARM), and Windows. The algorithm automatically selects the correct binary based on the platform.

gSpan (Exact Frequent Subgraph Mining)

The gSpan algorithm is used. This package requires the gSpan Java implementation.

from collaboration_detection.datastructures import GraphCollection
from collaboration_detection.subgraph_mining import gspan

# 1. create a graph collection (import or by mining new graphs)
g_c = GraphCollection().load("graphs.db")
# 2. Execute the gspan algorithm
gspan(
    data=g_c,
    min_node=3,
    max_node=10,
    sup=20,
    remove_data_graphs=True
)
# 3. All sub-graphs are now stored in the graph collection
g_c.graphs.items()