Frequent Subgraph Mining
After generating multiple process graphs with the Process Mining module, the resulting graphs can be processed using Frequent Subgraph Mining to extract patterns. To generate the frequent sub-graphs (the patterns), the following code can be used.
Collaboration Process Instance Subgraph Mining (CPD / COPISUM)
This module extracts frequent collaboration patterns from graphs created by the Collaboration Process Instance Miner.
Algorithm Overview
The algorithm (called COPISUM - Collaboration Process Instance Subgraph Mining) works as follows:
Candidate Generation: Identifies all connected subgraphs with activity nodes between
min_nodeandmax_nodecountPattern Matching: Compares candidates across different traces using one of several matching algorithms
Frequency Counting: Counts exact and relaxed matches to determine support
Result Extraction: Returns subgraphs that meet the minimum support thresholds
Key Features
Relaxed Pattern Matching: Patterns can match even when they are not identical, based on a similarity threshold. This allows finding similar collaboration patterns that may have minor differences in node labels.
Object Node Preservation: Object nodes (e.g., resources, documents) and their relationships to activities are preserved in the discovered patterns, enabling analysis of the contextual context in which activities occur.
Flexible Matching Algorithms: Multiple algorithms available:
GraphEditDistance: Exact or approximate matching based on edit distanceCosineGraphSimilarity: Distribution-based similarity matchingVF2Isomorphism: Strict isomorphism detection
Activity-Centric: Focuses on activity node connectivity while treating object nodes as contextual attributes
Pattern Matching Algorithms
Algorithm |
Description |
Threshold Range |
|---|---|---|
|
Graph Edit Distance (exact node/edge matching) |
0 = Exact, >0 for approximate |
|
GLEAD algorithm for larger graphs |
0 = Exact, >0 for approximate |
|
Cosine similarity on vertex/edge distributions |
0-1 (1 = exact match) |
|
VF2 isomorphism detection |
Exact only |
Parameters
The execute_copisum function accepts the following parameters:
Parameter |
Type |
Default |
Description |
|---|---|---|---|
|
GraphCollection |
The baseline graph collection |
|
|
int |
3 |
Minimum number of activity nodes |
|
int |
4 |
Maximum number of activity nodes |
|
Set[str] |
|
Set of object attribute types to include |
|
str |
|
Type name for activity nodes |
|
str |
|
Metadata attribute for vertex types |
|
float/int |
0 |
Threshold for pattern matching |
|
int |
2 |
Minimum support for exact matches |
|
int |
2 |
Minimum support for relaxed matches |
|
SubgraphMiningAlgorithm |
|
Algorithm implementation |
|
PatternMatchingAlgorithm |
|
Matching algorithm |
Example Usage
Basic usage with default settings:
from collaboration_detection.datastructures import GraphCollection
from collaboration_detection.subgraph_mining import copisum
# Load or mine graphs
graph_collection = GraphCollection().load("graphs.db")
# Execute COPISUM
patterns = copisum(
graph_collection=graph_collection,
min_node=3,
max_node=5,
support_exact=2,
support_relaxed=2
)
# Patterns are now in the graph_collection
for graph_id, graph in patterns.graphs.items():
print(f"Pattern {graph_id}: support={graph.metadata.get('support', 0)}")
Advanced usage with custom configuration:
from collaboration_detection.datastructures import GraphCollection
from collaboration_detection.subgraph_mining import copisum
from collaboration_detection.subgraph_mining.collaboration_process_instance_subgraph_mining.algorithm import (
SubgraphMiningAlgorithm,
PatternMatchingAlgorithm,
)
graph_collection = GraphCollection().load("graphs.db")
patterns = copisum(
graph_collection=graph_collection,
min_node=2,
max_node=6,
object_attributes={"org:resource", "spm:sdid"},
activity_node_type="activity",
node_type_metadata_attribute="v_type",
matching_threshold=0.9,
support_exact=3,
support_relaxed=2,
subgraph_mining=SubgraphMiningAlgorithm.RustImpl,
pattern_matching_algorithm=PatternMatchingAlgorithm.CosineGraphSimilarity
)
Using Graph Edit Distance for exact matching:
patterns = copisum(
graph_collection=graph_collection,
min_node=3,
max_node=5,
matching_threshold=0, # 0 = exact match for GED
pattern_matching_algorithm=PatternMatchingAlgorithm.GraphEditDistance,
support_exact=2
)
Rust Implementation
The default implementation uses a Rust binary (cpd) for efficient subgraph mining. The binary is included for Linux, macOS (Intel/ARM), and Windows. The algorithm automatically selects the correct binary based on the platform.
gSpan (Exact Frequent Subgraph Mining)
The gSpan algorithm is used. This package requires the gSpan Java implementation.
from collaboration_detection.datastructures import GraphCollection
from collaboration_detection.subgraph_mining import gspan
# 1. create a graph collection (import or by mining new graphs)
g_c = GraphCollection().load("graphs.db")
# 2. Execute the gspan algorithm
gspan(
data=g_c,
min_node=3,
max_node=10,
sup=20,
remove_data_graphs=True
)
# 3. All sub-graphs are now stored in the graph collection
g_c.graphs.items()