# Frequent Subgraph Mining After generating multiple process graphs with the [Process Mining](processMining.html) module, the resulting graphs can be processed using **Frequent Subgraph Mining** to extract patterns. To generate the frequent sub-graphs (the patterns), the following code can be used. ## Collaboration Process Instance Subgraph Mining (CPD / COPISUM) This module extracts frequent collaboration patterns from graphs created by the [Collaboration Process Instance Miner](processMining.html#collaboration-process-instance-miner). ### Algorithm Overview The algorithm (called **COPISUM** - Collaboration Process Instance Subgraph Mining) works as follows: 1. **Candidate Generation**: Identifies all connected subgraphs with activity nodes between `min_node` and `max_node` count 2. **Pattern Matching**: Compares candidates across different traces using one of several matching algorithms 3. **Frequency Counting**: Counts exact and relaxed matches to determine support 4. **Result Extraction**: Returns subgraphs that meet the minimum support thresholds ### Key Features - **Relaxed Pattern Matching**: Patterns can match even when they are not identical, based on a similarity threshold. This allows finding similar collaboration patterns that may have minor differences in node labels. - **Object Node Preservation**: Object nodes (e.g., resources, documents) and their relationships to activities are preserved in the discovered patterns, enabling analysis of the contextual context in which activities occur. - **Flexible Matching Algorithms**: Multiple algorithms available: - `GraphEditDistance`: Exact or approximate matching based on edit distance - `CosineGraphSimilarity`: Distribution-based similarity matching - `VF2Isomorphism`: Strict isomorphism detection - **Activity-Centric**: Focuses on activity node connectivity while treating object nodes as contextual attributes ### Pattern Matching Algorithms | Algorithm | Description | Threshold Range | |------|-----------|------| | `GraphEditDistance` | Graph Edit Distance (exact node/edge matching) | 0 = Exact, >0 for approximate | | `GraphkitLearnGraphEditDistance` | GLEAD algorithm for larger graphs | 0 = Exact, >0 for approximate | | `CosineGraphSimilarity` | Cosine similarity on vertex/edge distributions | 0-1 (1 = exact match) | | `VF2Isomorphism` | VF2 isomorphism detection | Exact only | ### Parameters The `execute_copisum` function accepts the following parameters: | Parameter | Type | Default | Description | |------|---|---------|--| | `graph_collection` | GraphCollection | - | The baseline graph collection | | `min_node` | int | 3 | Minimum number of activity nodes | | `max_node` | int | 4 | Maximum number of activity nodes | | `object_attributes` | Set[str] | `None` | Set of object attribute types to include | | `activity_node_type` | str | `"activity"` | Type name for activity nodes | | `node_type_metadata_attribute` | str | `"v_type"` | Metadata attribute for vertex types | | `matching_threshold` | float/int | 0 | Threshold for pattern matching | | `support_exact` | int | 2 | Minimum support for exact matches | | `support_relaxed` | int | 2 | Minimum support for relaxed matches | | `subgraph_mining` | SubgraphMiningAlgorithm | `RustImpl` | Algorithm implementation | | `pattern_matching_algorithm` | PatternMatchingAlgorithm | `CosineGraphSimilarity` | Matching algorithm | ### Example Usage Basic usage with default settings: ```python from collaboration_detection.datastructures import GraphCollection from collaboration_detection.subgraph_mining import copisum # Load or mine graphs graph_collection = GraphCollection().load("graphs.db") # Execute COPISUM patterns = copisum( graph_collection=graph_collection, min_node=3, max_node=5, support_exact=2, support_relaxed=2 ) # Patterns are now in the graph_collection for graph_id, graph in patterns.graphs.items(): print(f"Pattern {graph_id}: support={graph.metadata.get('support', 0)}") ``` Advanced usage with custom configuration: ```python from collaboration_detection.datastructures import GraphCollection from collaboration_detection.subgraph_mining import copisum from collaboration_detection.subgraph_mining.collaboration_process_instance_subgraph_mining.algorithm import ( SubgraphMiningAlgorithm, PatternMatchingAlgorithm, ) graph_collection = GraphCollection().load("graphs.db") patterns = copisum( graph_collection=graph_collection, min_node=2, max_node=6, object_attributes={"org:resource", "spm:sdid"}, activity_node_type="activity", node_type_metadata_attribute="v_type", matching_threshold=0.9, support_exact=3, support_relaxed=2, subgraph_mining=SubgraphMiningAlgorithm.RustImpl, pattern_matching_algorithm=PatternMatchingAlgorithm.CosineGraphSimilarity ) ``` Using Graph Edit Distance for exact matching: ```python patterns = copisum( graph_collection=graph_collection, min_node=3, max_node=5, matching_threshold=0, # 0 = exact match for GED pattern_matching_algorithm=PatternMatchingAlgorithm.GraphEditDistance, support_exact=2 ) ``` ### Rust Implementation The default implementation uses a Rust binary (`cpd`) for efficient subgraph mining. The binary is included for Linux, macOS (Intel/ARM), and Windows. The algorithm automatically selects the correct binary based on the platform. ## gSpan (Exact Frequent Subgraph Mining) The [gSpan](https://sites.cs.ucsb.edu/~xyan/software/gSpan.htm) algorithm is used. This package requires the [gSpan Java implementation](https://github.com/joleaf/gSpan.Java). ```python from collaboration_detection.datastructures import GraphCollection from collaboration_detection.subgraph_mining import gspan # 1. create a graph collection (import or by mining new graphs) g_c = GraphCollection().load("graphs.db") # 2. Execute the gspan algorithm gspan( data=g_c, min_node=3, max_node=10, sup=20, remove_data_graphs=True ) # 3. All sub-graphs are now stored in the graph collection g_c.graphs.items() ```