# 2. Preprocessing & Process Discovery Configuration

Preprocessing and process model generation run automatically after the [Log Import](1_Import_Event_Log.md) with a
default configuration.
Both can be triggered again with a custom configuration that is defined in this step.
Submitting the configuration form preprocesses the log again.
The resulting [Process Models](3_Process_Models.md) can be viewed in the third step.

![img.png](img/pm_config.gif)

## Configuration

This form defines custom parameters for the preprocessing and the process discovery algorithm.

### Preprocessing

#### Split attributes

The `Split Attributes` are used to split each original trace into multiple traces, which are then used to
create the sub-logs.
For instance, a trace with three social documents (`spm:sdid`) is split into three traces if the attribute `spm:sdid` is
used as a split attribute.
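
The splitting step can be pictured with a minimal sketch. It assumes that a trace is a plain list of event dictionaries; the function name `split_trace` and the sample events are hypothetical and are not part of the framework.

```python
from collections import defaultdict


def split_trace(trace, split_attribute):
    """Split one trace into one sub-trace per distinct value of split_attribute."""
    groups = defaultdict(list)
    for event in trace:
        # Assumption: events without the attribute are grouped under None.
        groups[event.get(split_attribute)].append(event)
    # Dictionaries keep insertion order, so sub-traces appear in order of first occurrence.
    return list(groups.values())


# Hypothetical trace that touches three social documents.
trace = [
    {"activity": "create", "spm:sdid": "doc1"},
    {"activity": "comment", "spm:sdid": "doc2"},
    {"activity": "edit", "spm:sdid": "doc1"},
    {"activity": "like", "spm:sdid": "doc3"},
]
sub_traces = split_trace(trace, "spm:sdid")
print(len(sub_traces))
```

Each sub-trace keeps the original order of its events, so the two `doc1` events stay together in one trace.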

#### Similarity attributes

The `Similarity Attributes` define event attributes that are used to combine traces into sub-logs.
Traces that (1) have at least two events with equal attribute values and (2) overlap in their timeframe are
merged into sub-logs.

#### Timestamp overlap delta

A time delta that is added to the timestamps of events. It extends the timeframe within which traces are considered
overlapping and can therefore be combined into sub-logs.
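
One possible reading of the overlap check is sketched below. It assumes that each trace is reduced to its list of event timestamps and that the delta widens the timeframe at both ends; the function name `timeframes_overlap` is hypothetical, and the framework may apply the delta differently.

```python
from datetime import datetime, timedelta


def timeframes_overlap(timestamps_a, timestamps_b, delta=timedelta(0)):
    """Return True if the two timeframes overlap once each is widened by delta."""
    # Assumption: the delta is subtracted from the first and added to the last timestamp.
    start_a, end_a = min(timestamps_a) - delta, max(timestamps_a) + delta
    start_b, end_b = min(timestamps_b) - delta, max(timestamps_b) + delta
    return start_a <= end_b and start_b <= end_a


# Hypothetical traces: the second one starts 10 minutes after the first one ends.
trace_a = [datetime(2023, 1, 1, 10, 0), datetime(2023, 1, 1, 10, 30)]
trace_b = [datetime(2023, 1, 1, 10, 40), datetime(2023, 1, 1, 11, 0)]

without_delta = timeframes_overlap(trace_a, trace_b)
with_delta = timeframes_overlap(trace_a, trace_b, delta=timedelta(minutes=10))
print(without_delta, with_delta)
```

A larger delta lets traces that lie further apart in time end up in the same sub-log.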

#### Minimum Number of Resources

Defines the minimum number of resources `n`. All sub-logs with fewer than `n` resources are dropped.
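
As a rough illustration, the filter could look like the following sketch. It assumes that a sub-log is a list of traces, that a trace is a list of event dictionaries, and that the resource is stored under a hypothetical `resource` key.

```python
def filter_sublogs(sublogs, min_resources, resource_key="resource"):
    """Keep only sub-logs that involve at least min_resources distinct resources."""
    kept = []
    for sublog in sublogs:
        # Collect the distinct resources over all events of all traces in the sub-log.
        resources = {event[resource_key] for trace in sublog for event in trace}
        if len(resources) >= min_resources:
            kept.append(sublog)
    return kept


# Hypothetical sub-logs: the first involves two resources, the second only one.
sublogs = [
    [[{"resource": "alice"}, {"resource": "bob"}], [{"resource": "alice"}]],
    [[{"resource": "carol"}, {"resource": "carol"}]],
]
kept = filter_sublogs(sublogs, min_resources=2)
print(len(kept))
```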

#### Artificial start and end activities

An artificial start and end event can be added to each trace of a created sub-log.
For example, the case `A->B->C` becomes `Start->A->B->C->End`.
This behaviour can be configured.
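
The effect on a single trace can be sketched as follows, assuming that a trace is a list of activity labels. The function name `add_artificial_events` is hypothetical.

```python
def add_artificial_events(trace, start_label="Start", end_label="End"):
    """Return a copy of the trace wrapped in artificial start and end activities."""
    return [start_label, *trace, end_label]


wrapped = add_artificial_events(["A", "B", "C"])
print("->".join(wrapped))
```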

### Process discovery

Currently, the *Collaboration Instance Miner* (default), the *DFG-Miner*, or the *Heuristic-Miner* can be selected and
used for the process instance model discovery.
For the Heuristic Miner, the parameters `Dependency Threshold`, `And Threshold`, and `Loop Two Threshold` can be
defined [1].

<sup>[1] For details see _Weijters, A. J. M. M., van der Aalst, W. M. P., & Alves de Medeiros, A. K. (2006). Process
Mining with the HeuristicsMiner-algorithm._</sup>

---
> **_NOTE:_** The *Collaboration Pattern Detection Framework* uses the [pm4py](https://pm4py.fit.fraunhofer.de/)
> library for the discovery algorithms.
---

#### Collaboration Process Instance Miner

The *Collaboration Process Instance Miner* is based on the DFG miner with additional features:

- Categorical attributes can be used as an additional part of a composed activity. The category is converted into
  an incremented label. For example, if the *Resource* attribute is used, the first resource becomes "Resource 1", the
  second "Resource 2", and so on. This ensures that the subsequent pattern recognition step can recognize patterns based
  on the labels (i.e., it is irrelevant whether user "A" or user "B" starts a discussion).
- For each sub-log (process instance), the corresponding cases are merged into a single long trace (sorted by timestamp).
  This ensures that the handoff between different social documents (`spm:sdid`) is visible.
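
The two features above can be sketched in plain Python. This is an illustration under stated assumptions, not the framework's implementation: events are dictionaries with hypothetical `activity`, `resource`, and `timestamp` keys, the composed activity format `activity (Resource n)` is an assumption, and the directly-follows counting stands in for the DFG miner.

```python
from collections import Counter


def merge_cases(cases):
    """Merge all cases of a sub-log into one trace sorted by timestamp."""
    return sorted((e for case in cases for e in case), key=lambda e: e["timestamp"])


def composed_activities(trace, attribute="resource", prefix="Resource"):
    """Compose each activity with an incremented label for the categorical attribute."""
    labels = {}
    result = []
    for event in trace:
        value = event[attribute]
        if value not in labels:
            # The first distinct value becomes "<prefix> 1", the second "<prefix> 2", etc.
            labels[value] = f"{prefix} {len(labels) + 1}"
        result.append(f"{event['activity']} ({labels[value]})")
    return result


def directly_follows(activities):
    """Count how often one activity is directly followed by another."""
    return Counter(zip(activities, activities[1:]))


# Hypothetical sub-log with two cases (one per social document).
cases = [
    [{"activity": "post", "resource": "alice", "timestamp": 1},
     {"activity": "edit", "resource": "alice", "timestamp": 4}],
    [{"activity": "comment", "resource": "bob", "timestamp": 2},
     {"activity": "comment", "resource": "alice", "timestamp": 3}],
]
activities = composed_activities(merge_cases(cases))
dfg = directly_follows(activities)
print(activities)
```

Because the labels depend only on the order in which resources first appear, swapping the names of the two users yields the same composed activities and therefore the same graph.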