Isolation Forest: Detecting Anomalies Without Business Rules

In this article, we introduce Isolation Forest, an algorithm that searches for anomalies in an intuitive way. It doesn't look for specific errors but identifies observations that differ from the rest, regardless of the reason. That makes it a powerful tool for risk analysis and for finding the "unknown unknowns".

The Concept

The name "Isolation Forest" already gives away how it works. The algorithm tries to isolate every data point (for example, a transaction) from the rest.

Imagine picking a random transaction from your dataset. To isolate it, the algorithm asks random 'yes/no' questions about the data, for example: "Is the amount higher than €500?" or "Is the invoice date after the 15th?".

Normal transactions look very similar to each other and form a dense cluster; you need many questions to distinguish one specific normal transaction from the mass. Anomalies, on the other hand, have unique properties and lie, figuratively speaking, at the edge of the dataset. Often, very few questions are needed to isolate such a point. The core rule is simple: the faster a transaction is isolated, the higher the chance it is an anomaly.
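To make the 'yes/no questions' idea tangible, here is a minimal sketch in plain Python. It is not the real algorithm (which builds many trees over several columns), just a one-column toy with made-up amounts and a hypothetical helper called questions_to_isolate, showing that an extreme amount is isolated after far fewer random questions than a typical one.

```python
import random
import statistics

def questions_to_isolate(values, target, rng):
    """Count the random 'is the amount below X?' questions needed until target is alone."""
    remaining = list(values)
    questions = 0
    while len(remaining) > 1:
        low, high = min(remaining), max(remaining)
        if low == high:  # only identical values left, no further split possible
            break
        threshold = rng.uniform(low, high)
        # Keep only the side of the split that still contains the target
        if target < threshold:
            remaining = [v for v in remaining if v < threshold]
        else:
            remaining = [v for v in remaining if v >= threshold]
        questions += 1
    return questions

rng = random.Random(42)
normal_amounts = [rng.gauss(100, 10) for _ in range(200)]  # made-up 'normal' bookings
outlier_amount = 5000.0                                     # one made-up extreme booking
amounts = normal_amounts + [outlier_amount]

def average_questions(target, runs=200):
    """Average the question count over many random attempts."""
    return statistics.mean(
        questions_to_isolate(amounts, target, rng) for _ in range(runs)
    )

avg_normal = average_questions(normal_amounts[0])
avg_outlier = average_questions(outlier_amount)
print(f"Normal booking:  {avg_normal:.1f} questions on average")
print(f"Extreme booking: {avg_outlier:.1f} questions on average")
```

The averaging matters: a single run is random, so the algorithm only trusts the pattern that emerges over many attempts.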

The Difference with Other Analyses

In audit practice, we rely heavily on expectations. With a reconciliation, we check if 'Balance A' equals 'Balance B'. With a substantive analytical procedure, we verify if 'Volume x Price' matches 'Revenue'. And in other data analyses, we often write queries based on specific criteria (e.g., "Segregation of Duties" conflicts or "amounts above €10,000").

The common denominator in all these methods is that you define beforehand what you are looking for. You validate an existing expectation.

Isolation Forest turns this around. It is an agnostic method. It doesn't check whether a booking complies with your rules but looks purely at the data itself. It can surface patterns that you as an auditor might never have thought of. For example: a specific combination of a general ledger account and a cost center that almost never occurs elsewhere in the data, even if the amount is low and falls within budget. Where traditional analyses "verify", Isolation Forest "explores".

For Data Scientists: The Link with Random Forest

You might already be familiar with Random Forest, a popular algorithm for predictions. Isolation Forest shares its DNA with this technique; both methods use a collection ('forest') of decision trees to reach a result. The fundamental difference lies in the goal. Where Random Forest is 'supervised' and learns from labeled examples from the past (e.g., "these were fraudulent transactions"), Isolation Forest works 'unsupervised'. It needs no labels or known historical errors, only the data itself. It uses the structure of the trees solely to measure how 'isolated' a data point lies.
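For readers who want to see how 'isolated' becomes a number: the original Isolation Forest paper (Liu, Ting and Zhou, 2008) converts the average number of splits needed to isolate a point into a score between 0 and 1. The sketch below reproduces that formula in plain Python. The function names are our own, and scikit-learn reports the negated version of this score, so there a lower value means more anomalous.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): expected number of splits to isolate a point in a sample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of the harmonic number H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_depth, n):
    """s = 2^(-mean_depth / c(n)): close to 1 is anomalous, well below 0.5 is normal."""
    return 2.0 ** (-mean_depth / average_path_length(n))

sample_size = 256  # the default subsample size per tree in the paper
print("c(256):", round(average_path_length(sample_size), 2))
print("Isolated after 2 splits: ", round(anomaly_score(2, sample_size), 2))
print("Isolated after 12 splits:", round(anomaly_score(12, sample_size), 2))
```

A point that needs exactly the expected number of splits scores 0.5; the fewer splits it needs, the closer the score gets to 1.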

When to Use?

Because Isolation Forest looks for deviations from the norm, it only works well in populations where a "norm" or pattern actually exists. The technique is particularly relevant for:

Journal Entry Testing (JET): The enormous pile of journal entries contains many repetitive processes (depreciation, payroll entries). A manual correction that deviates from this pattern (different amount, different account combination) immediately floats to the surface.
Master Data Analysis: Analyze changes in creditor or debtor master data. A change in itself is not bad, but it deserves attention if it is made at a time or by a user that deviates from the normal management process.
Logistics Movements: In goods movements, weight, count, and product type are often strongly correlated. Isolation Forest can pick out shipments that "don't add up" (e.g., very heavy but small volume), which could indicate entry errors or possibly misappropriation.

Python Implementation

Let's see how this works in Python. In this example, we simply use the 'Amount' column and the 'General Ledger Account'. We don't need to specify ourselves which amounts belong to which account; the model picks up that relationship from the data. Make sure you have the 'pandas' and 'scikit-learn' packages installed.

import pandas as pd
from sklearn.ensemble import IsolationForest

# 1. Load Data
df = pd.read_csv("journal_entries.csv")

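# 2. Feature Selection
# We use 'raw' data. The idea is that the model picks up that
# high amounts are normal on account 4000 (Revenue),
# but unusual on account 4500 (Canteen expenses).
# Note: the account number is treated as a plain number here,
# so accounts with nearby numbers count as 'similar'.
features = [
    "Amount",
    "General_Ledger_Account_Nr"
]

# scikit-learn's IsolationForest typically rejects missing values,
# so we fill them with 0. Check how many rows this affects:
# a filled-in 0 can itself look like an outlier.
X = df[features].fillna(0)

# 3. Train the Model
# contamination=0.01: we assume roughly 1% of the entries are outliers;
#   this sets the cut-off at which a score counts as anomalous
# random_state=42: ensures reproducible results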
model = IsolationForest(
    n_estimators=100,
    contamination=0.01,
    random_state=42
)

model.fit(X)

# 4. Assign Scores
# decision_function gives a score: the lower, the more anomalous
# (values below 0 fall under the contamination cut-off)
df["score"] = model.decision_function(X)

# 5. Show Results
# We sort by score to get the biggest anomalies at the top
anomalies = df.sort_values("score").head(10)

print(anomalies)
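If you prefer a yes/no flag over a ranking, the fitted model also offers predict(), which returns -1 for entries below the contamination cut-off and 1 for the rest. The sketch below shows this on a small synthetic dataset (the 255 'normal' bookings and the single booking of 25,000 are made up for illustration), so it runs without the CSV file.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for journal_entries.csv: 255 ordinary bookings plus one extreme one
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Amount": np.append(rng.normal(100, 15, size=255), [25000.0]),
    "General_Ledger_Account_Nr": np.append(
        rng.choice([4000, 4100, 4500], size=255), [4500]
    ),
})
features = ["Amount", "General_Ledger_Account_Nr"]
X_demo = demo[features]

demo_model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
demo_model.fit(X_demo)

demo["score"] = demo_model.decision_function(X_demo)
demo["label"] = demo_model.predict(X_demo)  # -1 = flagged as anomaly, 1 = normal

flagged = demo[demo["label"] == -1].sort_values("score")
print(flagged)
print(f"{len(flagged)} of {len(demo)} rows flagged")
```

The flag and the score tell the same story: a row is labeled -1 exactly when its decision_function value is below 0.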

The Boundaries

Although Isolation Forest can be incredibly useful, it remains a statistical model without substantive professional knowledge. The algorithm might label a booking of one cent as the biggest deviation in the dataset simply because the amount lies far from all other amounts, while such an item is rarely materially relevant for the financial statement audit. As an auditor, you must therefore always filter the results yourself based on materiality and professional relevance.

Moreover, the model is context-blind; a legitimate but one-off event, such as the purchase of a company building, will usually stand out statistically. The algorithm does not tell you why something deviates, meaning the interpretation and the final judgment always remain with the auditor.
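Such a materiality filter can be a single line on top of the scored data. Below is a minimal sketch with a handful of made-up rows; the threshold of 5,000 is purely an assumption for illustration and should come from your own materiality calculation.

```python
import pandas as pd

# Made-up scored output; in practice this is your DataFrame with the "score" column
scored = pd.DataFrame({
    "Amount": [0.01, 125000.00, -48000.00, 310.50],
    "score": [-0.21, -0.15, -0.09, 0.04],
})

MATERIALITY_THRESHOLD = 5000.00  # assumption: replace with your own threshold

# Keep rows that the model scored as anomalous (score below 0)
# and whose absolute amount is large enough to matter
is_anomalous = scored["score"] < 0
is_material = scored["Amount"].abs() >= MATERIALITY_THRESHOLD
candidates = scored[is_anomalous & is_material].sort_values("score")

print(candidates)
```

Using the absolute amount keeps large credit bookings in scope as well.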

Is it a "Black Box"?

With some algorithms, you run into a "black box" or even hallucinations, where a result is not really traceable. This is much less the case with Isolation Forest. Because the technique is built on ordinary decision trees, every score can be traced back to the splits that produced it and to the average depth at which a record was isolated. Furthermore, the analysis is reproducible by fixing a so-called 'seed' (the 'random_state' in the code example). With the same data, the same parameters, and the same library version, the random splits are then identical on every run, which is crucial for the auditability of your file.
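You can check this reproducibility yourself. The sketch below fits the model several times on the same synthetic data (random numbers, purely for illustration) and compares the scores: the same seed gives identical results, a different seed does not.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data, only used to demonstrate the effect of the seed
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 2))

def fit_and_score(seed):
    """Fit a fresh model with the given seed and return the scores."""
    model = IsolationForest(n_estimators=100, random_state=seed)
    return model.fit(X_demo).decision_function(X_demo)

same_seed_identical = np.array_equal(fit_and_score(42), fit_and_score(42))
other_seed_identical = np.array_equal(fit_and_score(42), fit_and_score(7))

print("Same seed, identical scores:     ", same_seed_identical)
print("Different seed, identical scores:", other_seed_identical)
```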

In the documentation for your audit file, you therefore do not describe this algorithm as a replacement for your judgment, but as an advanced selection method that helps direct your attention. By recording the parameters used and the specific 'random_state', you make the way you arrived at your 'exceptions' fully transparent for a reviewer or supervisor. Keep in mind that the outcome is a supporting analysis, to be used alongside the other procedures you perform.
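One way to record those parameters is to export them straight from the model object. The sketch below writes them to JSON together with the scikit-learn version; the structure of the record (the key names) is our own suggestion, not a prescribed format.

```python
import json

import sklearn
from sklearn.ensemble import IsolationForest

model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)

# get_params() returns every setting of the model, including the defaults
run_record = {
    "algorithm": "IsolationForest",
    "scikit_learn_version": sklearn.__version__,
    "features": ["Amount", "General_Ledger_Account_Nr"],
    "parameters": model.get_params(),
}

record_json = json.dumps(run_record, indent=2)
print(record_json)
```

Saving this text alongside the input file gives a reviewer what is needed to rerun the analysis.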

Up to you!

This was a short introduction to the 'Isolation Forest' algorithm. Grab the example script and try it out yourself!
