What do you need?
We will use Python and only the external packages pandas and numpy. If you haven't installed them yet, you can do so with the following command:
pip install pandas numpy
Additionally, we will work with dummy data, which is convenient because we can easily tweak it. However, the fields we use are quite common. To make it as realistic as possible, we will start with two files:
We will need to load these files, which we do as follows:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Load datasets
invoices = pd.read_csv("dummy_invoices.csv", parse_dates=["Issue Date"])
vendors = pd.read_csv("dummy_vendors.csv", parse_dates=["CreationDate"])
1. Identifying Suppliers with the Largest Expenditures
A common analysis is determining which suppliers generate the highest expenses. For example, you can discover if 20% of the suppliers are responsible for 80% of the costs.
Steps:
- Sum
InvoiceAmount
per supplier. - Sort the results in descending order of total amount.
- Optionally, calculate a cumulative percentage to determine the "top X%."
# Sum invoice amounts per supplier
spend_per_vendor = invoices.groupby('VendorID')['InvoiceAmount'].sum().reset_index()
# Sort in descending order of expenses
spend_per_vendor = spend_per_vendor.sort_values(by='InvoiceAmount', ascending=False)
# Add supplier name (VendorName) to the DataFrame
spend_per_vendor = spend_per_vendor.merge(vendors[['VendorID', 'VendorName']], on='VendorID', how='left')
# Calculate cumulative percentage
total_spend = spend_per_vendor['InvoiceAmount'].sum()
spend_per_vendor['CumulativePercent'] = (spend_per_vendor['InvoiceAmount'].cumsum() / total_spend * 100)
print("Expenses per supplier (descending order):")
print(spend_per_vendor)
Interpretation:
You can quickly see which suppliers account for the majority of the costs, helping you identify the largest risks or cost-saving opportunities.
2. First Invoices from New Suppliers
It can be interesting to examine suppliers that were recently added to the system (e.g., created within the last X months) and analyze how quickly and how often they invoice.
Steps:
- Define what "new" means. In this example, we define a supplier as new if the
CreationDate
is less than 90 days ago. - Merge invoices with master data.
- Filter invoices from these new suppliers and analyze frequency and total amount.
# Set current date for analysis
today = datetime(2025, 2, 1)
# Label suppliers as 'new' if created less than 90 days ago
vendors['IsNew'] = vendors['CreationDate'].apply(lambda x: (today - x).days < 90)
# Merge invoice data with supplier data
invoice_details = invoices.merge(vendors, on='VendorID', how='left')
# Filter for new suppliers
new_vendors_invoices = invoice_details[invoice_details['IsNew']]
# Frequency and total amount per new supplier
analysis_new_vendors = new_vendors_invoices.groupby('VendorName').agg({
'InvoiceID': 'count',
'InvoiceAmount': 'sum'
}).rename(columns={'InvoiceID': 'InvoiceCount'})
print("Payments to new suppliers (<90 days in system):")
print(analysis_new_vendors)
Interpretation:
Now you can see how often (and for what amount) you have paid new suppliers. Based on this information, you can recognize risk scenarios, such as a new supplier immediately invoicing a high amount.
3. Suppliers Submitting Many Invoices or High Amounts in a Short Period
To detect fraud or unusual patterns, you can analyze how invoices accumulate over time per supplier.
Steps:
- Merge invoice data with supplier master data.
- Group by supplier and a fixed period (e.g., per month or per week).
- Count the number of invoices and/or sum the amounts for that period.
- Identify abnormal spikes (e.g., sudden increases).
# Add month information for grouping
invoice_details['InvoiceMonth'] = invoice_details['Issue Date'].dt.to_period('M')
# Group by supplier and month
monthly_stats = invoice_details.groupby(['VendorID', 'InvoiceMonth']).agg({
'InvoiceID': 'count',
'InvoiceAmount': 'sum'
}).reset_index()
# Example: Mark suppliers exceeding a threshold in a given month
threshold_amount = 10000 # Arbitrary threshold, adjust as needed
monthly_stats['HighSpending'] = monthly_stats['InvoiceAmount'] > threshold_amount
print("Monthly invoice statistics per supplier:")
print(monthly_stats)
# Optionally merge supplier names for readability
monthly_stats = monthly_stats.merge(vendors[['VendorID', 'VendorName']], on='VendorID', how='left')
print("Monthly invoice statistics with supplier names:")
print(monthly_stats[['VendorName', 'InvoiceMonth', 'InvoiceID', 'InvoiceAmount', 'HighSpending']])
Interpretation:
This helps identify suppliers who suddenly submit many invoices in a short time. Adjusting the threshold (or using statistical methods to detect outliers) can make this analysis more or less sensitive.
4. Comparing Historical Trends
Another way to detect anomalies is by comparing historical averages with the current period. For instance, you can see if a supplier suddenly invoices three times their usual amount.
Steps:
- Calculate total amounts per supplier per month.
- Compare the most recent month (or period) with the average of previous months.
- Define a percentage or factor to flag anomalies.
# Use the monthly_stats from the previous step
# Calculate average and standard deviation (e.g., over past n months)
historical_summary = monthly_stats.groupby('VendorID')['InvoiceAmount'].agg(['mean', 'std']).reset_index()
# Merge calculated statistics back
monthly_stats = monthly_stats.merge(historical_summary, on='VendorID', how='left', suffixes=('', '_Hist'))
# Simple rule: InvoiceAmount > mean + 2 * std
monthly_stats['SignificantIncrease'] = monthly_stats['InvoiceAmount'] > (monthly_stats['mean'] + 2 * monthly_stats['std'])
print("Comparison of current month with historical averages:")
print(monthly_stats[['VendorName', 'InvoiceMonth', 'InvoiceAmount', 'mean', 'std', 'SignificantIncrease']])
Interpretation:
This helps detect if a supplier exhibits a sudden spike. If the standard deviation is low and the current period far exceeds the average, it may indicate abnormal activity requiring further investigation.
5. Suppliers Based on Their Location
It can be useful to see where your suppliers are located and whether suppliers from unusual countries/regions suddenly appear.
Steps:
- Use the field
Region
(orCountry
). - Count the number of suppliers per region or total invoice amount per region.
- Identify regions that deviate from expectations.
# Count number of suppliers per region
region_count = vendors.groupby('Region')['VendorID'].count().reset_index().rename(columns={'VendorID': 'VendorCount'})
# Optionally, total invoice amount per region
invoices_by_region = invoices.merge(vendors, on='VendorID', how='left')
region_spend = invoices_by_region.groupby('Region')['InvoiceAmount'].sum().reset_index().rename(columns={'InvoiceAmount': 'TotalSpend'})
print("Number of suppliers per region:")
print(region_count)
print("Total spend per region:")
print(region_spend)