Logo
Afbeelding

Pandas for Auditors

Written by The Audit Analytics | 4 minutes

As an auditor, you work with large amounts of data these days. As mentioned in other articles, I recommend Python for this. Python comes with a lot of libraries to use. One of the most powerful libraries is 'Pandas,' specifically designed for data analysis. In this article, we’ll show you some useful functions. Don’t worry if you have no programming experience – we’ll keep it practical and simple!

Installing Python

What is a DataFrame?

A DataFrame is similar to an Excel-spreadsheet and can be seen as a table within Python. It consists of rows and columns and helps you analyze data in a structured way.

You can easily create a DataFrame yourself:

import pandas as pd

# Create a simple dataset
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'Salary': [50000, 60000, 70000]}

# Create DataFrame
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age  Salary
0    Alice   25  50000
1      Bob   30  60000
2  Charlie   35  70000

This is a simple table with three columns: Name, Age, and Salary.

Most of the times, you will probably not come up with data yourself, so below you will find how to import data:

# Read a CSV file
df_csv = pd.read_csv('file.csv')

# Read an Excel file
df_excel = pd.read_excel('file.xlsx')

# Read a text file (with tab delimiter)
df_txt = pd.read_csv('file.txt', delimiter='	')

Useful Pandas Functions for Auditors

In coming sections I will show you some examples of how and why to use some common functions of pandas.

1. Quickly Explore Data

Before analyzing a dataset, you’ll want to understand its structure. Use these commands:

df.head()  # Shows the first 5 rows
df.info()  # Displays information about the column data types
df.describe()  # Provides statistics such as mean, min, and max

2. Filtering Data

Sometimes, you need a subset of data that meets certain conditions. This is easy with filtering.

high_salaries = df[df['Salary'] > 55000]
print(high_salaries)

The high_salaries variable is also a DataFrame, so you can apply the same functions to it.

3. Finding Duplicate Values

Want to check if there are duplicate invoice numbers in a dataset?

duplicates = df[df.duplicated(['InvoiceNumber'], keep=False)]
print(duplicates)

4. Counting Values

Which departments appear most frequently in a dataset?

departments['Department'].value_counts()

5. Grouping Data with Groupby (Pivot Table Functionality)

Pivot tables are widely used in Excel, but you can generate them easily with Python. Want to quickly calculate totals or averages per category, like the average salary per department? Use the groupby function:

average_salary_per_department = df.groupby('Department')['Salary'].mean()
print(average_salary_per_department)

This groups the data by the 'Department' column and calculates the average salary per group. You can also add multiple calculations:

df.groupby('Department').agg({'Salary': ['mean', 'sum', 'count']})

This provides not just the mean, but also the total and count of employees per department.

6. Exporting Data to Excel

After your analysis, you’ll often want to share the results in Excel. Here’s how:

df.to_excel('audit_results.xlsx', index=False)

How to Combine Datasets? (Merge)

Now for something more advanced. Sometimes, data is spread across multiple tables, and you need to merge them. Pandas has the merge function, similar to VLOOKUP in Excel, but much more powerful.

Suppose you have one dataset with employees and another with their departments:

# First dataset: employees
employees = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

# Second dataset: departments
departments = pd.DataFrame({
    'ID': [1, 2, 3],
    'Department': ['Finance', 'IT', 'Audit']
})

# Merge the datasets on the ID column
result = pd.merge(employees, departments, on='ID')
print(result)

Output:

   ID     Name  Department
0   1   Alice  Finance
1   2     Bob       IT
2   3  Charlie   Audit

With pd.merge(), you can easily combine datasets based on a common column, such as an ID number. This saves a lot of time when analyzing audit data! If you need to match two differently named columns, use the left_on and right_on arguments:

result = pd.merge(employees, departments, left_on='Employee_ID', right_on='EM_ID')

Conclusion

Well, that was a super-fast Pandas crash course! But I hope you now see how useful it is. Want to learn more? Start with small datasets and try running your own analyses. You’ll soon discover that Pandas not only makes work easier but also much more enjoyable!