What is a DataFrame?
A DataFrame is similar to an Excel-spreadsheet and can be seen as a table within Python. It consists of rows and columns and helps you analyze data in a structured way.
You can easily create a DataFrame yourself:
import pandas as pd
# Create a simple dataset
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]}
# Create DataFrame
df = pd.DataFrame(data)
print(df)
Output:
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
This is a simple table with three columns: Name, Age, and Salary.
Most of the times, you will probably not come up with data yourself, so below you will find how to import data:
# Read a CSV file
df_csv = pd.read_csv('file.csv')
# Read an Excel file
df_excel = pd.read_excel('file.xlsx')
# Read a text file (with tab delimiter)
df_txt = pd.read_csv('file.txt', delimiter=' ')
Useful Pandas Functions for Auditors
In coming sections I will show you some examples of how and why to use some common functions of pandas.
1. Quickly Explore Data
Before analyzing a dataset, you’ll want to understand its structure. Use these commands:
df.head() # Shows the first 5 rows
df.info() # Displays information about the column data types
df.describe() # Provides statistics such as mean, min, and max
2. Filtering Data
Sometimes, you need a subset of data that meets certain conditions. This is easy with filtering.
high_salaries = df[df['Salary'] > 55000]
print(high_salaries)
The high_salaries
variable is also a DataFrame, so you can apply the same functions to it.
3. Finding Duplicate Values
Want to check if there are duplicate invoice numbers in a dataset?
duplicates = df[df.duplicated(['InvoiceNumber'], keep=False)]
print(duplicates)
4. Counting Values
Which departments appear most frequently in a dataset?
departments['Department'].value_counts()
5. Grouping Data with Groupby (Pivot Table Functionality)
Pivot tables are widely used in Excel, but you can generate them easily with Python. Want to quickly calculate totals or averages per category, like the average salary per department? Use the groupby function:
average_salary_per_department = df.groupby('Department')['Salary'].mean()
print(average_salary_per_department)
This groups the data by the 'Department' column and calculates the average salary per group. You can also add multiple calculations:
df.groupby('Department').agg({'Salary': ['mean', 'sum', 'count']})
This provides not just the mean, but also the total and count of employees per department.
6. Exporting Data to Excel
After your analysis, you’ll often want to share the results in Excel. Here’s how:
df.to_excel('audit_results.xlsx', index=False)
How to Combine Datasets? (Merge)
Now for something more advanced. Sometimes, data is spread across multiple tables, and you need to merge them. Pandas has the merge function, similar to VLOOKUP in Excel, but much more powerful.
Suppose you have one dataset with employees and another with their departments:
# First dataset: employees
employees = pd.DataFrame({
'ID': [1, 2, 3],
'Name': ['Alice', 'Bob', 'Charlie']
})
# Second dataset: departments
departments = pd.DataFrame({
'ID': [1, 2, 3],
'Department': ['Finance', 'IT', 'Audit']
})
# Merge the datasets on the ID column
result = pd.merge(employees, departments, on='ID')
print(result)
Output:
ID Name Department
0 1 Alice Finance
1 2 Bob IT
2 3 Charlie Audit
With pd.merge(), you can easily combine datasets based on a common column, such as an ID number. This saves a lot of time when analyzing audit data! If you need to match two differently named columns, use the left_on and right_on arguments:
result = pd.merge(employees, departments, left_on='Employee_ID', right_on='EM_ID')
Conclusion
Well, that was a super-fast Pandas crash course! But I hope you now see how useful it is. Want to learn more? Start with small datasets and try running your own analyses. You’ll soon discover that Pandas not only makes work easier but also much more enjoyable!