I recently came across a data source that used multi-sheets within an Excel file. My dashboard cannot read a multi-sheet Excel file so I needed to combine them into one sheet.
The file is being uploaded into an S3 bucket and then needs to move through the data lake to be read into the dashboard. The final version of this script will be a Lambda function that is triggered on upload of the file, concatenate the sheets, and then place a new file into the next layer of the data lake.
Using Pandas you can easily accomplish this task. One issue I did run into is that Pandas no longer will read XLSX files so I did have to convert it down into an XLS file which is easily done through Excel. In the future this will also have to be done programmatically. Let’s get into the code.
import pandas as pd
workbook = pd.ExcelFile('Yourfile.XLS')
sheets = ['create', 'a', 'list']
dataframe = []import pandas as pd
workbook = pd.ExcelFile('file.xls')
sheets = ['create', 'a', 'list']
dataframe = []
for sheet in sheets:
df = pd.read_excel(workbook, sheet_name=sheet, skiprows=[list of rows to skip], skipfooter=number_of_rows_to_skip_from_bottom)
df.columns = ['list', 'of', 'column', 'headers']
dataframe.append(df)
df = pd.concat(dataframe)
df.to_excel("output.xls", index=False)
To start we are going to import the Pandas library and then read in our Excel file. In the future revision of this script I will be reading in the file from S3 through the Lambda event so this will need to change.
The “sheets” variable is a list of sheets that you want the script to look at. You can remove this if you want it to look at all the sheets. My file had a few sheets that could be ignored. We will also create an empty list called “dataframe”. This empty list will be used to store each of the sheets that we want to concatenate. In the production version of this script there is some modifications that need to be done on each sheet. I accomplished this by adding in “if/then” statements based on the sheet name.
At the end of the “for” loop we will append the data frame into our empty list. Once all the sheets have been added, we will use Pandas to concatenate the objects and output the file. You can specify your output file name. I also included the “index=false” which removes the first column of index numbers. This is not needed for my project.
So there you have it, a simple Python script to concatenate a multi-sheet Excel file. If this script helps you please share it with your network!
Leave a Reply