Tag: pandas

  • Pandas & NumPy with AWS Lambda

    Fun fact: Pandas and NumPy don’t work out of the box with Lambda. The libraries that you might download from your development machine probably won’t work either.

    The standard Lambda Python environment is very barebones by default. There is no point in loading in a bunch of libraries if they aren’t needed. This is why we package our Lambda functions into ZIP files to be deployed.

    My first time attempting to use Pandas on AWS Lambda was in regards to concatenating Excel files. The point of this was to take a multi-sheet Excel file and combine it into one sheet for ingestion into a data lake. To accomplish this I used the Pandas library to build the new sheet. In order to automate the process I setup an S3 trigger on a Lambda function to execute the script every time a file was uploaded.

    And then I ran into this error:

    [ERROR] Runtime.ImportModuleError: Unable to import module 'your_module':
    IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
    Importing the numpy c-extensions failed.

    I had clearly added the NumPy library into my ZIP file:

    So what was the problem? Well, apparently, the version of NumPy that I downloaded on both my Macbook and my Windows desktop is not compatible with Amazon Linux.

    To resolve this issue, I first attempted to download the package files manually from PyPi.org. I grabbed the latest “manylinux1_x86_x64.whl” file for both NumPy and Pandas. I put them back into my ZIP file and re-uploaded the file. This resulted in the same error.

    THE FIX THAT WORKED:

    The way to get this to work without failure is to spin up an Amazon Linux EC2 instance. Yes this seems excessive and it is. Not only did I have to spin up a new instance I had to install Python 3.8 because Amazon Linux ships with Python 2.7 by default. But, once installed you can use Pip to install the libraries to a directory by doing:

    pip3 install -t . <package name>

    This is useful for getting the libraries in the same location to ZIP back up for use. You can remove a lot of the files that are not needed by running:

    rm -r *.dist-info __pycache__

    After you have done the cleanup, you can ZIP up the files and move them back to your development machine, add your Lambda function and, upload to the Lambda console.

    Run a test! It should work as you intended now!

    If you need help with this please reach out to me on social media or leave a comment below.

  • Concatenating Multi-Sheet Excel Files with Python

    I recently came across a data source that used multi-sheets within an Excel file. My dashboard cannot read a multi-sheet Excel file so I needed to combine them into one sheet.

    The file is being uploaded into an S3 bucket and then needs to move through the data lake to be read into the dashboard. The final version of this script will be a Lambda function that is triggered on upload of the file, concatenate the sheets, and then place a new file into the next layer of the data lake.

    Using Pandas you can easily accomplish this task. One issue I did run into is that Pandas no longer will read XLSX files so I did have to convert it down into an XLS file which is easily done through Excel. In the future this will also have to be done programmatically. Let’s get into the code.

    import pandas as pd
    
    workbook = pd.ExcelFile('Yourfile.XLS')
    sheets = ['create', 'a', 'list']
    dataframe = []import pandas as pd
    
    workbook = pd.ExcelFile('file.xls')
    sheets = ['create', 'a', 'list']
    dataframe = []
    
    for sheet in sheets:
        df = pd.read_excel(workbook, sheet_name=sheet, skiprows=[list of rows to skip], skipfooter=number_of_rows_to_skip_from_bottom)
        df.columns = ['list', 'of', 'column', 'headers']
        dataframe.append(df)
    df = pd.concat(dataframe)
    df.to_excel("output.xls", index=False)
    

    To start we are going to import the Pandas library and then read in our Excel file. In the future revision of this script I will be reading in the file from S3 through the Lambda event so this will need to change.

    The “sheets” variable is a list of sheets that you want the script to look at. You can remove this if you want it to look at all the sheets. My file had a few sheets that could be ignored. We will also create an empty list called “dataframe”. This empty list will be used to store each of the sheets that we want to concatenate. In the production version of this script there is some modifications that need to be done on each sheet. I accomplished this by adding in “if/then” statements based on the sheet name.

    At the end of the “for” loop we will append the data frame into our empty list. Once all the sheets have been added, we will use Pandas to concatenate the objects and output the file. You can specify your output file name. I also included the “index=false” which removes the first column of index numbers. This is not needed for my project.

    So there you have it, a simple Python script to concatenate a multi-sheet Excel file. If this script helps you please share it with your network!