Category: Technology

  • Pandas & NumPy with AWS Lambda

    Fun fact: Pandas and NumPy don’t work out of the box with Lambda. The copies of the libraries that you install on your development machine probably won’t work either.

    The standard Lambda Python environment is very barebones by default. There is no point in loading in a bunch of libraries if they aren’t needed. This is why we package our Lambda functions into ZIP files to be deployed.

    My first attempt at using Pandas on AWS Lambda involved concatenating Excel files. The goal was to take a multi-sheet Excel file and combine it into one sheet for ingestion into a data lake. To accomplish this I used the Pandas library to build the new sheet. To automate the process I set up an S3 trigger on a Lambda function to execute the script every time a file was uploaded.

    And then I ran into this error:

    [ERROR] Runtime.ImportModuleError: Unable to import module 'your_module':
    IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
    Importing the numpy c-extensions failed.

    I had clearly added the NumPy library to my ZIP file.

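    So what was the problem? Well, apparently, the builds of NumPy that I downloaded on both my MacBook and my Windows desktop are not compatible with Amazon Linux. NumPy ships compiled C extensions, so a copy built for macOS or Windows cannot be imported on Lambda’s Linux runtime.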

    To resolve this issue, I first attempted to download the package files manually from PyPI.org. I grabbed the latest “manylinux1_x86_64.whl” file for both NumPy and Pandas, put them into my ZIP file, and re-uploaded it. This resulted in the same error.

    THE FIX THAT WORKED:

    The way to get this to work without failure is to spin up an Amazon Linux EC2 instance. Yes, this seems excessive, and it is. Not only did I have to spin up a new instance, I also had to install Python 3.8 because Amazon Linux ships with Python 2.7 by default. Once Python is installed, you can use pip to install the libraries into the current directory by running:

    pip3 install -t . <package name>

    This puts the libraries in a single location that is ready to ZIP back up. You can remove a lot of the files that are not needed by running:

    rm -r *.dist-info __pycache__

    After you have done the cleanup, you can ZIP up the files, move them back to your development machine, add your Lambda function, and upload the result to the Lambda console.
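
    If you would rather script the cleanup and ZIP step than do it by hand, here is a minimal sketch using only the Python standard library. The function name build_lambda_zip and the skip lists are my own illustration, not part of the original workflow; it assumes the libraries and your Lambda function file already sit together in one directory.

```python
import zipfile
from pathlib import Path

# Directory names and suffixes left out of the archive
# (mirrors the manual "rm -r *.dist-info __pycache__" cleanup).
SKIP_DIR_NAMES = {"__pycache__"}
SKIP_DIR_SUFFIXES = (".dist-info",)


def build_lambda_zip(package_dir, zip_path):
    """Zip every file under package_dir, with paths relative to the archive root."""
    package_dir = Path(package_dir)
    zip_path = Path(zip_path)
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(package_dir.rglob("*")):
            # Skip directories and the archive itself if it lives in package_dir
            if not path.is_file() or path.resolve() == zip_path.resolve():
                continue
            relative = path.relative_to(package_dir)
            parent_dirs = relative.parts[:-1]
            if any(
                name in SKIP_DIR_NAMES or name.endswith(SKIP_DIR_SUFFIXES)
                for name in parent_dirs
            ):
                continue
            archive.write(path, relative.as_posix())
            added.append(relative.as_posix())
    return added
```

    Lambda expects the libraries at the root of the archive, which is why the paths are stored relative to the package directory.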

    Run a test! It should work as you intended now!

    If you need help with this please reach out to me on social media or leave a comment below.

  • A File Management Architecture

    This post is a continuation of my article: “A File Extraction Project”. This project has been a great learning experience for both frontend and backend application architecture and design. Below you will find a diagram and an explanation of all the pieces that make this work.

    1. The entire architecture is powered by Flask on an EC2 instance. When I move this project to production I intend to put an application load balancer in front to manage traffic. The frontend is also secured by Google Authentication. This authenticates against the user’s existing G Suite deployment so that only individuals within the organization can access the application.
    2. The first Lambda function processes the uploads. I allow the customer to upload as many files as needed. The form also includes a single text field for specifying the value of the object tag. The function sends the objects into the first bucket, which is item #4.
    3. The second Lambda function is the search functionality. This function allows the user to provide a tag value. The function queries all objects in bucket #4 and creates a list of objects that match the query. It then moves the objects to bucket #5, where it packages them up and presents them to the user in the form of a ZIP file.
    4. The first bucket is the storage for all of the objects. This is where the first Lambda function uploads every object. It is not publicly accessible.
    5. The second bucket is a temporary storage for files requested by the user. Objects are moved into this bucket from the first bucket. This bucket has a deletion policy that only allows objects to live inside it for 24 hours.
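
    The 24 hour deletion policy on the second bucket can be expressed as an S3 lifecycle rule. The sketch below is an assumption about how such a rule could be set up with Boto3, not the configuration I actually deployed; the function names and the rule ID are made up for illustration. S3 lifecycle expiration works in whole days and runs asynchronously, so one day is the closest match to 24 hours.

```python
def build_expiration_rule(days=1, rule_id="expire-temporary-downloads"):
    """Return a lifecycle configuration that expires every object after `days` days."""
    return {
        "Rules": [
            {
                "ID": rule_id,              # hypothetical rule name
                "Filter": {"Prefix": ""},   # empty prefix applies to all objects
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def apply_expiration_rule(s3_client, bucket_name, days=1):
    """Attach the rule to a bucket using a Boto3 S3 client passed in by the caller."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket_name,
        LifecycleConfiguration=build_expiration_rule(days),
    )
```

    With Boto3 installed this would be called as apply_expiration_rule(boto3.client("s3"), "your-temporary-bucket").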

    Lambda Function for File Uploading:

    from flask import request
    from werkzeug.utils import secure_filename

    # "s3" (a boto3 S3 client) and BUCKET_NAME are defined elsewhere in the Flask app
    def upload():
        if request.method == 'POST':
            tag = request.form['tag']
            files = request.files.getlist('file')
            for file in files:
                if file:
                    filename = secure_filename(file.filename)
                    # Save locally first so upload_file can read the file from disk
                    file.save(filename)
                    s3.upload_file(
                        Filename=filename,
                        Bucket=BUCKET_NAME,
                        Key=filename
                    )

                    s3.put_object_tagging(
                        Bucket=BUCKET_NAME,
                        Key=filename,
                        Tagging={
                            'TagSet': [
                                {
                                    'Key': 'Tag1',
                                    'Value': tag
                                },
                                {
                                    'Key': 'Tag2',
                                    'Value': 'Tag-value'
                                },
                            ]
                        },
                    )
            msg = "Upload Done ! "

    The function lives within the Flask application. I have AWS permissions set up on my EC2 instance to allow the upload (“put_object”) and tagging calls. You can assign tags as needed. The first tag references the “tag” variable, which is provided by the form submission.

    For Google Authentication I utilized a project I found on GitHub here. In the “auth” route that it creates, I added a check against the “hd” (hosted domain) value that Google passes back for the user. You can see how this works here:

    @app.route('/auth')
    def auth():
        token = oauth.google.authorize_access_token()
        user = oauth.google.parse_id_token(token)
        # Check the hosted domain before storing anything in the session
        if user.get('hd') != 'Your hosted domain':
            abort(403)
        session['user'] = user
        return redirect('/')

    If the “hd” parameter is missing or does not match your hosted domain, the function aborts with a “403” error.

    If you are interested in this project and want more information feel free to reach out and I can provide more code examples or package up the project for you to deploy on your own!

    If you found this article helpful please share it with your friends.

  • A File Extraction Project

    I had a client approach me regarding a set of files they had. The files were a set of certificates to support their products. They deliver these files to customers in the sales process.

    The workflow currently involves manually packaging the files up into a deliverable format. The client asked me to automate this process across their thousands of documents.

    As I started thinking through how this would work, I decided on a serverless approach: Amazon S3 for document storage, Lambda for the processing, and Amazon S3 with CloudFront for the application’s front end.

    My current architecture involves two S3 buckets. One bucket to store the original PDF documents and one to pull in the documents that we are going to package up for the client before sending.

    The idea is that we can tag each PDF file with its appropriate lot number supplied by the client. I will then use a simple form submission process to supply input into the function that will collect the required documents.

    Here is the code for the web frontend:

    <!DOCTYPE html>
    <html>
    <head>
        <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.2.4/jquery.min.js"></script>
        <script type="text/javascript">
            $(document).ready(function() {
    
                $("#submit").click(function(e) {
                    e.preventDefault();
    
                    var lot = $("#lot").val();
    
                    $.ajax({
                        type: "POST",
                        url: 'API_URLHERE',
                        contentType: 'application/json',
                        data: JSON.stringify({
                            'body': lot,
                        }),
                        success: function(res){
                            $('#form-response').text('Query Was processed.');
                        },
                        error: function(){
                            $('#form-response').text('Error.');
                        }
                    });
    
                })
    
            });
        </script>
    </head>
    <body>
    <form>
        <label for="lot">Lot</label>
        <input id="lot">
        <button id="submit">Submit</button>
    </form>
    <div id="form-response"></div>
    </body>
    </html>

    This is a single field input form that sends a string to my Lambda function. Once the string is received we will convert it into a JSON object and then use that to find our objects within Amazon S3.

    Here is the function:

    import json

    import boto3

    s3 = boto3.client('s3')
    bucket = "source_bucket"
    destBucket = "destination_bucket"


    def lambda_handler(event, context):
        # The form posts {"body": "<lot number>"} as a JSON string
        tag_list = json.loads(event['body'])
        print(tag_list)
        tag_we_want = tag_list['body']

        download_list = []
        # Paginate so that buckets with more than 1,000 objects are fully listed
        paginator = s3.get_paginator('list_objects_v2')
        for page in paginator.paginate(Bucket=bucket):
            for obj in page.get('Contents', []):
                key = obj['Key']
                tag_set = s3.get_object_tagging(Bucket=bucket, Key=key)['TagSet']
                # Only the first tag is compared; untagged objects are skipped
                if not tag_set or tag_set[0]['Value'] != tag_we_want:
                    continue
                # Copy each matching object into the destination bucket
                s3.copy_object(
                    Bucket=destBucket,
                    CopySource={
                        'Bucket': bucket,
                        'Key': key,
                    },
                    Key=key,
                )
                download_list.append(key)

        return download_list, tag_we_want

    In this code, we define our source and destination buckets first. With the string from the form submission, we first gather all the objects within the bucket and then iterate over each object to find matching tags.

    Once we gather the files we want for our customers we then transfer these files to a new bucket. I return the list of files out of the function as well as the tag name.

    My next step is to package all the files required into a ZIP file for downloading. I first attempted to do this in Lambda but quickly ran into the fact that the file system is read only apart from a limited amount of temporary space in /tmp.

    Right now, I am thinking of utilizing Docker to spawn a worker which will generate the ZIP file, place it back into the bucket and provide a time-sensitive download link to the client.
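
    To make the packaging step concrete, here is a rough sketch of building the ZIP in memory and handing back a time-sensitive link. This is not the Docker worker described above and is not code from the project; the function name package_objects, the one hour expiry, and the choice to hold everything in memory are all assumptions for illustration. It takes a Boto3 S3 client as an argument and only uses get_object, put_object, and generate_presigned_url.

```python
import io
import zipfile


def package_objects(s3_client, bucket, keys, zip_key, expires_in=3600):
    """Zip the given objects in memory, upload the archive, return a presigned URL."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for key in keys:
            # Read each object fully into memory; fine for small PDFs only
            body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
            archive.writestr(key, body)

    s3_client.put_object(Bucket=bucket, Key=zip_key, Body=buffer.getvalue())

    # The link stops working after expires_in seconds
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": zip_key},
        ExpiresIn=expires_in,
    )
```

    Because the whole archive lives in memory, this only suits small batches of documents; larger sets would still call for a worker with real disk space.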

    Stay tuned for more updates on this project.

  • A Self Hosted Server Health Check

    I’m not big on creating dashboards. I find that I don’t look at them enough to warrant hosting the software on an instance and having to have the browser open to the page all the time.

    Instead, I prefer to be alerted via Slack as much as possible. I wrote scripts to collect DNS records from Route53. I decided that I should expand on the idea and create a scheduled job that would execute at a time interval. This way my health checks are fully automated.

    Before we get into the script, you might ask me why I don’t just use Route53 health checks! The answer is fairly simple. First, the cost of health checks for HTTPS doesn’t make sense for the number of web servers that I am testing. Second, I don’t want to test Route53 or any AWS resource from within AWS. Rather, I would like to use my own network to test as it is not connected to AWS.

    You can find the code and the Lambda function hosted on GitHub. The overall program utilizes a few different AWS products:

    • Lambda
    • SNS
    • CloudWatch Logs

    It also uses Slack but that is an optional piece that I will explain. The main functions reside in “main.py”. This piece of code follows the process of:

    1. Iterating over Route53 Records
    2. Filtering for “A” records and compiling a list of domains
    3. Testing each domain and processing the response code
    4. Logging all of the results to CloudWatch Logs
    5. Sending errors to the SNS topic

    I have the script running as a cron job every hour.

    The second piece of this is the Lambda function. The function is all packaged in the “lambda_function.zip”, but I also added the function outside of the ZIP file for editing. You can modify this function to utilize your Slack credentials.

    The Lambda function is subscribed to your SNS topic so that whenever a new message appears, that message is sent to your specified Slack channel.

    I have plans to test my Terraform skills to automate the deployment of the Lambda function, SNS topic, CloudWatch Logs, and the primary script in some form.

    If you have any comments on how I could improve this function please post a comment here or raise an issue on GitHub. If you find this script helpful in any way feel free to share it with your friends!

    Links:
    Server Health Check – GitHub

    Code – Main Function (main.py)

    import time

    import boto3
    import requests


    # AWS variables
    sns = boto3.client('sns')
    aws = boto3.client('route53')
    cw = boto3.client('logs')
    paginator = aws.get_paginator('list_resource_record_sets')
    response = aws.list_hosted_zones()
    hosted_zones = response['HostedZones']

    # Status codes that are logged but do not trigger an alert
    QUIET_CODES = {200, 301, 302, 401, 403}

    # Create empty lists
    zone_id_to_test = []
    dns_entries = []
    zones_with_a_record = []

    # Create list of zone IDs to get record sets from
    for key in hosted_zones:
        # Strip the leading "/hostedzone/" (12 characters) from the ID
        zone_id_to_test.append(key['Id'][12:])


    # Collect the record sets of every zone
    def getARecord(zoneid):
        for zone in zoneid:
            try:
                for record_set in paginator.paginate(HostedZoneId=zone):
                    dns_entries.append(record_set['ResourceRecordSets'])
            except Exception as error:
                print('An Error')
                print(str(error))
                raise


    # Build the list of URLs to test from the "A" records
    def getCNAME(entry):
        for dns_entry in entry:
            for record in dns_entry:
                if record['Type'] == 'A':
                    # Drop the trailing dot from the record name
                    zones_with_a_record.append(f"https://{record['Name'][:-1]}")


    # Send result to SNS
    def sendToSNS(message):
        try:
            sns.publish(
                TargetArn='YOUR_SNS_TOPIC_ARN_HERE',
                Message=message,
            )
        except Exception as error:
            print(f"something didn't work: {error}")


    def tester(urls):
        user_agent = {'User-agent': 'Mozilla/5.0'}
        for url in urls:
            try:
                # The 10 second timeout is an arbitrary choice so a dead host cannot hang the run
                status = requests.get(url, headers=user_agent, allow_redirects=True, timeout=10)
            except requests.RequestException:
                # No response at all, so there is no status code to report
                sendToSNS(f"The site {url} failed testing")
                writeLog(f"The site {url} failed testing")
                continue
            code = status.status_code
            if code not in QUIET_CODES:
                sendToSNS(f"The site {url} reports: {code}")
            writeLog(f"The site {url} reports status code: {code}")


    def writeLog(message):
        # CloudWatch Logs no longer requires a sequence token for put_log_events
        cw.put_log_events(
            logGroupName='YOUR_LOG_GROUP_NAME',
            logStreamName='YOUR_LOG_STREAM_NAME',
            logEvents=[
                {
                    'timestamp': int(round(time.time() * 1000)),
                    'message': message
                },
            ],
        )


    # Execute
    getARecord(zone_id_to_test)
    getCNAME(dns_entries)
    tester(zones_with_a_record)

    
    

    Code: Lambda Function (lambda_function.py)

    import logging
    logging.basicConfig(level=logging.DEBUG)
    
    import os
    from slack import WebClient
    from slack.errors import SlackApiError
    
    
    slack_token = os.environ["slackBot"]
    client = WebClient(token=slack_token)
    
    def lambda_handler(event, context):
        detail = event['Records'][0]['Sns']['Message']
        response_string = f"{detail}"
        try:
            response = client.chat_postMessage(
                channel="YOUR CHANNEL HERE",
                text="SERVER DOWN",
                blocks = [{"type": "section", "text": {"type": "plain_text", "text": response_string}}]
            )   
    
        except SlackApiError as e:
            assert e.response["error"]
        return

  • Where Is It 5 O’Clock Pt: 4

    As much as I’ve scratched my head working on this project, it has been fun to learn some new things and build something that isn’t infrastructure automation. I’ve learned some frontend web development and some backend development, and utilized some new Amazon Web Services products.

    With all that nice stuff said I’m proud to announce that I have built a fully functioning project that is finally working the way I intended it. You can visit the website here:

    www.whereisitfiveoclock.net

    To recap, I bought this domain one night as a joke and thought “Hey, maybe one day I’ll build something”. I started off building a fully Python application backed by Flask. You can read about that in Part 1. This did not work out the way I intended as it did not refresh the timezones on page load. In Part 3 I discussed how I was rearchitecting the project to include an API that would be called upon page load.

    The API worked great and delivered two JSON objects into my frontend. I then parsed the two JSON objects into two separate tables that display where you can be drinking and where you probably shouldn’t be drinking.

    This is a snippet of the JavaScript I wrote to iterate over the JSON objects while adding them into the appropriate table:

    function buildTable(someinfo) {
        var table1 = document.getElementById('its5pmsomewhere');
        var table2 = document.getElementById('itsnot5here');
        var its5_json = JSON.parse(someinfo[0]);
        var not5_json = JSON.parse(someinfo[1]);

        // Timezones where it is 5PM go in the first column of table1
        its5_json['its5'].forEach((value) => {
            var row = `<tr>
                        <td>${value}</td>
                        <td></td>
                        </tr>`;
            table1.innerHTML += row;
        });

        // All other timezones go in the second column of table2
        not5_json['not5'].forEach((value) => {
            var row = `<tr>
                        <td></td>
                        <td>${value}</td>
                        </tr>`;
            table2.innerHTML += row;
        });
    }

    First I reference two different HTML tables. I then parse the JSON from the API. Finally, I iterate over both JSON objects and append each timezone as a new row in the appropriate HTML table.

    If you want more information on how I did this feel free to reach out.

    I want to continue iterating over this application to add new features. I need to do some standard things like adding Google Analytics so I can track traffic. I also want to add a search feature and a map that displays the different areas of drinking acceptability.

    I also am open to requests. One of my friends suggested that I add a countdown timer to each location that it is not yet acceptable to be drinking.

    Feel free to reach out in the comments or on your favorite social media platform! And as always, if you liked this project please share it with your friends.

  • Where Is It Five O’Clock Pt: 3

    So I left this project at a point where I felt it needed to be re-architected based on the fact that Flask only executes the function once and not every time the page loads.

    I re-architected the application in my head to include an API that calls the Lambda function and returns a list of places where it is and is not acceptable to be drinking based on the 5 O’Clock rules. These two lists will be JSON objects that have a single key with multiple values. The values will be the timezones appropriate to be drinking in.
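
    As a rough illustration of the two JSON objects, here is a minimal sketch of how they could be built from a mapping of timezone name to current hour. The key names “its5” and “not5” match what the frontend in Part 4 reads, and the hour >= 17 rule comes from the find5PM function in Part 1. The function name build_payload and the two-element list shape are assumptions for illustration rather than the code running behind the API.

```python
import json


def build_payload(hours_by_timezone):
    """Split timezones into two JSON strings, each holding one key with many values."""
    its5 = sorted(tz for tz, hour in hours_by_timezone.items() if hour >= 17)
    not5 = sorted(tz for tz, hour in hours_by_timezone.items() if hour < 17)
    return [json.dumps({"its5": its5}), json.dumps({"not5": not5})]


if __name__ == "__main__":
    # Made-up sample hours, only to show the shape of the result
    sample = {"America/New_York": 17, "Europe/London": 22, "Asia/Tokyo": 6}
    for item in build_payload(sample):
        print(item)
```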

    After the JSON objects are generated I can reference them through the web frontend and display them in an appropriate way.

    At this point I have the API built out and fully functioning the way I think I want it. You can use it by executing the following:
    curl https://5xztnem7v4.execute-api.us-west-2.amazonaws.com/whereisit5

    I will probably only have this publicly accessible for a few days before locking it back down.

    Hopefully, in part 4 of this series, I will have a frontend demo to show!

  • Where Is It Five O’Clock Pt: 1

    I bought the domain whereisitfiveoclock.net a while back and have been sitting on it for quite some time. I had an idea to make a web application that would tell you where it is five o’clock. Yes, this is a drinking website.

    I saw this project as a way to learn more Python skills, as well as some more AWS skills, and boy, has it put me to the test. So I’m going to write this series of posts as a way to document my progress in building this application.

    Part One: Building The Application

    I know that I want to use Python because it is my language of choice. I then researched what libraries I could use to build the frontend with. I came across Flask as an option and decided to run with that. The next step I had to do was actually find out where it was 5PM.

    In my head, I came up with the process: if I could first get a list of all the timezones and identify the current time in each, I could filter for the timezones where it was 5PM. Once I had established where it was 5PM, I could then get that information to Flask and figure out a way to display it.

    Here is the function for identifying the current time in all timezones and then storing each key pair of {Timezone : Current_Time }:

    from datetime import datetime

    import pytz
    from pytz import timezone

    def getTime():
        now_utc = datetime.now(timezone('UTC'))
        timezones = pytz.all_timezones
        # get the current hour in every timezone and store it in a list
        tz_array = []
        for tz in timezones:
            current_time = now_utc.astimezone(timezone(tz))
            values = {tz: current_time.hour}
            tz_array.append(values)

        return tz_array

    Once everything was stored into tz_array I took that info and passed it through the following function to identify where it was 5PM. I have another function that identifies everything that is NOT 5PM.

    def find5PM(tz_array):
        its5pm = []
        for tz in tz_array:
            # each entry is a single {timezone name: hour} pair
            for tz_name, hour in tz.items():
                # hour 17 or later counts as 5PM having arrived
                if hour >= 17:
                    its5pm.append(tz_name)
        return its5pm

    I made a new list and stored just the timezone name into that list and return it.

    Once I had all these together I passed them through as variables to Flask. This is where I first started to struggle. In my original revisions of the functions, I was only returning one of the values rather than returning ALL of the values. This resulted in hours of struggling to identify the cause of the problem. Eventually, I had to start over and completely re-work the code until I ended up with what you see above.

    The code was finally functional and I was ready to deploy it to Amazon Web Services for public access. I will discuss my design and deployment in Part Two.

    http://whereisitfiveoclock.net

  • Automatically Transcribing Audio Files with Amazon Web Services

    I wrote this Lambda function to automatically transcribe audio files that are uploaded to an S3 bucket. This is written in Python3 and utilizes the Boto3 library.

    You will need to give your Lambda function permissions to access S3, Transcribe and CloudWatch.

    The script will create an AWS Transcribe job with the format: 'filetranscription'+YYYYMMDD-HHMMSS
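
    Since the code itself sits behind the link, here is a minimal sketch of what such a function can look like. It is an illustration under assumptions, not the linked script: the helper names, the “en-US” language code, and passing the Transcribe client in as an argument are all my own choices for this example. The job name follows the format described above.

```python
from datetime import datetime, timezone
from urllib.parse import unquote_plus


def build_job_name(now):
    """Return 'filetranscription' followed by YYYYMMDD-HHMMSS."""
    return "filetranscription" + now.strftime("%Y%m%d-%H%M%S")


def start_job_for_upload(event, transcribe_client, now=None, language_code="en-US"):
    """Start a Transcribe job for the first object in an S3 upload event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications
    key = unquote_plus(record["object"]["key"])
    job_name = build_job_name(now or datetime.now(timezone.utc))
    transcribe_client.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": f"s3://{bucket}/{key}"},
        LanguageCode=language_code,  # assumed; set to match your audio
    )
    return job_name
```

    Inside a Lambda handler this would be called as start_job_for_upload(event, boto3.client("transcribe")).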

    I will be iterating over the script to hopefully add in a web front end as well as potentially branching to do voice call transcriptions for phone calls and Amazon Connect.

    You can view the code here

    If you have questions or comments feel free to reach out to me here or on any Social Media.

  • Slack’s New Nebula Network Overlay

    I was turned on to this new tool that the Slack team had built. As an avid Slack user, I was immediately intrigued to test this out.

    My use case is going to be relatively simple for the sake of this post. I am going to create a Lighthouse, or parent node, in an EC2 instance in my Amazon Web Services account. It will have an elastic IP so we can route traffic to it publicly. I also will need to create a security group to allow traffic to port 4242 UDP. I will also allow this port inbound on my local firewall.

    Clone the Git repository for Nebula and also download the binaries. I put everything into /etc/nebula.

    Once you have all of the files downloaded you can generate your certificate authority by running the command:

    ./nebula-cert ca -name "Your Company"

    You will want to make a backup of the ca.key and ca.crt files generated by this command.

    Once you have your certificate authority you can create certificates for your hosts. In my case I am only generating one for my local server. The following command will generate the certificate and keys:

    ./nebula-cert sign -name "Something Memorable" -ip "192.168.100.2/24"

    Where it says “Something Memorable” I placed the hostname of the server I am using so that I remember. One thing that the documentation doesn’t go over is assigning the IP for your Lighthouse. Because I recognize the Lighthouse as more of a gateway I assigned it to 192.168.100.1 in the config file. This will be covered soon.

    There is a pre-generated configuration file located here. I simply copied this into a file inside of /etc/nebula/

    Edit the file as needed. Lines 7-9 will need to be modified for each host as each host will have its own certificate.

    Line 20 will need to be the IP address of your Lighthouse and this will remain the same on every host. On line 26 you will need to change this to true for your Lighthouse. On all other hosts, this will remain false.

    The other major thing I changed was to allow SSH traffic. There is an entire section about SSH in the configuration that I ignored, and I simply added the firewall rule to the bottom of the file as follows:

    - port: 22
      proto: tcp
      host: any

    This code is added below the 443 rule for HTTPS. Be sure to follow normal YAML notation practices.

    Once this is all in place you can execute your Nebula network by using the following command:

    /etc/nebula/nebula -config /etc/nebula/config.yml

    Execute your Lighthouse first and ensure it is up and running. Once it is running on your Lighthouse you can run it on your host and you should see a connection handshake. Test by pinging your Lighthouse from your host and from your Lighthouse to your host. I also tested file transfer as well using SCP. This verifies SSH connectivity.

    Now, the most important thing that Slack doesn’t discuss is creating a systemd unit file for automatic startup. So I have included a basic one for you here:

    [Unit]
    Description=Nebula Service

    [Service]
    Restart=always
    RestartSec=1
    User=root
    ExecStart=/etc/nebula/nebula -config /etc/nebula/config.yml

    [Install]
    WantedBy=multi-user.target

    That’s it! I would love to hear about your implementations in the comments below!

  • Discovering DHCP Servers with NMAP

    I was working at a client site where a device would constantly receive a new IP address via DHCP nearly every second. It was the only device on the network that had this issue, but I decided to test for rogue DHCP servers. If someone knows of a GUI tool to do this let me know in the comments. I utilized the command line utility Nmap to scan the network.

    sudo nmap --script broadcast-dhcp-discover

    The output should look something like:

    Starting Nmap 7.70 ( https://nmap.org ) at 2019-11-25 15:52 EST
    Pre-scan script results:
    | broadcast-dhcp-discover:
    | Response 1 of 1:
    | IP Offered: 172.20.1.82
    | DHCP Message Type: DHCPOFFER
    | Server Identifier: 172.20.1.2
    | IP Address Lease Time: 7d00h00m00s
    | Subnet Mask: 255.255.255.0
    | Time Offset: 4294949296
    | Router: 172.20.1.2
    | Domain Name Server: 8.8.8.8
    | Renewal Time Value: 3d12h00m00s
    |_ Rebinding Time Value: 6d03h00m00s

    This was the test that ran on my local network verifying only one DHCP server. If there were multiple, we would see another response.
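
    If you want to check the output automatically instead of reading it by eye, a small sketch like the one below can pull out the servers that answered. It assumes the output format shown above, and the function name dhcp_servers is just for illustration; more than one address in the result would be worth a closer look.

```python
import re

# Matches lines such as "| Server Identifier: 172.20.1.2"
SERVER_ID = re.compile(r"Server Identifier:\s*(\S+)")


def dhcp_servers(nmap_output):
    """Return the distinct DHCP server addresses found in the script output."""
    return sorted(set(SERVER_ID.findall(nmap_output)))


if __name__ == "__main__":
    sample = """| broadcast-dhcp-discover:
| Response 1 of 1:
| IP Offered: 172.20.1.82
| Server Identifier: 172.20.1.2
"""
    print(dhcp_servers(sample))
```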

    Ultimately this was not the issue at my client site, but this is an Nmap script that I had not used before.

    Let me know your experiences with rogue DHCP in the comments!