Category: Linux
Deleting many files from the Linux Command Line
I’ll admit that this post is more for me than any of my readers. I have this command that is buried in my notes and always takes me forever to dig back out. I figured I’d publish it on my blog so that I would maybe commit it to memory.
Let’s say that you have a directory with so many files that a simple “rm *” fails with “Argument list too long” (the shell expands the glob into more arguments than the system allows). I’ve encountered this with many WordPress logging plugins that don’t have log purging set up.
Enter this simple Linux command:
find <path> -type f -exec rm '{}' \;
What this will do is find every file under your path and run rm on each one individually, so it never hits the argument-list limit. You can modify this command with a bunch of other flags, for example:
find <path> -type f -mtime +30 -exec rm '{}' \;
Which will only delete files that haven’t been modified in the last 30 days.
I’m sure there are many other flags and conditions you could check to create an even more fine-grained delete script but this has been useful for me!
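For example, a couple of variants I find handy (the “*.log” pattern and the path are just placeholders, and -delete assumes GNU find):

find <path> -type f -name "*.log" -mtime +30 -exec rm '{}' \;
find <path> -type f -name "*.log" -mtime +30 -delete

The second form lets find do the deleting itself instead of spawning an rm process for every file.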
If this helps you, please share this with your friends!
Setting the Starting Directory for Windows Subsystem for Linux
I use Windows Subsystem for Linux almost every day. I run Ubuntu 20.04 for almost all of my development work. I recently re-installed Windows because I upgraded my PC after many years. One thing that has always bothered me is that when you launch WSL it doesn’t put you into your Linux user’s home directory, but rather into your Windows home directory. The fix for this is really quite simple.
First, open the settings for Windows Terminal. The settings are stored as a JSON file that opens in your default editor (I use Visual Studio Code). Find the profile section that contains your WSL installation:
Just below the “source” line, add the following:
"startingDirectory": "//wsl$/Ubuntu-20.04/home/<user>",
Replace “Ubuntu-20.04” with your distro name and “<user>” with your username.
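The distro name needs to match what WSL itself reports. If you’re not sure what it is, you can list your installed distros from PowerShell or a command prompt:

wsl -l -v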
Save and exit!
Pandas & NumPy with AWS Lambda
Fun fact: Pandas and NumPy don’t work out of the box with Lambda. The versions of the libraries that you download on your development machine probably won’t work either.
The standard Lambda Python environment is very barebones by default. There is no point in loading in a bunch of libraries if they aren’t needed. This is why we package our Lambda functions into ZIP files to be deployed.
My first time attempting to use Pandas on AWS Lambda involved concatenating Excel files. The point was to take a multi-sheet Excel file and combine it into one sheet for ingestion into a data lake. To accomplish this I used the Pandas library to build the new sheet. To automate the process I set up an S3 trigger on a Lambda function to execute the script every time a file was uploaded.
And then I ran into this error:
[ERROR] Runtime.ImportModuleError: Unable to import module 'your_module': IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE! Importing the numpy c-extensions failed.
I had clearly added the NumPy library to my ZIP file.
So what was the problem? Apparently, the NumPy builds I downloaded on both my MacBook and my Windows desktop contain compiled C extensions for those platforms, and they are not compatible with Amazon Linux.
To resolve this issue, I first attempted to download the package files manually from PyPI. I grabbed the latest “manylinux1_x86_64” wheel for both NumPy and Pandas, put them back into my ZIP file, and re-uploaded it. This resulted in the same error.
THE FIX THAT WORKED:
The way to get this to work without failure is to spin up an Amazon Linux EC2 instance. Yes, this seems excessive, and it is. Not only did I have to spin up a new instance, I also had to install Python 3.8, because Amazon Linux ships with Python 2.7 by default. But once that is installed, you can use pip to install the libraries into a directory:
pip3 install -t . <package name>
This puts the libraries in one location so they can be zipped back up for use. You can remove a lot of the files that are not needed by running:
rm -r *.dist-info __pycache__
After you have done the cleanup, you can ZIP up the files, move the archive back to your development machine, add your Lambda function code, and upload it through the Lambda console.
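As a rough sketch of that packaging step (file names and the function name are placeholders, and the CLI call is just an alternative to uploading through the console):

# from inside the directory where you ran "pip3 install -t ."
zip -r lambda-package.zip .
zip -g lambda-package.zip lambda_function.py   # add your own handler code as well

# optional: push the package straight from the EC2 instance instead of the console
aws lambda update-function-code --function-name my-function --zip-file fileb://lambda-package.zip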
Run a test! It should work as you intended now!
If you need help with this please reach out to me on social media or leave a comment below.
A File Management Architecture
This post is a continuation of my article: “A File Extraction Project”. This project has been a great learning experience for both frontend and backend application architecture and design. Below you will find a diagram and an explanation of all the pieces that make this work.
- The entire architecture is powered by Flask on an EC2 instance. When I move this project to production I intend to put an application load balancer in front to manage traffic. The frontend is also secured by Google authentication. This authenticates against the user’s existing G Suite deployment so that only individuals within the organization can access the application.
- The first Lambda function handles uploads. I am allowing as many files as the customer needs. The form also includes a single text field for specifying the value of the object tag. The function sends the objects into the first bucket, which is #4 in the diagram.
- The second Lambda function provides the search functionality. It allows the user to provide a tag value, queries all objects in bucket #4, and builds a list of objects that match. It then moves the matching objects to bucket #5, where it packages them up and presents them to the user as a ZIP file (a rough sketch of these steps appears later in this post).
- The first bucket is the storage for all of the objects. This is the bucket where all the objects are uploaded to from the first Lambda function. It is not publicly accessible.
- The second bucket is a temporary storage for files requested by the user. Objects are moved into this bucket from the first bucket. This bucket has a deletion policy that only allows objects to live inside it for 24 hours.
Lambda Function for File Uploading:
def upload():
    if request.method == 'POST':
        tag = request.form['tag']
        files = request.files.getlist('file')
        print(files)
        for file in files:
            print(file)
            if file:
                filename = secure_filename(file.filename)
                file.save(filename)
                s3.upload_file(
                    Bucket=BUCKET_NAME,
                    Filename=filename,
                    Key=filename
                )
                s3.put_object_tagging(
                    Bucket=BUCKET_NAME,
                    Key=filename,
                    Tagging={
                        'TagSet': [
                            {'Key': 'Tag1', 'Value': tag},
                            {'Key': 'Tag2', 'Value': 'Tag-value'},
                        ]
                    },
                )
        msg = "Upload Done ! "
The function lives within the Flask application. I have AWS permissions set up on my EC2 instance to allow the S3 put operations. You can assign tags as needed; the first tag references the tag variable provided by the form submission.
For Google authentication I utilized a project I found on GitHub here. In the “auth” route it creates, I modified the code to check the “hd” (hosted domain) parameter that Google returns. You can see how this works here:
@app.route('/auth')
def auth():
    token = oauth.google.authorize_access_token()
    user = oauth.google.parse_id_token(token)
    session['user'] = user
    if "hd" not in user:
        abort(403)
    elif user['hd'] != 'Your hosted domain':
        abort(403)
    else:
        return redirect('/')
If the “hd” parameter is not present, or does not match your hosted domain, the function aborts with a 403 error.
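The second Lambda function (the tag search) isn’t shown here, but the rough sequence it performs can be sketched with the AWS CLI; the bucket names and key below are placeholders:

# list the objects in the primary bucket
aws s3api list-objects-v2 --bucket primary-bucket --query 'Contents[].Key' --output text

# check an object's tags for a matching value
aws s3api get-object-tagging --bucket primary-bucket --key example.pdf

# copy a matching object into the temporary bucket to be zipped up and served
aws s3 cp s3://primary-bucket/example.pdf s3://temp-bucket/example.pdf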
If you are interested in this project and want more information feel free to reach out and I can provide more code examples or package up the project for you to deploy on your own!
If you found this article helpful please share it with your friends.
Slack’s New Nebula Network Overlay
I was turned on to this new tool that the Slack team had built. As an avid Slack user, I was immediately intrigued to test this out.
My use case is going to be relatively simple for the sake of this post. I am going to create a Lighthouse, or parent node, on an EC2 instance in my Amazon Web Services account. It will have an Elastic IP so traffic can be routed to it publicly. I also need a security group that allows inbound traffic on port 4242 UDP, and the same port allowed inbound on my local firewall.
Clone the Git repository for Nebula and download the release binaries. I put everything into
/etc/nebula
Once you have all of the files downloaded you can generate your certificate authority by running the command:
./nebula-cert ca -name "Your Company"
You will want to make a backup of the ca.key and ca.crt files generated by this command.
Once you have your certificate authority you can create certificates for your hosts. In my case I am only generating one for my local server. The following command will generate the certificate and keys:
./nebula-cert sign -name "Something Memorable" -ip "192.168.100.2/24"
Where it says “Something Memorable” I used the hostname of the server so that I remember which certificate belongs to which machine. One thing that the documentation doesn’t go over is assigning the IP for your Lighthouse. Because I think of the Lighthouse as more of a gateway, I assigned it 192.168.100.1 in the config file. This will be covered soon.
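The Lighthouse also needs its own certificate signed for that address; a minimal sketch, assuming the name “lighthouse” (the name itself is arbitrary):

./nebula-cert sign -name "lighthouse" -ip "192.168.100.1/24"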
There is a pre-generated configuration file located here. I simply copied this into a file inside of
/etc/nebula/
Edit the file as needed. Lines 7-9 will need to be modified for each host as each host will have its own certificate.
Line 20 will need to be the IP address of your Lighthouse and this will remain the same on every host. On line 26 you will need to change this to true for your Lighthouse. On all other hosts, this will remain false.
The other major thing I changed was to allow SSH traffic. There is an entire section about SSH in the configuration that I ignored; I simply added a firewall rule to the bottom of the file as follows:
- port: 22
  proto: tcp
  host: any

This rule is added below the 443 rule for HTTPS. Be sure to follow normal YAML indentation practices.
Once this is all in place you can execute your Nebula network by using the following command:
/etc/nebula/nebula -config /etc/nebula/config.yml
Start Nebula on your Lighthouse first and ensure it is up and running. Once it is running on your Lighthouse, run it on your host and you should see a connection handshake. Test by pinging your Lighthouse from your host and your host from your Lighthouse. I also tested file transfer using SCP, which verifies SSH connectivity.
Now, the most important thing that Slack doesn’t discuss is creating a systemd service for automatic startup. So I have included a basic one for you here:
[Unit]
Description=Nebula Service
[Service]
Restart=always
RestartSec=1
User=root
ExecStart=/etc/nebula/nebula -config /etc/nebula/config.yml
[Install]
WantedBy=multi-user.target
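Assuming you save this unit as /etc/systemd/system/nebula.service (the exact filename is up to you), installing and enabling it looks like:

cp nebula.service /etc/systemd/system/nebula.service
systemctl daemon-reload
systemctl enable --now nebula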
That’s it! I would love to hear about your implementations in the comments below!
Discovering DHCP Servers with NMAP
I was working at a client site where a device was receiving a new IP address via DHCP nearly every second. It was the only device on the network with this issue, but I decided to test for rogue DHCP servers anyway. If someone knows of a GUI tool to do this, let me know in the comments. I used the command-line utility Nmap to scan the network:
sudo nmap --script broadcast-dhcp-discover
The output should look something like:
Starting Nmap 7.70 ( https://nmap.org ) at 2019-11-25 15:52 EST
Pre-scan script results:
| broadcast-dhcp-discover:
| Response 1 of 1:
| IP Offered: 172.20.1.82
| DHCP Message Type: DHCPOFFER
| Server Identifier: 172.20.1.2
| IP Address Lease Time: 7d00h00m00s
| Subnet Mask: 255.255.255.0
| Time Offset: 4294949296
| Router: 172.20.1.2
| Domain Name Server: 8.8.8.8
| Renewal Time Value: 3d12h00m00s
|_ Rebinding Time Value: 6d03h00m00s

This was the test run on my local network, verifying that there is only one DHCP server. If there were multiple, we would see another response.
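If the machine has more than one network interface, you can point the scan at a specific one with Nmap’s -e flag (eth0 here is just an example):

sudo nmap -e eth0 --script broadcast-dhcp-discover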
Ultimately this was not the issue at my client site, but it is a function of Nmap that I had not used before.
Let me know your experiences with rogue DHCP in the comments!
Amazon S3 Backup from FreeNAS
I was chatting with my Dad about storage for his documents. He mentioned wanting to store them on my home NAS. I chuckled and stated that I would just push them up to the cloud because it would be cheaper and more reliable. When I got home that day I thought to myself how I would actually complete this task.
There are plenty of obvious tools to accomplish offsite backup. I want to push all of my home videos and pictures to an S3 bucket in my AWS environment. I could:
- Mount the S3 bucket using the drivers provided by AWS and then RSYNC the data across on a cron job.
- Utilize a FreeNAS plugin to drive the backup
- Build my own custom solution to the problem and re-invent the wheel!
It is clear the choice is going to be 3.
With the help of the Internet, I put together a simple Python script that backs up my data. I can then run it on a cron job to upload the files periodically. OR! I could Dockerize the script and run it as a container! Cue more overkill.
The result is something complicated for a simple backup task. But I like it and it works for my environment. One of the most important things is that I can point the script at one directory that houses many symlinks to other directories, so I only have to manage one backup point.
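The script itself is linked below, but the scheduling piece is simple either way; a sketch of both approaches, with the paths, schedule, and image name all as placeholders:

# crontab entry: run the backup script nightly at 2am
0 2 * * * /usr/bin/python3 /opt/s3-backup/backup.py

# or build and run the Dockerized version against the directory of symlinks
docker build -t s3-backup .
docker run --rm -v /mnt/backup-root:/data s3-backup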
Take a look at the GitHub link below and let me know your thoughts!
Lessons Learned from Migrating 17TB of Data
I finally pulled the trigger on some new hard drives for my home NAS. I am migrating from a 5U server down to a small desktop-sized NAS. Ultimately this removes the need for my 42U standing rack.
I did this transfer a year or so ago when I did a full rebuild of my server, but I forgot to take any notes on the process I used. Instant regret. I remembered using Rsync to do the actual transfer, and I assumed that I had mounted both the existing NAS and the new NAS over NFS, with both mounts residing inside a throwaway virtual machine on my application server.
I used the following Rsync command to start.
rsync --ignore-existing -ahzrvvv --progress {Source} {Destination}
To break this down a little bit:
--ignore-existing: This skips any files that already exist at the destination
-a: Archive flag. This preserves my data structure and file attributes
-h: Human readable. If this flag exists for a command, use it. It makes things much easier to use.
-z: Compression. There are a bunch of different compression options for Rsync. This one does enough for me.
-r: This makes Rsync copy files recursively through the directories (technically implied by -a, but harmless)
-vvv: I put triple verbose on because I was having so many issues.
--progress: This will show the number of files and the progress of the file that is currently being copied. Especially useful when copying large files.
Now, my command changed over time, but ultimately this is what I ended on. My source and destination were set to the respective NFS mounts and I hit [enter] to start the transfer. I left it running on the console of my virtual machine and walked away after I saw a handful of successful transfers. Assuming everything was going fine, I went about my day, as 17TB is going to take a while.
A few hours later I decided to check in on my transfer and saw that it had gotten stuck on a file after only 37KB of data had transferred! Frustrated, I restarted the process, only to see the same result later on.
After updating, downgrading, and modifying my command structure, I came to the realization that there must be an issue with transferring between two NFS shares.
I am still researching why this happens, but it seems as though when the transfer starts the files are pulled into a buffer somewhere within the Linux filesystem, that buffer maxes out, and the transfer stalls; almost as if the buffer can’t flush the files to the destination fast enough.
When I switched the destination to go over SSH instead of NFS to NFS, the transfer completed successfully.
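For reference, the working command looked roughly like this, with the source still a local NFS mount and the destination reached over SSH (the host and paths are placeholders):

rsync --ignore-existing -ahz --progress /mnt/oldnas/ user@newnas:/mnt/tank/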
If someone has some information regarding how this works I would love to learn more.