Michael Ransley

Install Zookeeper on Ubuntu 20.04

Fri, 25 Mar 2022 00:00:00 -0500

Apache Zookeeper is required to support a number of software packages. In my homelab I run it locally in the foreground so that I can check each component. I know that I could just as easily do this in k8s, but I like to turn components on and off as required (which, admittedly, I could also do in k8s).

This procedure assumes a working install of Ubuntu 20.04 with Java installed.

To install and run Zookeeper execute the following commands:

wget https://dlcdn.apache.org/zookeeper/zookeeper-3.8.0/apache-zookeeper-3.8.0-bin.tar.gz
tar xfvz apache-zookeeper-3.8.0-bin.tar.gz
ln -s apache-zookeeper-3.8.0-bin zookeeper
cd zookeeper
cp conf/zoo_sample.cfg conf/zoo.cfg
bin/zkServer.sh start-foreground

This should then start cleanly.
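
A quick way to confirm the server is up from another terminal is to check that something is listening on the client port. The following is a minimal sketch, assuming Zookeeper is running on the same machine with the default clientPort of 2181 from zoo_sample.cfg; the helper name port_is_open is my own and not part of Zookeeper.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Assumption: Zookeeper runs locally with the default clientPort (2181).
print("Zookeeper reachable:", port_is_open("localhost", 2181))
```

This only shows that the port accepts connections, not that the server is healthy, but it is enough to tell whether the foreground process is listening.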

Install Kafka on Ubuntu 20.04

Fri, 25 Mar 2022 00:00:00 -0500

Apache Kafka is a streaming platform that I am looking to play with in a home lab. In my homelab I run it locally in the foreground so that I can check each component. I know that I could just as easily do this in k8s, but I like to turn components on and off as required (which, admittedly, I could also do in k8s).

This procedure assumes a working install of Ubuntu 20.04 with Java and Zookeeper installed.

To install and run Kafka execute the following commands:

wget https://dlcdn.apache.org/kafka/3.1.0/kafka_2.13-3.1.0.tgz
tar xfvz kafka_2.13-3.1.0.tgz
ln -s kafka_2.13-3.1.0 kafka
cd kafka
bin/kafka-server-start.sh config/server.properties

Install Druid on Ubuntu 20.04

Fri, 25 Mar 2022 00:00:00 -0500

Apache Druid is a real-time analytics database that I am looking to play with in a home lab. In my homelab I run it locally in the foreground so that I can check each component. I know that I could just as easily do this in k8s, but I like to turn components on and off as required (which, admittedly, I could also do in k8s).

This procedure assumes a working install of Ubuntu 20.04 with Java 8 and Zookeeper installed.

To install and run Druid execute the following commands:

wget https://dlcdn.apache.org/druid/0.22.1/apache-druid-0.22.1-bin.tar.gz
tar xfvz apache-druid-0.22.1-bin.tar.gz
ln -s apache-druid-0.22.1 druid
cd druid

Edit the file conf/supervise/single-server/small.conf and comment out the following lines by prefixing each with a #:

:verify bin/verify-default-ports
!p10 zk bin/run-zk conf

Both of these assume that we are not running a shared zookeeper instance. To start the server run the following command:

bin/start-single-server-small

To open the druid console open a browser to http://localhost:8888/

Install K3S on Ubuntu 20.04

Mon, 28 Feb 2022 00:00:00 -0600

I am mucking around a bit with a homelab and thought I would look into K3s. To start, I created an absolutely vanilla Ubuntu 20.04 Server instance with 8 GB of memory and 30 GB of disk. Then I ran the following command:

curl -sfL https://get.k3s.io | sh -

The output of the command is as follows:

[INFO]  Finding release for channel stable
[INFO]  Using v1.22.6+k3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.22.6+k3s1/sha256sum-amd64.txt
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.22.6+k3s1/k3s
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Skipping installation of SELinux RPM
[INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[INFO]  Creating /usr/local/bin/crictl symlink to k3s
[INFO]  Creating /usr/local/bin/ctr symlink to k3s
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s

You should then be able to query the node details by running:

sudo k3s kubectl get node

Which should output something like the following:

NAME   STATUS   ROLES                  AGE    VERSION
k3s    Ready    control-plane,master   3m4s   v1.22.6+k3s1
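
If you want to check readiness from a script rather than by eye, the same information is available as JSON. The following is a minimal sketch, assuming K3s is installed as above and that sudo is available; the function names node_readiness and fetch_nodes are my own.

```python
import json
import subprocess


def node_readiness(node_list: dict) -> dict:
    """Map each node name to True if its Ready condition has status "True"."""
    readiness = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        readiness[name] = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in conditions
        )
    return readiness


def fetch_nodes() -> dict:
    """Run kubectl through k3s and parse its JSON output (needs K3s and sudo)."""
    out = subprocess.run(
        ["sudo", "k3s", "kubectl", "get", "node", "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)
```

Calling node_readiness(fetch_nodes()) on the server should return a dictionary with one entry per node.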

Install Tanzu on Ubuntu 20.04

Fri, 25 Feb 2022 00:00:00 -0600

I am mucking around a bit with a homelab and thought I would look into Tanzu. To start, I created an absolutely vanilla Ubuntu 20.04 Server instance with 8 GB of memory and 30 GB of disk. I then installed some prerequisite software:

To install Tanzu run the following command:

brew install vmware-tanzu/tanzu/tanzu-community-edition

This command will output something similar to the following:

==> ******************************************************************************
==> * To initialize all plugins required by Tanzu Community Edition, an additional
==> * step is required. To complete the installation, please run the following
==> * shell script:
==> *
==> * /home/linuxbrew/.linuxbrew/Cellar/tanzu-community-edition/v0.10.0/libexec/configure-tce.sh
==> *
==> ******************************************************************************
==>

Run that command, then run the following command:

tanzu management-cluster create --ui --bind 0.0.0.0:8080

Note: The bind address is required because my Ubuntu server does not have a GUI; I am accessing it via SSH.

To run the management console open up a browser and point to http://<tanzu_server_ip>:8080 and it should display something similar to the following:

Click on the Docker deploy option and this moves you through a prerequisite check, cluster naming and some network configuration; I accepted the defaults.

I then reviewed the configuration and chose “Deploy Management Cluster” as shown below:

This then moves through a deployment process which takes some time (minutes in my case). Logs can be viewed either in the GUI or on the command line, and when the process completes the command executed earlier will exit.

To validate that the cluster is running, run the following command:

tanzu management-cluster get

For my server this returned the following information:

NAME     NAMESPACE   STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES
ransley  tkg-system  running  1/1           1/1      v1.21.5+vmware.1  management


Details:

NAME                                                        READY  SEVERITY  REASON  SINCE  MESSAGE
/ransley                                                    True                     2m55s
├─ClusterInfrastructure - DockerCluster/ransley             True                     3m5s
├─ControlPlane - KubeadmControlPlane/ransley-control-plane  True                     2m55s
│ └─Machine/ransley-control-plane-n7sgp                     True                     2m59s
└─Workers
  └─MachineDeployment/ransley-md-0
    └─Machine/ransley-md-0-579f95df8b-l7wj8                 True                     2m59s


Providers:

NAMESPACE                          NAME                   TYPE                    PROVIDERNAME  VERSION  WATCHNAMESPACE
capd-system                        infrastructure-docker  InfrastructureProvider  docker        v0.3.23
capi-kubeadm-bootstrap-system      bootstrap-kubeadm      BootstrapProvider       kubeadm       v0.3.23
capi-kubeadm-control-plane-system  control-plane-kubeadm  ControlPlaneProvider    kubeadm       v0.3.23
capi-system                        cluster-api            CoreProvider            cluster-api   v0.3.23

To configure kubectl run the following command:

tanzu management-cluster kubeconfig get <MGMT-CLUSTER-NAME> --admin

So in my case:

tanzu management-cluster kubeconfig get ransley --admin

I was then able to run:

kubectl get nodes

Which returned the following:

NAME                            STATUS   ROLES                  AGE   VERSION
ransley-control-plane-n7sgp     Ready    control-plane,master   11m   v1.21.5+vmware.1
ransley-md-0-579f95df8b-l7wj8   Ready    <none>                 10m   v1.21.5+vmware.1

To create a workload cluster run the following command:

tanzu cluster create ransleywl --plan dev

This command took a few minutes to run and its output was as follows:

Validating configuration...
Warning: Pinniped configuration not found. Skipping pinniped configuration in workload cluster. Please refer to the documentation to check if you can configure pinniped on workload cluster manually
Creating workload cluster 'ransleywl'...
Waiting for cluster to be initialized...
[cluster control plane is still being initialized: WaitingForControlPlane, cluster infrastructure is still being provisioned: WaitingForControlPlane]
cluster control plane is still being initialized: Bootstrapping @ Machine/ransleywl-control-plane-vmhz7
Waiting for cluster nodes to be available...
Waiting for addons installation...
Waiting for packages to be up and running...

Workload cluster 'ransleywl' created

To verify the cluster started successfully run the following command:

tanzu cluster list

Which will output the following:

NAME       NAMESPACE  STATUS   CONTROLPLANE  WORKERS  KUBERNETES        ROLES   PLAN
ransleywl  default    running  1/1           1/1      v1.21.5+vmware.1  <none>  dev

To update kubectl for the new cluster run the following command:

tanzu cluster kubeconfig get ransleywl --admin
kubectl config use-context ransleywl-admin@ransleywl

To show the pods in the cluster run the following command:

kubectl get pods --all-namespaces

Which output the following:

NAMESPACE      NAME                                                    READY   STATUS    RESTARTS   AGE
kube-system    antrea-agent-52zjf                                      2/2     Running   0          6m5s
kube-system    antrea-agent-6r7cf                                      2/2     Running   0          6m5s
kube-system    antrea-controller-7dc9d9c8d7-9frm2                      1/1     Running   0          6m4s
kube-system    coredns-657879bf57-9fvdm                                1/1     Running   0          13m
kube-system    coredns-657879bf57-wmnct                                1/1     Running   0          13m
kube-system    etcd-ransleywl-control-plane-vmhz7                      1/1     Running   0          13m
kube-system    kube-apiserver-ransleywl-control-plane-vmhz7            1/1     Running   0          13m
kube-system    kube-controller-manager-ransleywl-control-plane-vmhz7   1/1     Running   0          13m
kube-system    kube-proxy-cmq8g                                        1/1     Running   0          13m
kube-system    kube-proxy-n64z9                                        1/1     Running   0          12m
kube-system    kube-scheduler-ransleywl-control-plane-vmhz7            1/1     Running   0          13m
kube-system    metrics-server-649578757b-l844s                         1/1     Running   0          6m13s
tanzu-system   secretgen-controller-f44c8b9c6-b84hw                    1/1     Running   0          7m46s
tkg-system     kapp-controller-f8d47f95c-v9qlf                         1/1     Running   0          12m
tkg-system     tanzu-capabilities-controller-manager-7959d6b44-v96qm   1/1     Running   0          13m

Install Kubectl on Ubuntu 20.04

Fri, 25 Feb 2022 00:00:00 -0600

To install kubectl on Ubuntu run the following commands:

sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
sudo apt install -y kubectl

To verify that it works run the following command:

kubectl

And it should respond with something similar to the following:

kubectl controls the Kubernetes cluster manager.

Find more information at: https://kubernetes.io/docs/reference/kubectl/overview/

Basic Commands (Beginner):
create        Create a resource from a file or from stdin
expose        Take a replication controller, service, deployment or pod and expose it as a new Kubernetes service
run           Run a particular image on the cluster
set           Set specific features on objects

Basic Commands (Intermediate):
explain       Get documentation for a resource
get           Display one or many resources
edit          Edit a resource on the server
delete        Delete resources by file names, stdin, resources and names, or by resources and label selector

Deploy Commands:
rollout       Manage the rollout of a resource
scale         Set a new size for a deployment, replica set, or replication controller
autoscale     Auto-scale a deployment, replica set, stateful set, or replication controller

Cluster Management Commands:
certificate   Modify certificate resources.
cluster-info  Display cluster information
top           Display resource (CPU/memory) usage
cordon        Mark node as unschedulable
uncordon      Mark node as schedulable
drain         Drain node in preparation for maintenance
taint         Update the taints on one or more nodes

Troubleshooting and Debugging Commands:
describe      Show details of a specific resource or group of resources
logs          Print the logs for a container in a pod
attach        Attach to a running container
exec          Execute a command in a container
port-forward  Forward one or more local ports to a pod
proxy         Run a proxy to the Kubernetes API server
cp            Copy files and directories to and from containers
auth          Inspect authorization
debug         Create debugging sessions for troubleshooting workloads and nodes

Advanced Commands:
diff          Diff the live version against a would-be applied version
apply         Apply a configuration to a resource by file name or stdin
patch         Update fields of a resource
replace       Replace a resource by file name or stdin
wait          Experimental: Wait for a specific condition on one or many resources
kustomize     Build a kustomization target from a directory or URL.

Settings Commands:
label         Update the labels on a resource
annotate      Update the annotations on a resource
completion    Output shell completion code for the specified shell (bash, zsh or fish)

Other Commands:
alpha         Commands for features in alpha
api-resources Print the supported API resources on the server
api-versions  Print the supported API versions on the server, in the form of "group/version"
config        Modify kubeconfig files
plugin        Provides utilities for interacting with plugins
version       Print the client and server version information

Usage:
kubectl [flags] [options]

Use "kubectl <command> --help" for more information about a given command.
Use "kubectl options" for a list of global command-line options (applies to all commands).

Install Kubeapps on Kubernetes

Fri, 25 Feb 2022 00:00:00 -0600

I am mucking around a bit with a homelab and thought I would look into Tanzu and Kubeapps. To start, I created an absolutely vanilla Ubuntu 20.04 Server instance with 8 GB of memory and 30 GB of disk. I then installed some prerequisite software:

Tanzu or K3S
Helm

To install kubeapps run the following commands:

helm repo add bitnami https://charts.bitnami.com/bitnami
kubectl create namespace kubeapps
helm install kubeapps --namespace kubeapps bitnami/kubeapps --set useHelm3=true

To create a service account, run the following commands:

kubectl create serviceaccount kubeapps-operator
kubectl create clusterrolebinding kubeapps-operator --clusterrole=cluster-admin --serviceaccount=default:kubeapps-operator

Then to expose the service so that you can access it:

kubectl port-forward --namespace kubeapps service/kubeapps --address 0.0.0.0 8080:80

Then to expose the token for authentication:

kubectl get secret $(kubectl get serviceaccount kubeapps-operator -o jsonpath='{range .secrets[*]}{.name}{"\n"}{end}' | grep kubeapps-operator-token) -o jsonpath='{.data.token}' -o go-template='{{.data.token | base64decode}}' && echo

Then open a browser and navigate to http://tanzu_server:8080. It will prompt you for a token; enter the token that was printed by the previous command. You should see a page similar to the following:

Install Helm on Ubuntu 20.04

Fri, 25 Feb 2022 00:00:00 -0600

I am playing around with a homelab with Kubernetes and Ubuntu and thought I would add Helm into the configuration. To start, I created an absolutely vanilla Ubuntu 20.04 Server instance with 8 GB of memory and 30 GB of disk. I then installed some prerequisite software:

To install Helm run the following command:

brew install helm

To verify that helm is installed run the following command:

helm version

And it should return something similar to the following:

version.BuildInfo{Version:"v3.8.0", GitCommit:"d14138609b01886f544b2025f5000351c9eb092e", GitTreeState:"clean", GoVersion:"go1.17.6"}

Install Docker on Ubuntu 20.04

Fri, 25 Feb 2022 00:00:00 -0600

There are of course a lot of instructions for doing this, but I thought I would write down the way that I do it for anyone else who may be interested.

sudo apt install docker.io
sudo groupadd docker
sudo usermod -aG docker $USER

Restart the shell. If you are using the Ubuntu GUI then you will need to log out and log back in.

To test that it is working run the following command:

docker run hello-world

If this works you should see output similar to the following:

Unable to find image 'hello-world:latest' locally
latest: Pulling from library/hello-world
2db29710123e: Pull complete
Digest: sha256:97a379f4f88575512824f3b352bc03cd75e239179eea0fecc38e597b2209f49a
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
1. The Docker client contacted the Docker daemon.
2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
    (amd64)
3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
$ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
https://hub.docker.com/

For more examples and ideas, visit:
https://docs.docker.com/get-started/

Install Brew on Ubuntu 20.04

Fri, 25 Feb 2022 00:00:00 -0600

Now my experience of Brew is mainly related to OSX, but it can be used on Linux as well. To install it run the following commands:

sudo apt-get install build-essential
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
echo 'eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"' >> $HOME/.profile
eval "$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)"

To use native tools with K3s, make the K3s kubeconfig readable by running the following:

sudo chmod o+r /etc/rancher/k3s/k3s.yaml

In order to access kubectl without using sudo run the following command:

export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

CloudWatch Log Retention and Lambda

Thu, 29 Oct 2020 00:00:00 -0500

One of the great things about Lambda is that when you create a function it automatically creates an associated CloudWatch Log Group so that the output of your function is captured. The problem is that by default this log group will have a retention of “forever”, which can be costly for a function that is either run a lot or writes a lot of information to CloudWatch.

To fix this, what you need to do is to update the log retention and the easiest way to do this is through the console, but this obviously isn’t the right way.

If we assume that you have already run your function and you deployed it using terraform but you want to fix up the log retention then this is how you do it:

Create the CloudWatch Log Group Definition

resource "aws_cloudwatch_log_group" "my_log_group" {
    name = "/aws/lambda/my-function"
    retention_in_days = 30
}

Now, if you run this then you will get an error saying that a duplicate resource exists - which it does!

Import the existing resource into terraform

For a terraform resource that is in a module:

terraform import module.datalake_firehose.aws_cloudwatch_log_group.my_log_group "/aws/lambda/my-function"

For a terraform resource that is not in a module:

terraform import aws_cloudwatch_log_group.my_log_group "/aws/lambda/my-function"

Execution

If you then run terraform apply then this should update the retention on the CloudWatch log group.
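
If you have more than a couple of functions to fix, typing the import commands by hand gets tedious. The following is a minimal sketch that builds the same terraform import commands shown above for a list of functions. The helper name terraform_import_command and the idea of one Terraform resource per function are my own assumptions; it only prints the commands and does not run them.

```python
import shlex
from typing import Optional


def terraform_import_command(resource_name: str, function_name: str,
                             module: Optional[str] = None) -> str:
    """Build the terraform import command for a Lambda function's log group."""
    address = f"aws_cloudwatch_log_group.{resource_name}"
    if module:
        # Resources inside a module are addressed as module.<name>.<resource>.
        address = f"module.{module}.{address}"
    log_group = f"/aws/lambda/{function_name}"
    return f"terraform import {shlex.quote(address)} {shlex.quote(log_group)}"


# Hypothetical mapping of Terraform resource names to Lambda function names.
functions = {"my_log_group": "my-function"}
for resource, function in functions.items():
    print(terraform_import_command(resource, function))
    print(terraform_import_command(resource, function, module="datalake_firehose"))
```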

AWS Lake Formation

Tue, 20 Aug 2019 00:00:00 -0500

AWS continues to raise the bar across a whole lot of technology segments and in AWS Lake Formation they have created a one-stop shop for the creation of Data Lakes. As always, AWS is further abstracting their services to provide more and more customer value. The evolution of this process can be seen by looking at AWS Glue.

Note: This document was written in August 2019 - if you are reading this at some distant time in the future, functionality may (will?) have changed.

AWS Glue, a History

Back in the day, when EC2 launched it was a massive game changer. The ability to provision instances in seconds through an API was a revolution. It was no surprise that many EC2 users ran Hadoop workloads on it, which could benefit from AWS scale to crunch large volumes of data and then terminate when the work had been completed.

AWS recognised this need and created EMR (Elastic Map Reduce). No longer did we need to configure Hadoop clusters; now all we had to specify through the API was pretty much how many worker nodes we wanted, and AWS magic presented us with a cluster with all the node configuration done for us. Support for layering additional software onto EMR at instance creation time was added as well.

Once again, AWS looked at the workloads and realised that many people were using EMR to run Apache Spark jobs. From this they created Glue, which is effectively a managed Spark implementation (with extensions). Interestingly, they moved away from a raw CPU pricing mechanism to a DPU (Data Processing Unit), which is 4 vCPUs and 16 GB of RAM, but your jobs must run a minimum of 2 DPUs for a minimum of 10 minutes (see pricing for more information).

Back to Data Lakes

Obviously, a massive use case with AWS is Data Lakes and data processing platforms generally. A standard design for data lakes was to use S3 for storage, EMR/Glue for data processing and the AWS Glue Data Catalog as a metadata store. AWS has rolled these services into a single unified data lake approach called AWS Lake Formation.

One thing to note about AWS Lake Formation is that while it is a product itself, it is more of an orchestration layer and interface across a whole lot of AWS tools, as shown in the diagram below:

You can see that Lake Formation contains the building blocks of AWS data platform:

The Source crawlers are Glue Crawlers.
The ETL and Data Prep are probably provided by Glue Jobs.
The Data catalog is probably the Glue Data Catalog.
The Security Settings and Access Control are probably provided by a combination of the Glue Data Catalog and AWS IAM.

Obviously, the other components are named above. So what Lake Formation provides is an orchestration and management layer across these services. Once again, AWS is raising the bar in their platform and making it even more usable for customers.

Setup

AWS Lake Formation has a 3 step setup:

Register your AWS storage.
Create a database
Grant permissions

Register your AWS storage

Interestingly, this requires you to have already created an S3 bucket; it doesn’t create one for you. For the purposes of this investigation I have created a bucket called cmd-lake-formation-demo. Click on “Register Location” and enter the path of your bucket - s3://cmd-lake-formation-demo:

Clicking “Register Location” creates the location, but when I click on the dashboard again it unfortunately doesn’t show me that a location has been set up. Anyway, I know that I have set up a location so I will roll on to step 2 of the setup.

Create a database

Once the storage is registered, you need to create a database to store the metadata. This is done by clicking on the “Create Database” button under stage 2. I have placed the database into the bucket that I created above and have gone with some sensible values, as shown below:

Grant Permissions

Next step is to grant permission, once again I have chosen the database that we have created earlier and gone with sensible defaults:

Interestingly, the screenshot above gives a little nugget of information “Active Directory Users and Groups (EMR Beta Only)” - I am guessing that there is an EMR beta that hooks all the software up to Active Directory for those applications that have a user interface - Spark etc, but most importantly Zeppelin and Jupyter! Note: Another post will be done on this when it becomes Generally Available.

Ignoring the above, I clicked “Grant” and it came back to the permissions section. The dashboard is unfortunately still showing the setup but I suppose there will be the need to add additional locations, permissions and databases as the system is used.

Ingesting Data

Lake Formation appears to have three methods for loading data into the lake.

AWS Glue Crawlers.
AWS Glue Jobs.
Blueprints

Obviously the crawlers and jobs are existing technology that has been around for a little while, but the blueprints are interesting…

The common ones are going to be the database snapshot and incremental database loads in my experience. Now looking at the dialog it is clear that this uses AWS Glue for the data movement and hence you will be paying Glue Pricing for the database synchronizations (i.e. 2 * DPU * USD $0.44/hr with a 10 minute minimum). This means that frequent syncing for your data may get pricey, especially if you are syncing lots of individual and small data sources.
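
To put a rough number on that, the pricing formula above can be turned into a small calculation. This is a sketch only: the 2 DPU, USD $0.44 per DPU-hour and 10 minute minimum figures are the ones quoted above, while the 3 minute runtime, the hourly schedule and the count of 20 data sources are made-up inputs for illustration.

```python
def glue_run_cost(runtime_minutes: float, dpus: int = 2,
                  usd_per_dpu_hour: float = 0.44,
                  minimum_minutes: float = 10.0) -> float:
    """Cost in USD of one Glue run, applying the minimum billed duration."""
    billed_minutes = max(runtime_minutes, minimum_minutes)
    return dpus * usd_per_dpu_hour * billed_minutes / 60.0


# A sync that only takes 3 minutes is still billed for 10 minutes.
per_run = glue_run_cost(runtime_minutes=3)
print(f"one short sync: ${per_run:.4f}")

# Hypothetical workload: 20 small sources, each synced hourly for 30 days.
print(f"20 sources, hourly, 30 days: ${per_run * 24 * 30 * 20:.2f}")
```

Each short run costs roughly 15 cents because of the minimum, so the total is driven almost entirely by how often you sync and how many sources you have rather than by how much data moves.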

I shall look into some of these blueprints in a future post.

Creating SSM Parameters

Thu, 27 Jun 2019 00:00:00 -0500

There are many services that I really like in AWS, but one that doesn’t get a lot of attention is AWS Systems Manager. While I can’t say that I regularly use all the features of this service, two that I use a lot are Session Manager and the Parameter Store.

For this post, I will show how to create a value in the parameter store - I use this all the time because it is better than storing passwords in plain text and access to the credential is actually audited.

To create a value in parameter store you run the following command from the CLI:

aws ssm put-parameter --name mysecurepassword --type SecureString --value "passw0rd"

This will create the password in the Systems Manager Parameter store. Access to the credential can be controlled by IAM and if you have enabled it you can audit accesses to it via CloudTrail.
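
Reading the value back from code is just as short. The following is a minimal sketch using the boto3 SSM client's get_parameter call with decryption enabled. The helper name read_secure_parameter is my own, and the client is passed in as an argument so that the function itself has no dependency on how credentials are configured.

```python
def read_secure_parameter(ssm_client, name: str) -> str:
    """Fetch a SecureString parameter and return its decrypted value."""
    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    return response["Parameter"]["Value"]


# Usage (assumes boto3 is installed and AWS credentials are configured):
#   import boto3
#   password = read_secure_parameter(boto3.client("ssm"), "mysecurepassword")
```

The caller needs IAM permission for ssm:GetParameter on the parameter, plus permission to decrypt with the KMS key that protects it.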

S3 Events and Error 403

Wed, 26 Jun 2019 00:00:00 -0500

I have been performing some data transformations using Lambda and S3 Events and for certain S3 keys I noticed that I was getting an Error 403 in my code:

[ERROR] ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
Traceback (most recent call last):
    File "/var/task/index.py", line 159, in handler
        if s3_obj.content_length < (400 * MB):
    File "/var/runtime/boto3/resources/factory.py", line 339, in property_loader
        self.load()
    File "/var/runtime/boto3/resources/factory.py", line 505, in do_action
        response = action(self, *args, **kwargs)
    File "/var/runtime/boto3/resources/action.py", line 83, in __call__
        response = getattr(parent.meta.client, operation_name)(**params)
    File "/var/runtime/botocore/client.py", line 320, in _api_call
        return self._make_api_call(operation_name, kwargs)
    File "/var/runtime/botocore/client.py", line 623, in _make_api_call
        raise error_class(parsed_response, operation_name)

The interesting thing that came from this was that the error seemed to specifically happen on files that contained special (i.e. non-alphanumeric) characters. Research indicated that the characters were actually valid - see https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html - so it isn’t an issue with the key name itself.

It turns out that the issue is that the key in the S3 Event is URL encoded, so when you attempt to use it as-is in boto3 the lookup fails. The solution is to decode the key first:

from urllib import parse
import boto3

S3_RESOURCE = boto3.resource('s3')

def handler(event, context): # pylint: disable=unused-argument
    """
    Lambda entry point.
    """
    for record in event["Records"]:
        # Lambda struggles to process a file over 400 MB, so am not going to process files larger
        # than that.
        bucket = record["s3"]["bucket"]["name"]
        # Keys in S3 event notifications are URL encoded and spaces arrive as "+",
        # so unquote_plus is used to decode both percent-escapes and "+".
        key = parse.unquote_plus(record["s3"]["object"]["key"])
        s3_obj = S3_RESOURCE.Object(bucket, key)

The s3_obj above will then start working.
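
One detail worth seeing in isolation is how the two decoding functions in urllib treat a key containing spaces, since S3 event notifications encode spaces as "+". The following is a small sketch; the key is a made-up example and not one from my bucket.

```python
from urllib import parse

# Hypothetical key as it might appear in the event for an object named
# "reports/Q1 summary (final).csv".
encoded_key = "reports/Q1+summary+%28final%29.csv"

print(parse.unquote(encoded_key))       # decodes %28 and %29 but leaves "+" in place
print(parse.unquote_plus(encoded_key))  # also turns "+" back into spaces
```

If your keys can contain spaces, unquote_plus is the one that gives back the original object name.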

Backup for SQL Server RDS

Sat, 15 Dec 2018 00:00:00 -0600

Earlier I described how I performed a restore of a database into Amazon RDS for SQL Server. While I don’t play in SQL Server a lot I had a different requirement from another client a few weeks later that required me to perform a backup of an existing RDS instance.

Now, for those that have never done it, there are effectively two steps to performing the backup:

Starting the backup process by calling the stored procedure msdb.dbo.rds_backup_database.
Monitoring the progress of the backup to ensure that it completes successfully by calling the stored procedure msdb.dbo.rds_task_status.

There are a few error conditions that can occur and most of them relate to insufficient permissions on the S3 bucket.

The following script will perform a synchronous backup of RDS on SQL Server and will wait for the backup to complete before it returns a success. We ran this code from a GitLab runner.

Some points to note:

Line 3 needs to be updated for your hostname.
Line 4 needs to be updated to a user with access to perform a backup of the database.
Line 5 needs to be updated to be the password for the user defined in Line 4.
Line 6 needs to be updated for the ARN of the bucket and key to backup to.
Line 7 needs to be updated to the database that needs backing up.
Line 10 actually performs the backup.

The rest of the script after line 10 will check the backup to see if it is successful. If the backup fails (generally related to S3 permissions) then the exit code should be 1, otherwise the exit code will be 2. The script expects the backup to complete in 60 minutes.
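
As a rough illustration of the start-then-poll pattern described above (this is not the script that the line notes refer to), the wait loop might look like the following sketch. The database name, bucket and file name in the SQL strings are placeholders, get_lifecycle is a hypothetical callable that runs the status query and returns the lifecycle column, and the 30 second poll interval is an assumption.

```python
import time
from typing import Callable

# T-SQL for the two stored procedures; the names in quotes are placeholders.
BACKUP_SQL = (
    "exec msdb.dbo.rds_backup_database "
    "@source_db_name='mydatabase', "
    "@s3_arn_to_backup_to='arn:aws:s3:::bucket_name/mydatabase.bak';"
)
STATUS_SQL = "exec msdb.dbo.rds_task_status @db_name='mydatabase';"


def wait_for_backup(get_lifecycle: Callable[[], str],
                    timeout_seconds: float = 60 * 60,
                    poll_seconds: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the task reports SUCCESS (True) or fails or times out (False)."""
    waited = 0.0
    while waited <= timeout_seconds:
        lifecycle = get_lifecycle().upper()
        if lifecycle == "SUCCESS":
            return True
        if lifecycle in ("ERROR", "CANCELLED"):
            return False
        # CREATED and IN_PROGRESS mean the task is still running.
        sleep(poll_seconds)
        waited += poll_seconds
    return False
```

A wrapper would run BACKUP_SQL once, then call wait_for_backup with a function that runs STATUS_SQL through whatever SQL client is available, and turn the boolean into the process exit code.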

Restoring SQL Server Databases in RDS from S3

Mon, 19 Nov 2018 00:00:00 -0600

I don’t play in RDS that much, but every once in a while I jump back into it when a client needs me to do so. I knew that there was a way of restoring a SQL Server database from an S3 location, but for the life of me I could not remember how it was done. For added bonus points I decided that I should try interacting with the database completely without a Windows host.

The trick to ensure that the database can be restored is to assign an Option Group to the RDS Instance:

RDSOptionGroup:
  Type: "AWS::RDS::OptionGroup"
  Properties:
    EngineName: "sqlserver-web"
    MajorEngineVersion: "13.00"
    OptionGroupDescription: "DB Option Group for nonprod-test-appname"
    OptionConfigurations:
        - OptionName: SQLSERVER_BACKUP_RESTORE
          OptionSettings:
          - Name: IAM_ROLE_ARN
            Value: arn:aws:iam::123456789012:role/client-role-appname-rdsrestore
    Tags:
      - Key: Name
        Value: "nonprod-test-appname"

RDSInstance:
  Type: "AWS::RDS::DBInstance"
  Properties:
    AllowMajorVersionUpgrade: "False"
    AutoMinorVersionUpgrade: "True"
    CopyTagsToSnapshot: "True"
    Engine: sqlserver-web
    EngineVersion: 13.00.4466.4.v1
    DBInstanceClass: "db.t2.small"
    DBSubnetGroupName: !Ref RDSSubnetGroup
    MultiAZ: "False"
    OptionGroupName: !Ref RDSOptionGroup
    PubliclyAccessible: "False"
    Tags:
      - Key: "Name"
        Value: "nonprod-test-app"
    DBInstanceIdentifier: "nonprod-test-app"
    AllocatedStorage: "20"
    BackupRetentionPeriod: "35"
    MasterUserPassword: !Ref MasterUserPassword
    MasterUsername: DBUser
    StorageEncrypted: True
    StorageType: "gp2"
    VPCSecurityGroups:
      - sg-0123456780abcdefg

Now, the bits to note above are that a MajorEngineVersion of 13.00 is actually SQL Server 2016, and that the OptionConfigurations entry points to an IAM role. It is this IAM role that RDS uses to actually perform the backup or restore operation. The IAM role needs to be similar to the following:

UserRoleAppRdsRestore:
  Type: "AWS::IAM::Role"
  Properties:
    RoleName: "client-role-appname-rdsrestore"
    AssumeRolePolicyDocument:
      Statement:
        - Action: [ "sts:AssumeRole" ]
          Effect: Allow
          Principal:
            Service: [ "rds.amazonaws.com" ]

UserPolicyAppRdsRestore:
  Type: "AWS::IAM::Policy"
  Properties:
    Roles:
        - !Ref UserRoleAppRdsRestore
    PolicyName: "APP-POLICY-RDSRESTORE"
    PolicyDocument:
        Statement:
          - Action:
            - "s3:PutObject"
            - "s3:GetObject"
            - "s3:GetObjectMetaData"
            - "s3:AbortMultipartUpload"
            - "s3:ListMultipartUploadParts"
            - "s3:ListBucket"
            - "s3:GetBucketLocation"
            Effect: Allow
            Resource:
              - "arn:aws:s3:::bucket_name"
              - "arn:aws:s3:::bucket_name/*"

Once this is deployed, log into an instance that has network access to the RDS instance and run the following commands (note: This assumes Docker is installed and running on the host):

docker run -it mcr.microsoft.com/mssql-tools

Then run the following to connect to the host:

sqlcmd -S nonprod-test-app.abcdefghijk.ap-southeast-2.rds.amazonaws.com -U dbusername

At this point you should be logged into the database, then type the following commands:

use master
go
exec msdb.dbo.rds_restore_database @restore_db_name='dbname', @S3_arn_to_restore_from='arn:aws:s3:::bucket_name/folder/db.bak'
go

To check the status of the restore run the following command inside the same window:

exec msdb.dbo.rds_task_status

And hopefully your database should be successfully restored.
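
The status check above can also be scripted rather than run by hand. The following is a minimal Python sketch, not part of the original procedure: `wait_for_task` and `fetch_lifecycle` are hypothetical names, and `fetch_lifecycle` stands in for whatever code runs `exec msdb.dbo.rds_task_status` and returns the value of the lifecycle column for the restore task. It assumes SUCCESS, ERROR and CANCELLED are the terminal lifecycle values.

```python
import time

# Terminal values of the "lifecycle" column (an assumption about the
# rds_task_status output, not something taken from the post).
TERMINAL_STATES = {"SUCCESS", "ERROR", "CANCELLED"}


def wait_for_task(fetch_lifecycle, timeout_s=3600, poll_s=30, sleep=time.sleep):
    """Poll until the task reaches a terminal state or the timeout expires.

    fetch_lifecycle is a hypothetical callable that runs
    "exec msdb.dbo.rds_task_status" and returns the lifecycle string.
    Returns the final lifecycle value, or "TIMEOUT" if none was reached.
    """
    waited = 0
    while waited <= timeout_s:
        state = fetch_lifecycle()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_s)
        waited += poll_s
    return "TIMEOUT"


if __name__ == "__main__":
    # Stand-in for a real query: pretend the restore finishes on the third poll.
    states = iter(["CREATED", "IN_PROGRESS", "SUCCESS"])
    print(wait_for_task(lambda: next(states), sleep=lambda s: None))
```

Passing the query in as a callable keeps the loop independent of how you connect to the database, so the same function works whether the query goes through sqlcmd in the container or a Python driver.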

PySpark and Glue together

Tue, 28 Aug 2018 00:00:00 -0500

I have been playing around with Spark (in EMR) and the Glue Data Catalog a bit and I really like using them together. Being able to use EMR to transform the data and then query it from Spark, Glue or Athena (and, through Athena, from anything that speaks JDBC) is a real winner.

That said, it isn’t really clear how you access and update the Glue Data Catalog from within EMR. This post will hopefully point you in the right direction.

EMR Setup

To work with the Glue Data Catalog the EMR cluster needs to be configured to use it. There are details for this at AWS, but within automation the following values need to be put into the cluster:

- Classification: "hive-site"
  Properties:
    'hive.metastore.client.factory.class': "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
- Classification: "spark-hive-site"
  Properties:
    'hive.metastore.client.factory.class': "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
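
If the cluster is created from Python rather than from a template, the same two classifications could be expressed as a list of dictionaries. This is a minimal sketch under that assumption: `glue_catalog_configurations` is a hypothetical helper name, and the result is shaped like the `Configurations` argument accepted by boto3's EMR `run_job_flow` call, which is not invoked here.

```python
import json

# Same factory class as in the configuration above.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations():
    """Return EMR configuration entries pointing Hive and Spark at Glue."""
    return [
        {
            "Classification": classification,
            "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
        }
        for classification in ("hive-site", "spark-hive-site")
    ]


if __name__ == "__main__":
    # Print the structure as JSON so it can be compared with the template.
    print(json.dumps(glue_catalog_configurations(), indent=2))
```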

Accessing glue tables from EMR

To access the tables from within a Spark step you need to instantiate the spark session with the glue catalog:

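from pyspark.sql import SparkSession

job_name = "my-job"  # placeholder application name

# Build a session that uses the Glue Data Catalog as its Hive metastore
spark = SparkSession.builder \
        .appName(job_name) \
        .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory") \
        .enableHiveSupport() \
        .getOrCreate()
spark.catalog.setCurrentDatabase("mydatabase")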

The database name above is the name of the database within the Glue Data Catalog.

Now that the Spark Session is set up correctly you can query the Glue catalog through Spark SQL:

spark.sql("select * from mytable")

Note, you don’t need to specify the location of the table, all this information is stored in the Glue Data Catalog.

Creating Glue Data Catalog Tables from Spark on EMR

Now, the prevailing wisdom is that you use the Glue crawlers to update the data catalog. My feeling is that, where possible, the catalog should be updated by the process that is actually landing (or modifying) the data. The advantage is that subsequent steps can execute and use the updated catalog without needing to run a crawler.

To create the table in glue and save the data into parquet, run the following command:

dataframe.write.mode("overwrite").format("parquet").option("path", parquet_path).saveAsTable(glue_table)
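
To illustrate the point about subsequent steps, the following sketch writes the table and then reads it straight back through the catalog by name, with no crawler in between. `land_and_count` is a hypothetical helper name; `spark` is assumed to be the session created earlier and `dataframe` any Spark DataFrame.

```python
def land_and_count(spark, dataframe, parquet_path, glue_table):
    """Write dataframe as a parquet-backed catalog table, then count its rows.

    The read goes through the catalog by table name, so it only works if
    saveAsTable registered the table.
    """
    (
        dataframe.write
        .mode("overwrite")
        .format("parquet")
        .option("path", parquet_path)
        .saveAsTable(glue_table)
    )
    # Look the table up by name instead of by path.
    return spark.table(glue_table).count()
```

A call might look like `land_and_count(spark, dataframe, parquet_path, "mydatabase.mytable")`, where the table name is a placeholder.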

Creating Glue Data Catalog Tables from Glue Jobs

Now, you would think that this is easy… but unfortunately it isn’t. This will be the subject of another post once I do more research.

AWS

Posts by Category : AWS

29 Oct 2020 CloudWatch Log Retention and Lambda

20 Aug 2019 AWS Lake Formation

27 Jun 2019 Creating SSM Parameters

26 Jun 2019 S3 Events and Error 403

15 Dec 2018 Backup for SQL Server RDS

19 Nov 2018 Restoring SQL Server Databases in RDS from S3

28 Aug 2018 PySpark and Glue together
