
Stream Amazon EMR on EKS logs to third-party providers like Splunk, Amazon OpenSearch Service, or other log aggregators


Spark jobs running on Amazon EMR on EKS generate logs that are very useful in identifying issues with Spark processes and also as a way to see Spark output. You can access these logs from a variety of sources. On the Amazon EMR virtual cluster console, you can access logs from the Spark History UI. You also have flexibility to push logs into an Amazon Simple Storage Service (Amazon S3) bucket or Amazon CloudWatch Logs. In each method, these logs are linked to the specific job in question. The common practice of log management in DevOps culture is to centralize logging through the forwarding of logs to an enterprise log aggregation system like Splunk or Amazon OpenSearch Service (successor to Amazon Elasticsearch Service). This lets you see all the applicable log data in a single place. You can identify key trends, anomalies, and correlated events, troubleshoot problems faster, and notify the appropriate people in a timely fashion.

EMR on EKS Spark logs are generated by Spark and can be accessed via the Kubernetes API and the kubectl CLI. Therefore, although it's possible to install log forwarding agents in the Amazon Elastic Kubernetes Service (Amazon EKS) cluster to forward all Kubernetes logs, which include Spark logs, this can become quite expensive at scale because you get information about Kubernetes that may not be necessary for Spark users. In addition, from a security standpoint, the EKS cluster logs and access to kubectl may not be available to the Spark user.

To solve this problem, this post proposes using pod templates to create a sidecar container alongside the Spark job pods. The sidecar containers are able to access the logs contained in the Spark pods and forward these logs to the log aggregator. This approach allows the logs to be managed separately from the EKS cluster and uses a small amount of resources because the sidecar container is only launched during the lifetime of the Spark job.

Implementing Fluent Bit as a sidecar container

Fluent Bit is a lightweight, highly scalable, high-speed logging and metrics processor and log forwarder. It collects event data from any source, enriches that data, and sends it to any destination. Its lightweight and efficient design coupled with its many features makes it very attractive to those working in the cloud and in containerized environments. It has been deployed extensively and trusted by many, even in large and complex environments. Fluent Bit has zero dependencies and requires only 650 KB of memory to operate, compared to Fluentd, which needs about 40 MB of memory. Therefore, it's an ideal option as a log forwarder to forward logs generated from Spark jobs.

When you submit a job to EMR on EKS, there are at least two Spark containers: the Spark driver and the Spark executor. The number of Spark executor pods depends on your job submission configuration. If you specify more than one spark.executor.instances, you get the corresponding number of Spark executor pods. What we want to do here is run Fluent Bit as sidecar containers with the Spark driver and executor pods. Diagrammatically, it looks like the following figure. The Fluent Bit sidecar container reads the indicated logs in the Spark driver and executor pods, and forwards these logs to the target log aggregator directly.

Architecture of Fluent Bit sidecar

Pod templates in EMR on EKS

A Kubernetes pod is a group of one or more containers with shared storage, network resources, and a specification for how to run the containers. Pod templates are specifications for creating pods. They're part of the desired state of the workload resources used to run the application. Pod template files can define the driver or executor pod configurations that aren't supported in standard Spark configuration. That being said, Spark is opinionated about certain pod configurations, and some values in the pod template are always overwritten by Spark. Using a pod template only allows Spark to start with a template pod rather than an empty pod during the pod-building process. Pod templates are enabled in EMR on EKS when you configure the Spark properties spark.kubernetes.driver.podTemplateFile and spark.kubernetes.executor.podTemplateFile. Spark downloads these pod templates to construct the driver and executor pods.
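
As a rough illustration only (the S3 locations below are placeholders, and a fuller job submission sketch appears later in this post), the two properties end up in the job's Spark submit parameters, pointing at pod template files staged in Amazon S3:

# Hedged sketch: placeholder pod template locations referenced by the two Spark properties
export DRIVER_POD_TEMPLATE="s3://< your bucket >/< prefix >/emr_driver_template.yaml"
export EXECUTOR_POD_TEMPLATE="s3://< your bucket >/< prefix >/emr_executor_template.yaml"
export POD_TEMPLATE_CONF="--conf spark.kubernetes.driver.podTemplateFile=${DRIVER_POD_TEMPLATE} --conf spark.kubernetes.executor.podTemplateFile=${EXECUTOR_POD_TEMPLATE}"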

Forward logs generated by Spark jobs in EMR on EKS

A log aggregating system like Amazon OpenSearch Service or Splunk should always be available to accept the logs forwarded by the Fluent Bit sidecar containers. If not, we provide the following scripts in this post to help you launch a log aggregating system like Amazon OpenSearch Service or Splunk installed on an Amazon Elastic Compute Cloud (Amazon EC2) instance.

We use several services to create and configure EMR on EKS. We use an AWS Cloud9 workspace to run all the scripts and to configure the EKS cluster. To prepare to run a job script that requires certain Python libraries absent from the generic EMR images, we use Amazon Elastic Container Registry (Amazon ECR) to store the customized EMR container image.

Create an AWS Cloud9 workspace

The first step is to launch and configure the AWS Cloud9 workspace by following the instructions in Create a Workspace in the EKS Workshop. After you create the workspace, we create AWS Identity and Access Management (IAM) resources. Create an IAM role for the workspace, attach the role to the workspace, and update the workspace IAM settings.

Prepare the AWS Cloud9 workspace

Clone the following GitHub repository and run the following script to prepare the AWS Cloud9 workspace to install and configure Amazon EKS and EMR on EKS. The shell script prepare_cloud9.sh installs all the necessary components for the AWS Cloud9 workspace to build and manage the EKS cluster. These include the kubectl command line tool, the eksctl CLI tool, and jq, and it updates the AWS Command Line Interface (AWS CLI).

$ sudo yum -y install git
$ cd ~ 
$ git clone https://github.com/aws-samples/aws-emr-eks-log-forwarding.git
$ cd aws-emr-eks-log-forwarding
$ cd emreks
$ bash prepare_cloud9.sh

All the necessary scripts and configuration to run this solution are found in the cloned GitHub repository.

Create a key pair

As part of this particular deployment, you need an EC2 key pair to create an EKS cluster. If you already have an existing EC2 key pair, you may use that key pair. Otherwise, you can create a key pair.
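
For example, you can create a key pair with the AWS CLI as follows (the key pair name and output file name here are placeholders, not values required by this post):

$ aws ec2 create-key-pair \
    --key-name emreksdemo-keypair \
    --query 'KeyMaterial' \
    --output text > emreksdemo-keypair.pem
$ chmod 400 emreksdemo-keypair.pem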

Set up Amazon EKS and EMR on EKS

After you configure the AWS Cloud9 workspace, in the same folder (emreks), run the following deployment script:

$ bash deploy_eks_cluster_bash.sh 
Deployment Script -- EMR on EKS
-----------------------------------------------

Please provide the following information before deployment:
1. Region (If your Cloud9 desktop is in the same region as your deployment, you can leave this blank)
2. Account ID (If your Cloud9 desktop is running in the same Account ID as where your deployment will be, you can leave this blank)
3. Name of the S3 bucket to be created for the EMR S3 storage location
Region: [xx-xxxx-x]: < Press enter for default or enter region > 
Account ID [xxxxxxxxxxxx]: < Press enter for default or enter account # > 
EC2 Public Key name: < Provide your key pair name here >
Default S3 bucket name for EMR on EKS (do not add s3://): < bucket name >
Bucket created: XXXXXXXXXXX ...
Deploying CloudFormation stack with the following parameters...
Region: xx-xxxx-x | Account ID: xxxxxxxxxxxx | S3 Bucket: XXXXXXXXXXX

...

EKS Cluster and Virtual EMR Cluster have been installed.

The last line indicates that the installation was successful.

Log aggregation options

There are a number of log aggregation and management tools on the market. This post suggests two of the more popular ones in the industry: Splunk and Amazon OpenSearch Service.

Option 1: Install Splunk Enterprise

To manually install Splunk on an EC2 instance, complete the following steps:

  1. Launch an EC2 instance.
  2. Install Splunk.
  3. Configure the EC2 instance security group to permit access to ports 22, 8000, and 8088.

This post, however, provides an automated way to install Splunk on an EC2 instance:

  1. Download the RPM installation file and upload it to an accessible Amazon S3 location.
  2. Upload the following YAML script into AWS CloudFormation.
  3. Provide the necessary parameters, as shown in the screenshots below.
  4. Choose Next and complete the steps to create your stack.

Splunk CloudFormation screen - 1

Splunk CloudFormation screen - 2

Splunk CloudFormation screen - 3

Alternatively, run an AWS CLI script like the following:

aws cloudformation create-stack \
--stack-name "splunk" \
--template-body file://splunk_cf.yaml \
--parameters ParameterKey=KeyName,ParameterValue="< Name of EC2 key pair >" \
  ParameterKey=InstanceType,ParameterValue="t3.medium" \
  ParameterKey=LatestAmiId,ParameterValue="/aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2" \
  ParameterKey=VPCID,ParameterValue="vpc-XXXXXXXXXXX" \
  ParameterKey=PublicSubnet0,ParameterValue="subnet-XXXXXXXXX" \
  ParameterKey=SSHLocation,ParameterValue="< CIDR range for SSH access >" \
  ParameterKey=VpcCidrRange,ParameterValue="172.20.0.0/16" \
  ParameterKey=RootVolumeSize,ParameterValue="100" \
  ParameterKey=S3BucketName,ParameterValue="< S3 bucket name >" \
  ParameterKey=S3Prefix,ParameterValue="splunk/splunk-8.2.5-77015bc7a462-linux-2.6-x86_64.rpm" \
  ParameterKey=S3DownloadLocation,ParameterValue="/tmp" \
--region < region > \
--capabilities CAPABILITY_IAM
  5. After you build the stack, navigate to the stack's Outputs tab on the AWS CloudFormation console and note the internal and external DNS for the Splunk instance.

You use these later to configure the Splunk instance and log forwarding.
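
If you prefer the AWS CLI, you can retrieve the same outputs with a query like the following (this assumes the stack name splunk used in the earlier create-stack example):

$ aws cloudformation describe-stacks \
    --stack-name splunk \
    --query 'Stacks[0].Outputs' \
    --region < region >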

Splunk CloudFormation output screen

  6. To configure Splunk, go to the Resources tab for the CloudFormation stack and locate the physical ID of EC2Instance.
  7. Choose that link to go to the specific EC2 instance.
  8. Select the instance and choose Connect.

Connect to Splunk Instance

  9. On the Session Manager tab, choose Connect.

Connect to Instance

You're redirected to the instance's shell.

  10. Install and configure Splunk as follows:
$ sudo /opt/splunk/bin/splunk start --accept-license
…
Please enter an administrator username: admin
Password must contain at least:
   * 8 total printable ASCII character(s).
Please enter a new password: 
Please confirm new password:
…
Done
                                                           [  OK  ]

Waiting for web server at http://127.0.0.1:8000 to be available......... Done
The Splunk web interface is at http://ip-xx-xxx-xxx-x.us-east-2.compute.internal:8000
  11. Access the Splunk website using the SplunkPublicDns value from the stack outputs (for example, http://ec2-xx-xxx-xxx-x.us-east-2.compute.amazonaws.com:8000). Note the port number 8000.
  12. Log in with the user name and password you provided.

Splunk Login

Configure the HTTP Event Collector

To configure Splunk to receive logs from Fluent Bit, configure the HTTP Event Collector data input:

  1. Go to Settings and choose Data inputs.
  2. Choose HTTP Event Collector.
  3. Choose Global Settings.
  4. Select Enabled, keep port number 8088, then choose Save.
  5. Choose New Token.
  6. For Name, enter a name (for example, emreksdemo).
  7. Choose Next.
  8. For Available item(s) for Indexes, add at least the main index.
  9. Choose Review and then Submit.
  10. In the list of HTTP Event Collector tokens, copy the token value for emreksdemo.

You use it when configuring the Fluent Bit output.

splunk-http-collector-list
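
Optionally, before configuring Fluent Bit, you can verify the collector and token with a test event. The following is a hedged example; substitute your Splunk instance DNS name and the token you copied (the -k flag skips certificate verification, matching the TLS.Verify Off setting used later):

$ curl -k "https://< SplunkPublicDns >:8088/services/collector/event" \
    -H "Authorization: Splunk < your HEC token >" \
    -d '{"event": "fluent bit connectivity test", "sourcetype": "manual"}'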

Option 2: Set up Amazon OpenSearch Service

Your other log aggregation option is to use Amazon OpenSearch Service.

Provision an OpenSearch Service domain

Provisioning an OpenSearch Service domain is very straightforward. In this post, we provide a simple script and configuration to provision a basic domain. To do it yourself, refer to Creating and managing Amazon OpenSearch Service domains.

Before you start, get the ARN of the IAM role that you use to run the Spark jobs. If you created the EKS cluster with the provided script, go to the CloudFormation stack emr-eks-iam-stack. On the Outputs tab, locate the IAMRoleArn output and copy this ARN. We also modify the IAM role later on, after we create the OpenSearch Service domain.

iam_role_emr_eks_job

If you're using the provided opensearch.sh installer, modify the file before you run it.

From the root folder of the GitHub repository, cd to opensearch and modify opensearch.sh (you can also use your preferred editor):

[../aws-emr-eks-log-forwarding] $ cd opensearch
[../aws-emr-eks-log-forwarding/opensearch] $ vi opensearch.sh

Configure opensearch.sh to fit your environment, for example:

# name of our Amazon OpenSearch cluster
export ES_DOMAIN_NAME="emreksdemo"

# Elasticsearch version
export ES_VERSION="OpenSearch_1.0"

# Instance Type
export INSTANCE_TYPE="t3.small.search"

# OpenSearch Dashboards admin user
export ES_DOMAIN_USER="emreks"

# OpenSearch Dashboards admin password
export ES_DOMAIN_PASSWORD='< ADD YOUR PASSWORD >'

# Region
export REGION='us-east-1'

Run the script:

[../aws-emr-eks-log-forwarding/opensearch] $ bash opensearch.sh

Configure your OpenSearch Service domain

After you set up your OpenSearch Service domain and it's active, make the following configuration changes to allow logs to be ingested into Amazon OpenSearch Service:

  1. On the Amazon OpenSearch Service console, on the Domains page, choose your domain.

Opensearch Domain Console

  2. On the Security configuration tab, choose Edit.

Opensearch Security Configuration

  3. For Access policy, select Only use fine-grained access control.
  4. Choose Save changes.

The access policy should look like the following code:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:xx-xxxx-x:xxxxxxxxxxxx:domain/emreksdemo/*"
    }
  ]
}
  5. When the domain is active again, copy the domain ARN.

We use it to configure the Amazon EMR job IAM role we mentioned earlier.

  6. Choose the link for OpenSearch Dashboards URL to access OpenSearch Dashboards.

Opensearch Main Console

  7. In OpenSearch Dashboards, use the user name and password that you configured earlier in the opensearch.sh file.
  8. Choose the options icon and choose Security under OpenSearch Plugins.

opensearch menu

  9. Choose Roles.
  10. Choose Create role.

opensearch-create-role-button

  11. Enter the new role's name, cluster permissions, and index permissions. For this post, name the role fluentbit_role and give the role cluster permissions to the following:
    1. indices:admin/create
    2. indices:admin/template/get
    3. indices:admin/template/put
    4. cluster:admin/ingest/pipeline/get
    5. cluster:admin/ingest/pipeline/put
    6. indices:data/write/bulk
    7. indices:data/write/bulk*
    8. create_index

opensearch-create-role-button

  12. In the Index permissions section, give write permission to the index fluent-*.
  13. On the Mapped users tab, choose Manage mapping.
  14. For Backend roles, enter the Amazon EMR job execution IAM role ARN to be mapped to the fluentbit_role role.
  15. Choose Map.

opensearch-map-backend

  16. To complete the security configuration, go to the IAM console and add the following inline policy to the EMR on EKS IAM role entered in the backend role. Replace the resource ARN with the ARN of your OpenSearch Service domain.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "es:ESHttp*"
            ],
            "Resource": "arn:aws:es:us-east-2:XXXXXXXXXXXX:domain/emreksdemo"
        }
    ]
}

The configuration of Amazon OpenSearch Service is complete and ready for ingestion of logs from the Fluent Bit sidecar container.
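
If you prefer the AWS CLI over the IAM console for that last step, you can attach the inline policy with a command like the following. The role name, policy name, and file name are placeholders; save the preceding JSON to the file first:

$ aws iam put-role-policy \
    --role-name < EMR on EKS job execution role name > \
    --policy-name emreksdemo-opensearch-access \
    --policy-document file://opensearch_access_policy.json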

Configure the Fluent Bit sidecar container

We need to write two configuration files to configure a Fluent Bit sidecar container. The first is the Fluent Bit configuration itself, and the second is the Fluent Bit sidecar subprocess configuration that makes sure the sidecar operation ends when the main Spark job ends. The suggested configuration provided in this post is for Splunk and Amazon OpenSearch Service. However, you can configure Fluent Bit with other third-party log aggregators. For more information about configuring outputs, refer to Outputs.

Fluent Bit ConfigMap

The following sample ConfigMap is from the GitHub repo:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-sidecar-config
  namespace: sparkns
  labels:
    app.kubernetes.io/name: fluent-bit
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input-application.conf
    @INCLUDE input-event-logs.conf
    @INCLUDE output-splunk.conf
    @INCLUDE output-opensearch.conf

  input-application.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/spark/user/*/*
        Path_Key          filename
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   5M
        Skip_Long_Lines   On
        Skip_Empty_Lines  On

  input-event-logs.conf: |
    [INPUT]
        Name              tail
        Path              /var/log/spark/apps/*
        Path_Key          filename
        Buffer_Chunk_Size 1M
        Buffer_Max_Size   5M
        Skip_Long_Lines   On
        Skip_Empty_Lines  On

  output-splunk.conf: |
    [OUTPUT]
        Name            splunk
        Match           *
        Host            < INTERNAL DNS of Splunk EC2 instance >
        Port            8088
        TLS             On
        TLS.Verify      Off
        Splunk_Token    < Token as provided by the HTTP Event Collector in Splunk >

  output-opensearch.conf: |
    [OUTPUT]
        Name            es
        Match           *
        Host            < HOST NAME of the OpenSearch domain | No HTTP protocol >
        Port            443
        TLS             On
        AWS_Auth        On
        AWS_Region      < Region >
        Retry_Limit     6

In your AWS Cloud9 workspace, modify the ConfigMap accordingly. Provide the values for the placeholder text by running the following commands to enter the VI editor. If preferred, you can use PICO or a different editor:

[../aws-emr-eks-log-forwarding] $  cd kube/configmaps
[../aws-emr-eks-log-forwarding/kube/configmaps] $ vi emr_configmap.yaml

# Modify the emr_configmap.yaml as above
# Save the file once it's complete

Complete either the Splunk output configuration or the Amazon OpenSearch Service output configuration.

Next, run the following commands to add the two Fluent Bit sidecar and subprocess ConfigMaps:

[../aws-emr-eks-log-forwarding/kube/configmaps] $ kubectl apply -f emr_configmap.yaml
[../aws-emr-eks-log-forwarding/kube/configmaps] $ kubectl apply -f emr_entrypoint_configmap.yaml

You don't need to modify the second ConfigMap because it's the subprocess script that runs inside the Fluent Bit sidecar container. To verify that the ConfigMaps were installed, run the following command:

$ kubectl get cm -n sparkns
NAME                         DATA   AGE
fluent-bit-sidecar-config    6      15s
fluent-bit-sidecar-wrapper   2      15s

Set up a customized EMR container image

The sample PySpark script requires the Boto3 package, which isn't available in the standard EMR container images. If you want to run your own script and it doesn't require a customized EMR container image, you may skip this step.

Run the following script:

[../aws-emr-eks-log-forwarding] $ cd ecr
[../aws-emr-eks-log-forwarding/ecr] $ bash create_custom_image.sh < region > < EMR container image account number >

The EMR container image account number can be obtained from How to select a base image URI. This documentation also provides the appropriate ECR registry account number. For example, the registry account number for us-east-1 is 755674844232.
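
For reference, the customization the script performs is roughly equivalent to the following sketch (simplified and hedged; the actual create_custom_image.sh in the repository may differ in its details). It authenticates to the EMR base image registry and to your own ECR registry, builds an image that adds the Boto3 package, and pushes it to a repository named emr-6.5.0-custom:

# Simplified sketch for us-east-1; adjust the region and base registry account as needed
REGION=us-east-1
BASE_ACCOUNT=755674844232
MY_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# Authenticate to the EMR on EKS base image registry and to your own ECR registry
aws ecr get-login-password --region $REGION | \
  docker login --username AWS --password-stdin $BASE_ACCOUNT.dkr.ecr.$REGION.amazonaws.com
aws ecr get-login-password --region $REGION | \
  docker login --username AWS --password-stdin $MY_ACCOUNT.dkr.ecr.$REGION.amazonaws.com

# Build an image that layers Boto3 on top of the EMR 6.5.0 Spark base image
cat > Dockerfile <<'EOF'
FROM 755674844232.dkr.ecr.us-east-1.amazonaws.com/spark/emr-6.5.0:latest
USER root
RUN pip3 install boto3
USER hadoop:hadoop
EOF

aws ecr create-repository --repository-name emr-6.5.0-custom --region $REGION
docker build -t $MY_ACCOUNT.dkr.ecr.$REGION.amazonaws.com/emr-6.5.0-custom:latest .
docker push $MY_ACCOUNT.dkr.ecr.$REGION.amazonaws.com/emr-6.5.0-custom:latest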

To verify the repository and image, run the following commands:

$ aws ecr describe-repositories --region < region > | grep emr-6.5.0-custom
            "repositoryArn": "arn:aws:ecr:xx-xxxx-x:xxxxxxxxxxxx:repository/emr-6.5.0-custom",
            "repositoryName": "emr-6.5.0-custom",
            "repositoryUri": " xxxxxxxxxxxx.dkr.ecr.xx-xxxx-x.amazonaws.com/emr-6.5.0-custom",

$ aws ecr describe-images --region < region > --repository-name emr-6.5.0-custom | jq .imageDetails[0].imageTags
[
  "latest"
]

Prepare pod templates for Spark jobs

Upload the two Spark driver and Spark executor pod templates to an S3 bucket and prefix. The two pod templates can be found in the GitHub repository:

  • emr_driver_template.yaml – Spark driver pod template
  • emr_executor_template.yaml – Spark executor pod template

The pod templates provided here shouldn't be modified.
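
A hedged example of staging them with the AWS CLI follows; the bucket and prefix are placeholders and must match the locations referenced when you submit the job:

$ aws s3 cp emr_driver_template.yaml s3://< your bucket >/< prefix >/emr_driver_template.yaml
$ aws s3 cp emr_executor_template.yaml s3://< your bucket >/< prefix >/emr_executor_template.yaml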

Submitting a Spark job with a Fluent Bit sidecar container

This Spark job example uses the bostonproperty.py script. To use this script, upload it to an accessible S3 bucket and prefix and complete the preceding steps to use an EMR customized container image. You also need to upload the CSV file from the GitHub repo, which you need to download and unzip. Upload the unzipped file to the following location: s3://<your chosen bucket>/<first level folder>/data/boston-property-assessment-2021.csv.

The following commands assume that you launched your EKS cluster and virtual EMR cluster with the parameters indicated in the GitHub repo. A hedged sketch of the corresponding start-job-run call follows the table of variables below.

Variable                Where to find the information or the value required
EMR_EKS_CLUSTER_ID      Amazon EMR console virtual cluster page
EMR_EKS_EXECUTION_ARN   IAM role ARN
EMR_RELEASE             emr-6.5.0-latest
S3_BUCKET               The bucket you create in Amazon S3
S3_FOLDER               The preferred prefix you want to use in Amazon S3
CONTAINER_IMAGE         The URI in Amazon ECR where your container image is
SCRIPT_NAME             emreksdemo-script or a name you prefer
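
The following is a hedged sketch of the underlying start-job-run call using those variables. The entry point path, executor count, and other submit parameters are illustrative placeholders; the provided run_emr_script.sh may construct the request differently:

aws emr-containers start-job-run \
  --virtual-cluster-id "${EMR_EKS_CLUSTER_ID}" \
  --name "${SCRIPT_NAME}" \
  --execution-role-arn "${EMR_EKS_EXECUTION_ARN}" \
  --release-label "${EMR_RELEASE}" \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://<S3_BUCKET>/<S3_FOLDER>/scripts/bostonproperty.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=1 --conf spark.kubernetes.container.image=<CONTAINER_IMAGE> --conf spark.kubernetes.driver.podTemplateFile=s3://<S3_BUCKET>/<S3_FOLDER>/emr_driver_template.yaml --conf spark.kubernetes.executor.podTemplateFile=s3://<S3_BUCKET>/<S3_FOLDER>/emr_executor_template.yaml"
    }
  }'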

Alternatively, use the provided script to run the job. Change the directory to the scripts folder in emreks and run the script as follows:

[../aws-emr-eks-log-forwarding] cd emreks/scripts
[../aws-emr-eks-log-forwarding/emreks/scripts] bash run_emr_script.sh < S3 bucket name > < ECR container image > < script path >

Example: bash run_emr_script.sh emreksdemo-123456 12345678990.dkr.ecr.us-east-2.amazonaws.com/emr-6.5.0-custom s3://emreksdemo-123456/scripts/scriptname.py

After you submit the Spark job successfully, you get a JSON response like the following:

{
    "id": "0000000305e814v0bpt",
    "identify": "emreksdemo-job",
    "arn": "arn:aws:emr-containers:xx-xxxx-x:XXXXXXXXXXX:/virtualclusters/upobc00wgff5XXXXXXXXXXX/jobruns/0000000305e814v0bpt",
    "virtualClusterId": "upobc00wgff5XXXXXXXXXXX"
}

What happens when you submit a Spark job with a sidecar container

After you submit a Spark job, you can see what is happening by viewing the pods that are generated and the corresponding logs. First, using kubectl, get a list of the pods generated in the namespace where the EMR virtual cluster runs. In this case, it's sparkns. The first pod in the following code is the job controller for this particular Spark job. The second pod is the Spark executor; there can be more than one pod depending on how many executor instances are requested in the Spark job settings (we requested one here). The third pod is the Spark driver pod.

$ kubectl get pods -n sparkns
NAME                                        READY   STATUS    RESTARTS   AGE
0000000305e814v0bpt-hvwjs                   3/3     Running   0          25s
emreksdemo-script-1247bf80ae40b089-exec-1   0/3     Pending   0          0s
spark-0000000305e814v0bpt-driver            3/3     Running   0          11s

To view what happens in the sidecar container, follow the logs in the Spark driver pod and refer to the sidecar. The sidecar container launches with the Spark pods and persists until the file /var/log/fluentd/main-container-terminated is no longer available. For more information about how Amazon EMR controls the pod lifecycle, refer to Using pod templates. The subprocess script ties the sidecar container to this same lifecycle and deletes itself when the EMR-managed pod lifecycle process completes.
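
The wrapper behaves roughly like the following sketch (simplified for illustration; the actual subprocess script shipped in the ConfigMap differs in its details):

# Start Fluent Bit in the background using the mounted configuration
/fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf &
FB_PID=$!

# Wait for the heartbeat file to appear, then poll until it disappears,
# which signals that the main Spark container has terminated
while [ ! -f /var/log/fluentd/main-container-terminated ]; do sleep 5; done
while [ -f /var/log/fluentd/main-container-terminated ]; do sleep 5; done

# Give Fluent Bit time to flush any buffered logs, then stop it so the pod can exit
sleep 15
kill "$FB_PID"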

$ kubectl logs spark-0000000305e814v0bpt-driver -n sparkns  -c custom-side-car-container --follow=true

Waiting for file /var/log/fluentd/main-container-terminated to appear...
AWS for Fluent Bit Container Image Version 2.24.0
Start wait: 1652190909
Elapsed Wait: 0
Not found count: 0
Waiting...
Fluent Bit v1.9.3
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2022/05/10 13:55:09] [ info] [fluent bit] version=1.9.3, commit=9eb4996b7d, pid=11
[2022/05/10 13:55:09] [ info] [storage] version=1.2.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2022/05/10 13:55:09] [ info] [cmetrics] version=0.3.1
[2022/05/10 13:55:09] [ info] [output:splunk:splunk.0] worker #0 started
[2022/05/10 13:55:09] [ info] [output:splunk:splunk.0] worker #1 started
[2022/05/10 13:55:09] [ info] [output:es:es.1] worker #0 started
[2022/05/10 13:55:09] [ info] [output:es:es.1] worker #1 started
[2022/05/10 13:55:09] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2022/05/10 13:55:09] [ info] [sp] stream processor started
Waiting for file /var/log/fluentd/main-container-terminated to appear...
Last heartbeat: 1652190914
Elapsed Time since after heartbeat: 0
Found count: 0
list files:
-rw-r--r-- 1 saslauth 65534 0 May 10 13:55 /var/log/fluentd/main-container-terminated
Last heartbeat: 1652190918

…

[2022/05/10 13:56:09] [ info] [input:tail:tail.0] inotify_fs_add(): inode=58834691 watch_fd=6 name=/var/log/spark/user/spark-0000000305e814v0bpt-driver/stdout-s3-container-log-in-tail.pos
[2022/05/10 13:56:09] [ info] [input:tail:tail.1] inotify_fs_add(): inode=54644346 watch_fd=1 name=/var/log/spark/apps/spark-0000000305e814v0bpt
Outside of loop, main-container-terminated file no longer exists
ls: cannot access /var/log/fluentd/main-container-terminated: No such file or directory
The file /var/log/fluentd/main-container-terminated doesn't exist anymore;
TERMINATED PROCESS
Fluent-Bit pid: 11
Killing process after sleeping for 15 seconds
root        11     8  0 13:55 ?        00:00:00 /fluent-bit/bin/fluent-bit -e /fluent-bit/firehose.so -e /fluent-bit/cloudwatch.so -e /fluent-bit/kinesis.so -c /fluent-bit/etc/fluent-bit.conf
root       114     7  0 13:56 ?        00:00:00 grep fluent
Killing process 11
[2022/05/10 13:56:24] [engine] caught signal (SIGTERM)
[2022/05/10 13:56:24] [ info] [input] pausing tail.0
[2022/05/10 13:56:24] [ info] [input] pausing tail.1
[2022/05/10 13:56:24] [ warn] [engine] service will shutdown in max 5 seconds
[2022/05/10 13:56:25] [ info] [engine] service has stopped (0 pending tasks)
[2022/05/10 13:56:25] [ info] [input:tail:tail.1] inotify_fs_remove(): inode=54644346 watch_fd=1
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=60917120 watch_fd=1
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=60917121 watch_fd=2
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834690 watch_fd=3
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834692 watch_fd=4
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834689 watch_fd=5
[2022/05/10 13:56:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=58834691 watch_fd=6
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #0 stopping...
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #0 stopped
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #1 stopping...
[2022/05/10 13:56:25] [ info] [output:splunk:splunk.0] thread worker #1 stopped
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #0 stopping...
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #0 stopped
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #1 stopping...
[2022/05/10 13:56:25] [ info] [output:es:es.1] thread worker #1 stopped

View the forwarded logs in Splunk or Amazon OpenSearch Service

To view the forwarded logs, do a search in Splunk or on the Amazon OpenSearch Service console. If you're using a shared log aggregator, you may have to filter the results. In this configuration, the logs tailed by Fluent Bit are in /var/log/spark/*. The following screenshots show the logs generated specifically by the Kubernetes Spark driver stdout that were forwarded to the log aggregators. You can compare the results with the logs provided using kubectl:

kubectl logs < Spark Driver Pod > -n < namespace > -c spark-kubernetes-driver --follow=true

…
root
 |-- PID: string (nullable = true)
 |-- CM_ID: string (nullable = true)
 |-- GIS_ID: string (nullable = true)
 |-- ST_NUM: string (nullable = true)
 |-- ST_NAME: string (nullable = true)
 |-- UNIT_NUM: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- ZIPCODE: string (nullable = true)
 |-- BLDG_SEQ: string (nullable = true)
 |-- NUM_BLDGS: string (nullable = true)
 |-- LUC: string (nullable = true)
…

|02108|RETAIL CONDO           |361450.0            |63800.0        |5977500.0      |
|02108|RETAIL STORE DETACH    |2295050.0           |988200.0       |3601900.0      |
|02108|SCHOOL                 |1.20858E7           |1.20858E7      |1.20858E7      |
|02108|SINGLE FAM DWELLING    |5267156.561085973   |1153400.0      |1.57334E7      |
+-----+-----------------------+--------------------+---------------+---------------+
only showing top 50 rows

The following screenshot shows the Splunk logs.

splunk-result-driver-stdout

The following screenshots show the Amazon OpenSearch Service logs.

opensearch-result-driver-stdout

Optional: Include a buffer between Fluent Bit and the log aggregators

If you expect to generate a lot of logs because of highly concurrent Spark jobs creating multiple individual connections that may overwhelm your Amazon OpenSearch Service or Splunk log aggregation clusters, consider employing a buffer between the Fluent Bit sidecars and your log aggregator. One option is to use Amazon Kinesis Data Firehose as the buffering service.

Kinesis Data Firehose has built-in delivery to both Amazon OpenSearch Service and Splunk. If using Amazon OpenSearch Service, refer to Loading streaming data from Amazon Kinesis Data Firehose. If using Splunk, refer to Configure Amazon Kinesis Firehose to send data to the Splunk platform and Choose Splunk for Your Destination.

To configure Fluent Bit for Kinesis Data Firehose, add the following to your ConfigMap output. Refer to the GitHub ConfigMap example and add the @INCLUDE under the [SERVICE] section:

     @INCLUDE output-kinesisfirehose.conf
…

  output-kinesisfirehose.conf: |
    [OUTPUT]
        Name            kinesis_firehose
        Match           *
        region          < region >
        delivery_stream < Kinesis Data Firehose stream name >

Optional: Use data streams for Amazon OpenSearch Service

If you're in a scenario where the number of documents grows rapidly and you don't need to update older documents, you need to manage the OpenSearch Service cluster. This involves steps like creating a rollover index alias, defining a write index, and defining common mappings and settings for the backing indexes. Consider using data streams to simplify this process and enforce a setup that best suits your time series data. For instructions on implementing data streams, refer to Data streams.
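
For illustration, with fine-grained access control enabled, a data stream for the fluent-* pattern could be set up with an index template like the following. This is a hedged example; the endpoint, credentials, template name, and index pattern are placeholders and must match your Fluent Bit output configuration:

$ curl -u emreks:'< your password >' -XPUT \
    "https://< your OpenSearch domain endpoint >/_index_template/fluent-logs" \
    -H 'Content-Type: application/json' \
    -d '{"index_patterns": ["fluent-*"], "data_stream": {}, "priority": 100}'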

Clean up

To avoid incurring future charges, delete the resources by deleting the CloudFormation stacks that were created with this script. This removes the EKS cluster. However, before you do that, remove the EMR virtual cluster first by running the delete-virtual-cluster command. Then delete all the CloudFormation stacks generated by the deployment script.
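
A hedged example of the cleanup commands follows; the virtual cluster ID and stack names are placeholders:

$ aws emr-containers delete-virtual-cluster --id < virtual cluster ID >
$ aws cloudformation delete-stack --stack-name < stack name created by the deployment script >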

If you launched an OpenSearch Service domain, you can delete the domain from the OpenSearch Service console. If you used the script to launch a Splunk instance, you can go to the CloudFormation stack that launched the Splunk instance and delete that stack. This removes the Splunk instance and associated resources.

You can also use the cleanup scripts provided in the GitHub repository to remove these resources.

Conclusion

EMR on EKS facilitates running Spark jobs on Kubernetes to achieve very fast and cost-efficient Spark operations. This is made possible by scheduling transient pods that are launched and then deleted once the jobs are complete. To log all these operations in the same lifecycle as the Spark jobs, this post provides a solution using pod templates and Fluent Bit that is lightweight and powerful. This approach offers log forwarding that is decoupled at the Spark application level rather than at the Kubernetes cluster level. It also avoids routing through intermediaries like CloudWatch, reducing cost and complexity. In this way, you can address security concerns and ease of management for DevOps and system administration while providing Spark users with insights into their Spark jobs in a cost-efficient and functional way.

If you have questions or suggestions, please leave a comment.


About the Author

Matthew Tan is a Senior Analytics Solutions Architect at Amazon Web Services and provides guidance to customers developing solutions with AWS Analytics services on their analytics workloads.
