Enable AWS EMR Cluster Logging to S3

Operational excellence

Risk level: Low (generally tolerable level of risk)

Ensure that all Amazon EMR cluster log files are periodically archived and uploaded to Amazon S3, so that the logging data is retained for historical purposes and can be used to track and analyze EMR cluster behavior over a long period of time.

By default, all EMR log files are automatically deleted from the clusters after the retention period ends. With this feature enabled, Elastic MapReduce uploads the log files from the cluster master instance(s) to Amazon S3, so the logging data (step logs, Hadoop logs, instance state logs, etc.) can be used later for troubleshooting or compliance purposes. Once active, the EMR service archives and sends the log files to Amazon S3 at 5-minute intervals.
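As a rough illustration of where the archived files end up, the sketch below builds the S3 prefix under which a cluster's logs would be expected. It assumes that EMR stores each cluster's logs in a subfolder named after the cluster ID beneath the Log URI, and that the legacy s3n:// scheme has to be rewritten to s3:// before the aws s3 commands accept it. The function name emr_log_prefix is hypothetical, and the bucket and cluster ID are the sample values used on this page.

```shell
#!/bin/sh
# Hypothetical helper: build the S3 prefix where a cluster's logs are expected.
# Assumption: EMR writes logs to <Log URI>/<cluster ID>/.
emr_log_prefix() {
    log_uri="$1"
    cluster_id="$2"
    # Rewrite the legacy s3n:// (or s3a://) scheme to s3:// for the aws s3 commands.
    case "$log_uri" in
        s3n://*) log_uri="s3://${log_uri#s3n://}" ;;
        s3a://*) log_uri="s3://${log_uri#s3a://}" ;;
    esac
    # Strip a trailing slash so that exactly one separator is emitted.
    log_uri="${log_uri%/}"
    printf '%s/%s/\n' "$log_uri" "$cluster_id"
}

# Example (needs AWS credentials, so it is left commented out):
# aws s3 ls "$(emr_log_prefix 's3n://aws-logs-123456789012-us-east-1/elasticmapreduce/' 'j-3DAZQCVIJ5BXE')"
```

Listing that prefix a few minutes after a cluster starts is one way to confirm that log files are actually arriving in the bucket.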

Audit

To determine if your Amazon EMR clusters capture log data to S3, perform the following:

Using AWS Console

01 Log in to the AWS Management Console.

02 Navigate to the EMR dashboard at https://console.aws.amazon.com/elasticmapreduce/.

03 In the left navigation panel, under Amazon EMR, click Cluster list to access your AWS EMR clusters page.

04 Select the EMR cluster that you want to examine then click on the View details button from the dashboard top menu.

05 On the selected cluster configuration page, within the Configuration Details section, verify the path to the S3 location (e.g. s3n://aws-logs-123456789012-us-east-1/elasticmapreduce/) where the cluster log files are copied, listed as the value of the Log URI attribute. If the Log URI attribute does not have any value, i.e. the field is empty, the selected AWS Elastic MapReduce cluster does not capture log data to Amazon S3 for later use.

06 Repeat steps no. 4 and 5 to verify other AWS EMR clusters provisioned in the current region.

07 Change the AWS region from the navigation bar and repeat the audit process for other regions.

Using AWS CLI

01 Run the list-clusters command (OSX/Linux/UNIX) using custom query filters to list the identifiers (IDs) of all the active Amazon EMR clusters available in the selected region:

aws emr list-clusters \
    --region us-east-1 \
    --active \
    --output table \
    --query 'Clusters[*].Id'

02 The command output should return a table with the requested cluster IDs:

---------------------
|   ListClusters    |
+-------------------+
|  j-3DAZQCVIJ5BXE  |
|  j-3U51U2E1F3GPX  |
+-------------------+

03 Run the describe-cluster command (OSX/Linux/UNIX) using the ID of the cluster that you want to examine and custom query filters to return the log file storage location URI (e.g. s3n://aws-logs-123456789012-us-east-1/elasticmapreduce/) used by the selected Amazon EMR cluster:

aws emr describe-cluster \
    --region us-east-1 \
    --cluster-id j-3DAZQCVIJ5BXE \
    --query 'Cluster.LogUri'

04 The command output should return the S3 location (URI) used by the cluster logging system:

null

If the value returned by the command output is null, the selected AWS Elastic MapReduce cluster does not archive the log files to Amazon S3 for audit purposes.

05 Repeat steps no. 3 and 4 for each Amazon EMR cluster available in the current region.

06 Change the AWS region by updating the --region command parameter value and repeat steps no. 1 - 5 to perform the entire audit process for other regions.
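The CLI audit steps above can be sketched as a single loop. This is a minimal sketch, not a definitive implementation: the function names log_uri_missing and audit_emr_logging are hypothetical, and it assumes that the AWS CLI prints a JSON null as None when --output text is used (null and the empty string are handled too, to be safe).

```shell
#!/bin/sh
# Succeeds (returns 0) when a LogUri value means that logging to S3 is not set.
# Assumption: with --output text the AWS CLI prints a JSON null as "None".
log_uri_missing() {
    case "$1" in
        ""|None|null) return 0 ;;
        *) return 1 ;;
    esac
}

# Print "<region><TAB><cluster ID>" for every active EMR cluster in the
# given region that has no Log URI configured.
audit_emr_logging() {
    region="$1"
    for cluster_id in $(aws emr list-clusters --region "$region" --active \
            --query 'Clusters[*].Id' --output text); do
        log_uri=$(aws emr describe-cluster --region "$region" \
            --cluster-id "$cluster_id" --query 'Cluster.LogUri' --output text)
        if log_uri_missing "$log_uri"; then
            printf '%s\t%s\n' "$region" "$cluster_id"
        fi
    done
}

# Example: audit a hand-picked list of regions (adjust as needed).
# for r in us-east-1 us-west-2; do audit_emr_logging "$r"; done
```

Each line of output identifies a cluster that needs the remediation described in the next section.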

Remediation / Resolution

To enable Amazon EMR cluster logging to S3, you need to clone the required cluster and change its logging configuration by performing the following actions:

Using AWS Console

01 Log in to the AWS Management Console.

02 Navigate to the EMR dashboard at https://console.aws.amazon.com/elasticmapreduce/.

03 In the navigation panel, under Amazon EMR, click Cluster list to access your AWS EMR clusters page.

04 Select the EMR cluster that you want to reconfigure (see Audit section part I to identify the right resource), then click the Clone button from the dashboard top menu.

05 Inside the Cloning <your-cluster-ID> dialog box, choose Yes to include the steps from the original cluster in the cloned cluster or No to clone the original cluster's configuration without including any of the existing steps. Click Clone to start the cloning process.

06 On the Create Cluster page, select Step 3: General Cluster Settings from the left navigation panel to access the cloned cluster general settings.

07 On the General Options panel, under Cluster Name, click on the Logging checkbox to enable the feature. Once enabled, the EMR dashboard will display the S3 default folder path where the cluster log files will be uploaded automatically.

08 Click the Next button without changing any other configuration options.

09 On the Security Options panel, click Create Cluster to create your new (cloned) AWS EMR cluster.

10 Once you have moved the existing cluster data and verified that your new EMR cluster works as expected with the new configuration, shut down the original cluster (i.e. the one without logging to S3 enabled) to stop incurring charges for the resource. To terminate the old EMR cluster, perform the following:

  1. Go back to the navigation panel and under Amazon EMR, choose Cluster list.
  2. Select the AWS EMR cluster that you want to shut down.
  3. Click on the Terminate button from the dashboard top menu.
  4. In the Terminate clusters confirmation box, review the original cluster details then click Terminate.

11 Repeat steps no. 4 - 10 to enable cluster logging to S3 for other Amazon EMR clusters provisioned in the current region.

12 Change the AWS region from the navigation bar and repeat the entire remediation process for other regions.

Using AWS CLI

01 Get the configuration details of the existing (original) EMR cluster, which are required for the next step. Run the describe-cluster command (OSX/Linux/UNIX) using the ID of the cluster that you want to re-create (see Audit section part II to identify the right resource) to describe all its configuration details:

aws emr describe-cluster \
    --region us-east-1 \
    --cluster-id j-3DAZQCVIJ5BXE

02 The command output should return the running EMR cluster configuration metadata:

{
   "Cluster": {
     "Name": "BigDataEMRCluster",
     "ServiceRole": "EMR_DefaultRole",
     "Tags": [],
     "TerminationProtected": false,
     "ReleaseLabel": "emr-5.3.1",
     "NormalizedInstanceHours": 2,

     ...

     "ScaleDownBehavior": "TERMINATE_AT_INSTANCE_HOUR",
     "MasterPublicDnsName": "ec2-182-75-146-89.compute-1.amazonaws.com",
     "VisibleToAllUsers": true,
     "BootstrapActions": [],
     "AutoTerminate": false,
     "Id": "j-3DAZQCVIJ5BXE",
     "Configurations": []
   }
}

03 Run the create-cluster command (OSX/Linux/UNIX) using the configuration details returned at the previous step as parameter values to re-create the running EMR cluster with logging to Amazon S3 enabled. The following command example creates an AWS Elastic MapReduce cluster, named NewBigDataEMRCluster, that captures log data to Amazon S3 at s3n://aws-logs-123456789012-us-east-1/emr-cluster-logs/:

aws emr create-cluster \
    --region us-east-1 \
    --name NewBigDataEMRCluster \
    --log-uri s3n://aws-logs-123456789012-us-east-1/emr-cluster-logs/ \
    --release-label emr-5.3.1 \
    --visible-to-all-users \
    --service-role EMR_DefaultRole \
    --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=1,InstanceType=m3.xlarge \
    --ec2-attributes KeyName=EMRKey,InstanceProfile=EMR_EC2_DefaultRole,EmrManagedMasterSecurityGroup=sg-cabf68e5,EmrManagedSlaveSecurityGroup=sg-eb9f63ca,AvailabilityZone=us-east-1b \
    --no-auto-terminate \
    --no-termination-protected

04 The command output should return the new EMR cluster ID:

{
    "ClusterId": "j-2GUZBF6NDX5I3"
}

05 Once the original cluster data has been migrated and you have verified that your new EMR cluster works as expected with logging to S3 enabled, terminate the original cluster to stop incurring charges for it. To shut down the old EMR cluster, run the terminate-clusters command (OSX/Linux/UNIX) using its ID as identifier (the command does not produce any output):

aws emr terminate-clusters \
    --region us-east-1 \
    --cluster-ids j-3DAZQCVIJ5BXE

06 Repeat steps no. 1 – 5 for each Amazon EMR cluster that requires logging to AWS S3, available in the current region.

07 Change the AWS region by updating the --region command parameter value and repeat the entire remediation process for other regions.
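The final verification and termination steps above could be guarded with a small script. The following is a sketch under stated assumptions: the function name replace_cluster_if_logging is hypothetical, it relies on the aws emr wait cluster-running waiter, and it checks only that the new cluster is running and reports a Log URI. Migrating data and validating workloads still have to be done separately before the old cluster is removed.

```shell
#!/bin/sh
# Hypothetical helper: terminate the old cluster only if the new one is
# running and reports a Log URI.
# Arguments: region, new cluster ID, old cluster ID.
replace_cluster_if_logging() {
    region="$1"; new_id="$2"; old_id="$3"

    # Block until the new cluster reaches a running state (AWS CLI waiter).
    aws emr wait cluster-running --region "$region" --cluster-id "$new_id" || return 1

    log_uri=$(aws emr describe-cluster --region "$region" \
        --cluster-id "$new_id" --query 'Cluster.LogUri' --output text)

    # Assumption: with --output text a missing Log URI is printed as "None".
    case "$log_uri" in
        ""|None|null)
            echo "Cluster $new_id has no Log URI; keeping $old_id" >&2
            return 1 ;;
    esac

    echo "Cluster $new_id logs to $log_uri; terminating $old_id"
    aws emr terminate-clusters --region "$region" --cluster-ids "$old_id"
}

# Example, using the sample IDs from this page (left commented out):
# replace_cluster_if_logging us-east-1 j-2GUZBF6NDX5I3 j-3DAZQCVIJ5BXE
```

Failing early when the Log URI is missing keeps the original cluster alive, so a misconfigured clone does not leave you without a working cluster.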

Publication date Feb 24, 2017