SQS Queue Unprocessed Messages

Reliability

Risk level: Medium (should be achieved)

Ensure that your Amazon Simple Queue Service (SQS) queues are not holding a high number of unprocessed messages because of unresponsive or incapacitated consumers. A consumer is an AWS compute resource, such as an EC2 instance or a Lambda function, that reads messages from the designated SQS queue and processes them. The default threshold for unprocessed messages is 100; you can change this threshold for the rule on the Cloud Conformity console.

This rule resolution is part of the Cloud Conformity Base Auditing Package

Whether you process raw images, transcode video files, or send out a massive number of emails, you need to keep your SQS consumers healthy and responsive by ensuring their availability and scalability within your environment. Otherwise, you will end up with a large number of messages in your SQS queues waiting to be processed.

Audit

To determine if there are any SQS queues that hold a high number of unprocessed messages within your AWS account, perform the following:

Using AWS Console

01 Sign in to the AWS Management Console.

02 Navigate to the SQS dashboard at https://console.aws.amazon.com/sqs/.

03 Select the SQS queue that you want to examine.

04 Select the Details tab from the bottom panel and check the Messages Available (Visible) attribute value. If the value is equal to or higher than the threshold (the default or a custom value set on the Cloud Conformity console), the selected SQS queue holds too many unprocessed messages, and the consumers (workers) assigned to the queue could be unhealthy or incapacitated.

05 Repeat steps 3 and 4 for each Amazon SQS queue available in the current AWS region.

06 Change the AWS region from the navigation bar to repeat the audit process for other regions.

Using AWS CLI

01 Run the list-queues command (OSX/Linux/UNIX) to list the URLs of all SQS queues available in the selected AWS region:

aws sqs list-queues \
	--region us-east-1

02 The command output should return the requested SQS URL(s):

{
    "QueueUrls": [
        "https://queue.amazonaws.com/123456789012/TranscoderSQSQueue",
        "https://queue.amazonaws.com/123456789012/WebWorkerSQSQueue"
    ]
}

03 Run the get-queue-attributes command (OSX/Linux/UNIX), using a queue URL returned at the previous step as the identifier and a custom query filter, to return the approximate number of messages currently available within the selected SQS queue:

aws sqs get-queue-attributes \
	--region us-east-1 \
	--queue-url https://queue.amazonaws.com/123456789012/TranscoderSQSQueue \
	--attribute-names ApproximateNumberOfMessages \
	--query 'Attributes.ApproximateNumberOfMessages'

04 The command output should return the number of SQS queue messages available at the request time:

"139"

If the value returned is equal to or higher than the threshold (default or custom) set for unprocessed messages, the selected Amazon SQS queue holds too many unprocessed messages, and the consumers reading from the queue could be shut down or too busy.

05 Repeat steps 3 and 4 for each AWS SQS queue provisioned in the current AWS region.

06 Change the AWS region by updating the --region command parameter value and repeat steps no. 1 - 5 to perform the entire process for other regions.
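
Steps 1 - 5 can be combined into a single pass per region. The following is a minimal sketch, not part of the rule itself: the function name backlogged_queues is hypothetical, the threshold defaults to the rule's default of 100, and the code assumes an already configured AWS CLI.

```shell
# Hypothetical helper (not an AWS CLI command): print every queue in a region
# whose approximate visible-message count is at or above a threshold.
# Usage: backlogged_queues <region> [threshold]
backlogged_queues() {
    local region="$1" threshold="${2:-100}" url depth
    # With --output text the queue URLs are tab-separated; tr puts one per line.
    aws sqs list-queues --region "$region" \
        --query 'QueueUrls[]' --output text | tr '\t' '\n' |
    while read -r url; do
        # Skip blank lines and the literal "None" printed for an empty result.
        if [ -z "$url" ] || [ "$url" = "None" ]; then continue; fi
        # stdin is redirected so the inner call cannot consume the URL list.
        depth=$(aws sqs get-queue-attributes --region "$region" \
            --queue-url "$url" \
            --attribute-names ApproximateNumberOfMessages \
            --query 'Attributes.ApproximateNumberOfMessages' \
            --output text < /dev/null)
        # Report the queue only when the count reaches the threshold.
        if [ "$depth" -ge "$threshold" ]; then
            printf '%s\t%s\n' "$url" "$depth"
        fi
    done
}
```

Calling backlogged_queues us-east-1 100 prints one line per offending queue (URL, then message count). Because ApproximateNumberOfMessages is an approximate value, counts close to the threshold may differ between runs.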

Remediation / Resolution

To restore the availability and scalability of your SQS consumers (workers) and prevent unprocessed messages from accumulating in your Amazon SQS queues, perform the following:

Using AWS Console

01 Sign in to the AWS Management Console.

02 Navigate to the SQS dashboard at https://console.aws.amazon.com/sqs/.

03 Choose the SQS queue that holds a high number of unprocessed messages (see Audit section part I to identify the right resource) and identify its unresponsive/incapacitated consumer(s).

04 Based on the AWS resource type used for the SQS consumer, perform one of the following sets of actions:

  1. If the consumer/worker is an individual EC2 instance, perform the following:
    • Navigate to the EC2 dashboard at https://console.aws.amazon.com/ec2/.
    • In the left navigation panel, under the Instances section, click Instances.
    • Select the worker EC2 instance.
    • If the instance status check has failed and the resource is unreachable, execute the following:
      • Click the Actions button from the dashboard top menu, select Instance State, then choose Reboot.
      • In the Reboot Instances dialog box, review the instance details and click Yes, Reboot to reboot the instance.
    • If the status check has passed, the instance most likely does not have enough capacity to process the necessary SQS messages. To upgrade the instance type, execute the following:
      • Click the Actions button from the dashboard top menu, select Instance State, then select Stop.
      • In the Stop Instances dialog box, review the action details and click Yes, Stop to stop the instance.
      • Once the instance is stopped, click the Actions button again, select Instance Settings, then select Change Instance Type.
      • In the Change Instance Type dialog box, choose the type to upgrade to from the Instance Type dropdown list, then click Apply to change the instance type.
      • Click the Actions button from the dashboard top menu, select Instance State, then select Start.
      • In the Start Instances dialog box, click Yes, Start to restart the instance. The instance boot process and its system checks should take a few minutes.
    • If the selected worker instance cannot resume processing the available SQS messages after the reboot or the capacity (instance type) upgrade, you may need to troubleshoot your worker application.
  2. If the consumer is a fleet of EC2 instances managed by an AWS Auto Scaling Group (ASG), perform the following:
    • Navigate to the EC2 dashboard at https://console.aws.amazon.com/ec2/.
    • In the navigation panel, under the AUTO SCALING section, choose Auto Scaling Groups.
    • Select the ASG provisioned as the SQS worker fleet.
    • If the group instances are healthy, the ASG most likely does not have enough capacity to consume the required SQS messages. To increase the size of the ASG in order to handle the load, execute the following:
      • Select the Details tab from the dashboard bottom panel and click the Edit button to update the selected ASG configuration.
      • Increase the number of EC2 worker instances in the Desired box to add more compute power to the group. Depending on your current ASG configuration, you may also need to increase the number of instances in the Max field.
      • Click the Save button to save the changes. The ASG will now start to provision new EC2 instances and increase the compute capacity of the worker fleet.
    • If the consumer ASG cannot resume the SQS queue processing after the capacity (number of instances) upgrade, you may need to troubleshoot your worker application.
  3. If the SQS consumer is a Lambda function, perform the following:
    • Navigate to the Lambda dashboard at https://console.aws.amazon.com/lambda/.
    • In the navigation panel, under the AWS Lambda section, choose Functions.
    • Select the Lambda function that serves as the SQS consumer.
    • Select the Monitoring tab from the dashboard bottom panel, then click the View logs in CloudWatch link to access the selected function logs. Open the relevant log stream and analyze it for errors.
    • If the function log stream does not contain any errors, the Lambda function most likely does not have enough resources to process the designated SQS messages. To increase the serverless consumer resources, execute the following:
      • Select the Configuration tab and click Advanced settings to open the resources settings panel.
      • To increase the worker compute capacity, select a larger value from the Memory (MB) dropdown list. To give each invocation more time to finish, raise the existing timeout value in the Timeout min/sec configuration boxes.
      • Click the Save button from the dashboard top menu to apply the changes.
    • If the Lambda serverless worker cannot resume the SQS queue processing after the capacity (memory or timeout) upgrade, click the Code tab available on the configuration page and troubleshoot your worker function.

05 Repeat steps 3 and 4 for each SQS queue with unhealthy or incapacitated consumers available in the current AWS region.

06 Change the AWS region from the navigation bar and repeat the process for other regions.
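
The log review described for Lambda consumers in step 04 can also be done from a terminal. The following is a minimal sketch, assuming the function writes to the default log group /aws/lambda/<function-name>; the helper name lambda_log_matches, the default pattern ERROR, and the 60-minute default window are illustrative choices rather than anything prescribed by this rule.

```shell
# Hypothetical helper: print recent log messages of a Lambda consumer that
# match a CloudWatch Logs filter pattern (default: the term ERROR).
# Assumes the default log group name /aws/lambda/<function-name>.
# Usage: lambda_log_matches <region> <function-name> [pattern] [minutes]
lambda_log_matches() {
    local region="$1" fn="$2" pattern="${3:-ERROR}" minutes="${4:-60}" start_ms
    # filter-log-events takes --start-time in milliseconds since the Unix epoch.
    start_ms=$(( ( $(date +%s) - minutes * 60 ) * 1000 ))
    aws logs filter-log-events --region "$region" \
        --log-group-name "/aws/lambda/$fn" \
        --filter-pattern "$pattern" \
        --start-time "$start_ms" \
        --query 'events[].message' --output text
}
```

For example, lambda_log_matches us-east-1 MySQSWorkerFunction lists messages from the last hour that contain ERROR. An empty result does not prove the function is healthy, since what a function logs on failure depends on its runtime and code; adjust the pattern to match your worker's log format.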

Using AWS CLI

01 Choose the SQS queue that holds a high number of unprocessed messages (see Audit section part II to identify the right SQS resource) and identify its consumer(s).

02 Based on the AWS SQS consumer (worker) type, perform one of the following sets of commands:

  1. If the worker is a single EC2 instance, perform the following:
      • Run the describe-instance-status command (OSX/Linux/UNIX) to describe the current status of the specified EC2 instance:
        aws ec2 describe-instance-status \
        	--region us-east-1 \
        	--instance-ids i-0ed97c0e9ce54123a
        
      • The command output should return the selected worker instance status details:
        {
            "InstanceStatuses": [
                {
                    "InstanceId": "i-0ed97c0e9ce54123a",
                    "InstanceState": {
                        "Code": 16,
                        "Name": "running"
                    },
                    "AvailabilityZone": "us-east-1d",
                    "SystemStatus": {
                        "Status": "impaired",
                        "Details": [
                            {
                                "Status": "failed",
                                "Name": "reachability"
                            }
                        ]
                    },
                    "InstanceStatus": {
                        "Status": "impaired",
                        "Details": [
                            {
                                "Status": "failed",
                                "Name": "reachability"
                            }
                        ]
                    }
                }
            ]
        }
        
      • If the reachability check listed under SystemStatus and/or InstanceStatus has its Status value set to "failed" (as shown in the example above), the resource is unreachable and requires a reboot. If both checks have the Status value set to "passed", skip to the instance type upgrade steps below.
      • Run the reboot-instances command (OSX/Linux/UNIX) to reboot the selected EC2 worker instance (the command does not produce an output):
        aws ec2 reboot-instances \
        	--region us-east-1 \
        	--instance-ids i-0ed97c0e9ce54123a
        
      • If the reachability checks under both SystemStatus and InstanceStatus have the Status value set to "passed", the instance may not have enough capacity to process the necessary SQS messages. To upgrade the instance type, you need to stop the instance first by executing the stop-instances command (OSX/Linux/UNIX). The command output reports the current and previous state of the instance:
        aws ec2 stop-instances \
        	--region us-east-1 \
        	--instance-ids i-0ed97c0e9ce54123a
        
      • Once the instance has reached the stopped state, run the modify-instance-attribute command (OSX/Linux/UNIX) to upgrade (resize) the selected instance to the required type (for example, c4.large). If successful, the command returns no output:
        aws ec2 modify-instance-attribute \
        	--region us-east-1 \
        	--instance-id i-0ed97c0e9ce54123a \
        	--instance-type "{\"Value\": \"c4.large\"}"
        
      • Run the start-instances command (OSX/Linux/UNIX) to restart the upgraded EC2 instance (it may take a few minutes until the instance enters the running state). The command output reports the current and previous state of the instance:
        aws ec2 start-instances \
        	--region us-east-1 \
        	--instance-ids i-0ed97c0e9ce54123a
        
      • If the selected worker instance cannot resume the processing of the available SQS messages after reboot or capacity upgrade, you may need to troubleshoot your worker application.
  2. If the consumer is a fleet of EC2 instances managed by an AWS Auto Scaling Group (ASG), perform the following:
    • Run the describe-auto-scaling-groups command (OSX/Linux/UNIX) to describe the configuration of the EC2 instances (workers) attached to the specified ASG:
      aws autoscaling describe-auto-scaling-groups \
      	--region us-east-1 \
      	--auto-scaling-group-names CC-Worker-ASG \
      	--query 'AutoScalingGroups[*].Instances[]'
      
    • The command output should return the requested workers configuration metadata:
      [
          {
              "ProtectedFromScaleIn": false,
              "AvailabilityZone": "us-east-1a",
              "InstanceId": "i-0a8a716eaa298908b",
              "HealthStatus": "Healthy",
              "LifecycleState": "InService",
              "LaunchConfigurationName": "CC-ASG-LC"
          },
          {
              "ProtectedFromScaleIn": false,
              "AvailabilityZone": "us-east-1b",
              "InstanceId": "i-0b0b34a09ec49524d",
              "HealthStatus": "Healthy",
              "LifecycleState": "Pending",
              "LaunchConfigurationName": "CC-ASG-LC"
          }
      ]
      
    • If the HealthStatus attribute value of the EC2 instance(s) is set to "Healthy" (as shown in the example above), the group instances are healthy and responsive, so the selected ASG probably does not have enough capacity to consume the required SQS messages. To increase the size of the ASG in order to handle the queue load, execute the update-auto-scaling-group command (OSX/Linux/UNIX), passing the desired and maximum number of EC2 instances that will run within the group as parameters (the command does not return an output):
      aws autoscaling update-auto-scaling-group \
      	--region us-east-1 \
      	--auto-scaling-group-name CC-Worker-ASG \
      	--desired-capacity 3 \
      	--max-size 3
      
    • If the ASG consumer cannot resume the SQS queue processing after the capacity upgrade, you may need to troubleshoot your worker application.
  3. If the SQS consumer is a Lambda function, perform the following:
    • If your Lambda function log streams do not contain errors (see Remediation/Resolution section part I, step 04, item 3 to analyze the required log streams), the Lambda function may not have enough resources to process the necessary AWS SQS messages. To increase the serverless consumer resources, you first need to know the values currently allocated by executing the get-function command (OSX/Linux/UNIX) with custom query filters:
      	aws lambda get-function \
      		--region us-east-1 \
      		--function-name MySQSWorkerFunction \
      		--query 'Configuration.[MemorySize,Timeout]'
      	
    • The command output should return the memory allocated to your function as the first value and the processing timeout (in seconds) as the second value. For example, the following Lambda function can use 128 MB of memory and has a maximum of 10 seconds of compute time per invocation:
      	[
      	    128,
      	    10
      	]
      	
    • To increase the SQS consumer compute capacity, increase the memory allocated to the selected function; to give each invocation more time to finish, increase the existing timeout value (in seconds). Both are changed with the update-function-configuration command (OSX/Linux/UNIX). The following example updates the memory size and timeout of an AWS Lambda function named MySQSWorkerFunction to 256 (MB) and 15 (seconds) using the --memory-size and --timeout parameters:
      	aws lambda update-function-configuration \
      		--region us-east-1 \
      		--function-name MySQSWorkerFunction \
      		--memory-size 256 \
      		--timeout 15
      	
    • The command output should return the new configuration for your AWS Lambda function:
      	{
      	    "FunctionName": "MySQSWorkerFunction",
      	    "MemorySize": 256,
      	    "FunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:MySQSWorkerFunction",
      	    "Environment": {
      	        "Variables": {
      	            "queueUrl": "https://cloudconformity.com/transcode/"
      	        }
      	    },
      
      	    ...
      
      	    "Version": "$LATEST",
      	    "Role": "arn:aws:iam::123456789012:role/service-role/SQSWorkerRole",
      	    "Timeout": 15,
      	    "LastModified": "2016-09-04T12:41:59.468+0000",
      	    "Handler": "index.handler",
      	    "Runtime": "nodejs4.3"
      	}
      	
    • If the Lambda serverless worker cannot resume the SQS queue processing after the capacity (memory or timeout) upgrade, you may need to analyze and troubleshoot your consumer function.

03 Repeat steps 1 and 2 for each SQS queue that has unresponsive or incapacitated consumers available in the current AWS region.

04 Change the AWS region by updating the --region command parameter value and repeat the entire process for other regions.
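
The resize sequence for a single EC2 worker only succeeds if each state change has completed before the next command runs. The following is a minimal sketch that chains the same commands with the CLI's built-in waiters (aws ec2 wait instance-stopped and aws ec2 wait instance-running); the function name resize_worker is hypothetical, and the sketch assumes the instance can be stopped and started safely (for example, no data is kept on instance store volumes).

```shell
# Hypothetical helper: stop an EC2 worker, change its instance type, and start
# it again, waiting for each state change before issuing the next command.
# Usage: resize_worker <region> <instance-id> <new-instance-type>
resize_worker() {
    local region="$1" id="$2" new_type="$3"
    aws ec2 stop-instances --region "$region" --instance-ids "$id" \
        > /dev/null || return 1
    # The instance type can only be changed once the instance is fully stopped.
    aws ec2 wait instance-stopped --region "$region" --instance-ids "$id" \
        || return 1
    aws ec2 modify-instance-attribute --region "$region" --instance-id "$id" \
        --instance-type "{\"Value\": \"$new_type\"}" || return 1
    aws ec2 start-instances --region "$region" --instance-ids "$id" \
        > /dev/null || return 1
    # Block until the resized instance reports the running state.
    aws ec2 wait instance-running --region "$region" --instance-ids "$id"
}
```

For example, resize_worker us-east-1 i-0ed97c0e9ce54123a c4.large performs the whole sequence. The function returns a non-zero status as soon as any step fails and leaves the instance in whatever state it had reached, so check the instance state before retrying.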

Publication date Apr 3, 2016