
Vikrant Sundriyal | 9 January 2016

Scheduling AWS Tasks using Data Pipeline Service

===============================================

Topic also covers…

Scheduling AWS Tasks using AWS Data Pipeline Service

Scheduling EC2 Instances without using third party software

Using AWS Data Pipeline to Schedule tasks

How to schedule CLI commands and Custom Scripts in AWS

===============================================

With the increasing demand for AWS services, the need to manage AWS resources in a more efficient and economical way is also growing.

This blog explains how the AWS Data Pipeline service can be used to automate tasks, which can result in cost optimization.

I am using an example where EC2 instances are stopped at a certain time and restarted later with the help of Data Pipeline. This can be implemented to reduce EC2 instance cost by bringing instances down during non-working hours. Similar automations can be built for other optimizations. You can also execute your own custom script with the help of Data Pipeline; for example, if you wish to take periodic backups (images), you can write a CLI command and schedule the job at your preferred time.
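As an illustration of the backup use case, a command along the lines below could be scheduled. The instance ID i-id1 and the region are placeholders you would replace, and the $(date +%F) part simply gives each image a unique name, since AMI names must be unique within an account and region:

aws ec2 create-image --instance-id i-id1 --name "backup-$(date +%F)" --no-reboot --region <<your region>>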

Note: There are many third party tools available for performing similar tasks, but those solutions require you either to share credentials or to keep an instance running to handle the scheduling. With Data Pipeline, an instance starts only when the task is scheduled and terminates immediately after execution.

Below are the steps for configuring Data Pipeline:

  1. Log in to the AWS console and go to the Data Pipeline service.
  2. Click on Create new Pipeline.
    • Under Source, select Command Line Interface (CLI).
    • Provide a valid CLI command under Parameters. For example, to stop a few instances I used the command below:

aws ec2 stop-instances --instance-ids i-id1 i-id2 --region <<your region>>

      Note: you can add more instance IDs, and change the region to match your instances' region.

    • Select the execution time and logs location.
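If you prefer to script the setup itself, the same pipeline can also be built from the AWS CLI. Below is a minimal sketch of a pipeline definition, assuming the default DataPipelineDefaultRole and DataPipelineDefaultResourceRole exist; the schedule, log bucket, and instance IDs are placeholders to adjust. Save it as, say, stop-pipeline.json:

{
     "objects": [
          {
               "id": "Default",
               "name": "Default",
               "scheduleType": "cron",
               "failureAndRerunMode": "CASCADE",
               "pipelineLogUri": "s3://<<your log bucket>>/logs/",
               "role": "DataPipelineDefaultRole",
               "resourceRole": "DataPipelineDefaultResourceRole",
               "schedule": { "ref": "DailySchedule" }
          },
          {
               "id": "DailySchedule",
               "name": "DailySchedule",
               "type": "Schedule",
               "period": "1 Day",
               "startDateTime": "2016-01-10T18:00:00"
          },
          {
               "id": "StopInstances",
               "name": "StopInstances",
               "type": "ShellCommandActivity",
               "command": "aws ec2 stop-instances --instance-ids i-id1 i-id2 --region <<your region>>",
               "runsOn": { "ref": "TaskRunner" }
          },
          {
               "id": "TaskRunner",
               "name": "TaskRunner",
               "type": "Ec2Resource",
               "terminateAfter": "15 Minutes"
          }
     ]
}

Then create, define, and activate the pipeline:

aws datapipeline create-pipeline --name stop-ec2-pipeline --unique-id stop-ec2-pipeline
aws datapipeline put-pipeline-definition --pipeline-id <<pipeline id from previous command>> --pipeline-definition file://stop-pipeline.json
aws datapipeline activate-pipeline --pipeline-id <<pipeline id>>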

Similar steps can be followed to schedule an instance start; only the CLI command provided under Parameters differs, as shown below:

aws ec2 start-instances --instance-ids i-id1 i-id2 --region <<your region>>

Any complex command can be scheduled following the above steps.

Below is a command to stop all instances running under an account in a particular region. Instead of hardcoding instance IDs, I am querying all running instances and stopping them.

aws ec2 describe-instances --region <<your region>> --filters Name=instance-state-name,Values=running --query 'Reservations[].Instances[].InstanceId' --output text | xargs aws ec2 stop-instances --region <<your region>> --instance-ids
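A variation on the same idea: if you want the job to touch only opted-in machines, you can filter on a tag in addition to the instance state. The AutoStop tag used here is a hypothetical naming convention, not anything AWS defines:

aws ec2 describe-instances --region <<your region>> --filters Name=tag:AutoStop,Values=true Name=instance-state-name,Values=running --query 'Reservations[].Instances[].InstanceId' --output text | xargs aws ec2 stop-instances --region <<your region>> --instance-ids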

  3. You can give the Role as DataPipelineDefaultResourceRole. One important point is to check the default policy attached to this role in IAM. There is a chance that the attached policy doesn't include the rights to perform the action you wish to run via the CLI; for example, by default ec2 stop-instances will not work. In this case you can create a custom policy with the required rights and attach it to the DataPipelineDefaultResourceRole role.

Sample policy:

{
     "Version": "2012-10-17",
     "Statement": [
          {
               "Effect": "Allow",
               "Action": [
                    "s3:*",
                    "ec2:Describe*",
                    "ec2:Start*",
                    "ec2:RunInstances",
                    "ec2:Stop*",
                    "datapipeline:*",
                    "cloudwatch:*"
               ],
               "Resource": [
                    "*"
               ]
          }
     ]
}
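If you save the policy above as, say, pipeline-ec2-policy.json, one way to create it and attach it to the role from the CLI is shown below. The policy name is a placeholder, and create-policy prints the policy ARN that the second command needs:

aws iam create-policy --policy-name DataPipelineEC2SchedulerPolicy --policy-document file://pipeline-ec2-policy.json
aws iam attach-role-policy --role-name DataPipelineDefaultResourceRole --policy-arn arn:aws:iam::<<your account id>>:policy/DataPipelineEC2SchedulerPolicy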

Conclusions:

  • Data Pipeline can be used to automate CLI commands or custom scripts.
  • It is cost-effective, as the scheduler's start and termination are handled by AWS: an instance starts only at the scheduled time and terminates after execution.
  • It is more secure than third party scheduler tools, which force you to upload credentials.