emr>: Amazon Elastic MapReduce¶
The emr> operator can be used to run EMR jobs, create clusters, and submit steps to existing clusters.
For detailed information about EMR, see the Amazon Elastic MapReduce Documentation.
+emr_job:
  emr>:
  cluster:
    name: my-cluster
    ec2:
      key: my-ec2-key
      master:
        type: m3.2xlarge
      core:
        type: m3.xlarge
        count: 10
    logs: s3://my-bucket/logs/
  staging: s3://my-bucket/staging/
  steps:
    - type: spark
      application: pi.py
    - type: spark-sql
      query: queries/query.sql
      result: s3://my-bucket/results/${session_uuid}/
    - type: script
      script: scripts/hello.sh
      args: [hello, world]
Secrets¶
Set the following parameters using the digdag secrets command; an example follows the list.
aws.emr.access_key_id, aws.access_key_id
The AWS Access Key ID to use when submitting EMR jobs.
aws.emr.secret_access_key, aws.secret_access_key
The AWS Secret Access Key to use when submitting EMR jobs.
aws.emr.role_arn, aws.role_arn
The AWS Role to assume when submitting EMR jobs.
aws.emr.region, aws.region
The AWS region to use for EMR service.
aws.emr.endpoint
The AWS EMR endpoint address to use.
aws.s3.region, aws.region
The AWS region to use for S3 service to store staging files.
aws.s3.endpoint
The AWS S3 endpoint address to use for staging files.
aws.kms.region, aws.region
The AWS region to use for KMS service to encrypt variables passed to EMR jobs.
aws.kms.endpoint
The AWS KMS endpoint address to use for EMR variable encryption.
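For example, the EMR credentials can be stored as local-mode secrets like this (a sketch; each command prompts interactively for the secret value):

$ digdag secrets --local --set aws.emr.access_key_id
$ digdag secrets --local --set aws.emr.secret_access_key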
Options¶
cluster: STRING | OBJECT
Specifies either the ID of an existing cluster to submit steps to or the configuration of a new cluster to create.
Using an existing cluster:
cluster: j-7KHU3VCWGNAFL
Creating a new minimal ephemeral cluster with just one node:
cluster:
  ec2:
    key: my-ec2-key
  logs: s3://my-bucket/logs/
Creating a customized cluster with several hosts:
cluster:
  name: my-cluster
  auto_terminate: false
  release: emr-5.2.0
  applications:
    - hadoop
    - spark
    - hue
    - zookeeper
  ec2:
    key: my-ec2-key
    subnet_id: subnet-83047402b
    master:
      type: m4.2xlarge
    core:
      type: m4.xlarge
      count: 10
      ebs:
        optimized: true
        devices:
          volume_specification:
            iops: 10000
            size_in_gb: 1000
            type: gp2
          volumes_per_instance: 6
    task:
      - type: c4.4xlarge
        count: 20
      - type: g2.2xlarge
        count: 6
  logs: s3://my-bucket/logs/
  bootstrap:
    - install_foo.sh
    - name: Install Bar
      path: install_bar.sh
      args: [baz, quux]
staging: S3_URI
An S3 folder to use for staging local files for execution on the EMR cluster. Note: the configured AWS credentials must have permission to put and get objects in this folder.
Examples:
staging: s3://my-bucket/staging/
emr.region
The AWS region to use for EMR service.
emr.endpoint
The AWS EMR endpoint address to use.
s3.region
The AWS region to use for S3 service to store staging files.
s3.endpoint
The AWS S3 endpoint address to use for staging files.
kms.region
The AWS region to use for KMS service to encrypt variables passed to EMR jobs.
kms.endpoint
The AWS KMS endpoint address to use for EMR variable encryption.
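These are plain parameters, so — assuming the standard digdag parameter mechanism — they could be pinned for every task in a workflow via _export. A sketch (the region values are placeholders):

_export:
  emr:
    region: us-west-2
  s3:
    region: us-west-2

+emr_job:
  emr>:
  cluster: j-7KHU3VCWGNAFL
  staging: s3://my-bucket/staging/
  steps:
    - type: command
      command: echo
      args: [hello]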
steps: LIST
A list of steps to submit to the EMR cluster.
steps:
  - type: flink
    application: flink/WordCount.jar
  - type: hive
    script: queries/hive-query.q
    vars:
      INPUT: s3://my-bucket/data/
      OUTPUT: s3://my-bucket/output/
    hiveconf:
      hive.support.sql11.reserved.keywords: false
  - type: spark
    application: spark/pi.scala
  - type: spark
    application: s3://my-bucket/spark/hello.py
    args: [foo, bar]
  - type: spark
    application: spark/hello.jar
    class: com.example.Hello
    jars:
      - libhello.jar
      - s3://td-spark/td-spark-assembly-0.1.jar
    conf:
      spark.locality.wait: 5s
      spark.memory.fraction: 0.5
    args: [foo, bar]
  - type: spark-sql
    query: spark/query.sql
    result: s3://my-bucket/results/${session_uuid}/
  - type: script
    script: s3://my-bucket/scripts/hello.sh
    args: [hello, world]
  - type: script
    script: scripts/hello.sh
    args: [world]
  - type: command
    command: echo
    args: [hello, world]
action_on_failure: TERMINATE_JOB_FLOW | TERMINATE_CLUSTER | CANCEL_AND_WAIT | CONTINUE
The action EMR should take in response to a job step failing.
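For example, to submit steps to an existing cluster and continue past individual step failures (a minimal sketch; the cluster ID is a placeholder):

+emr_job:
  emr>:
  cluster: j-7KHU3VCWGNAFL
  action_on_failure: CONTINUE
  staging: s3://my-bucket/staging/
  steps:
    - type: script
      script: scripts/hello.sh
      args: [hello]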
Output parameters¶
emr.last_cluster_id
The ID of the cluster created. If a pre-existing cluster was used, this parameter will not be set.
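A downstream task can reference it with the usual ${...} syntax, for example to submit further steps to a cluster created earlier in the same session. A sketch (it assumes the cluster was created with auto_terminate: false so that it is still running):

+create_cluster:
  emr>:
  cluster:
    name: my-cluster
    auto_terminate: false
    ec2:
      key: my-ec2-key
    logs: s3://my-bucket/logs/
  steps:
    - type: command
      command: echo
      args: [cluster, ready]

+submit_more_steps:
  emr>:
  cluster: ${emr.last_cluster_id}
  staging: s3://my-bucket/staging/
  steps:
    - type: command
      command: echo
      args: [hello, again]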