If you’ve been following these blog series, we are looking at a BigData project to analyze Covid-19 data. So far, we have looked at the overall architecture and ETL Spark jobs. In this post, let’s look the scheduler component (Lambda function) in this workflow.
The main reason I’m using Lambda here is due to its serverless nature and native integration with other AWS services. For example, we could trigger it via a CloudWatch scheduled rule in regular basis or manually for backfilling/testing purpose. We can also create a test event in Lambda console to easily invoke and test its functionality. So, for this kind of use case, I think Lambda is an ideal option.
That being said, I’m again using Python (3.7 to be specific) to perform this task. Here’s the full code for this one. To summarize, it does the following:
- It takes several input parameters from environmental variables. These include EMR cluster name, S3 input location, script location, DynamoDB table, Glue DB/table, etc.
- When it gets triggered, it checks whether it has year/month/day present in the event. If yes, it treats as test/manual invocation and set the date accordingly. Else, it treats it as invocation from CloudWatch Rule and will set date to yesterday.
- After the date is determined, it checks whether there is any active EMR cluster or job-flow for the specified name. It only continues if a valid cluster is found and submits the Steps for ETL jobs. In case you’d like to use it as a transient cluster, we could also modify this code to launch a cluster that auto-terminates after all Steps are completed.
- For second ETL job, this Lambda function also adds the corresponding date-based partition into the input Glue table because it’ll be queried by the ETL2 job afterwards.
- Then, it crafts the Step parameters and submits them to the EMR cluster. While doing so, it also writes the item to DynamoDB table, which then triggers notifier Lambda function to push notification to Amazon Chime channel.
- Additionally, this function also contains several other logical checks to ensure that duplicate jobs aren’t submitted. For this, it checks the concerned item in DynamoDB table and the EMR Step status. This way, it prevents any duplicate submission and also tries to ensure job success.
- Another advantage of this kind of scheduler Lambda function is its ability to accommodate any number of ETL jobs at once with minimal code changes. Plus, it provides solid monitoring features with CloudWatch metrics and logs, which are highly essential for troubleshooting when something goes wrong.
Now, after the ETL Spark jobs are submitted to EMR cluster, they are executed sequentially or in parallel depending on Step Concurrency and YARN resources available in cluster. Similarly, the ETL jobs also contain logic to update record in DynamoDB table when the job starts running or completes. This then triggers the notification flow and notifies in Chime channel. Therefore, we can easily view and trace what’s happening in our workflow.
Next, I’m visualizing the resulting data to create a Dashboard using QuickSight and Athena. These components mostly involve work in QuickSight console. So, I’ll instead prepare walk-through or tutorial videos to cover this. Once published in Youtube, I’ll link these videos in these blog posts for reference.
So, that’s the high level overview of this project. I hope you found this useful and learned something new. If you’ve any question or feedback, feel free to drop a comment below. Have a good day ahead! Stay safe and have fun!