Last year, Lucid Software's data science and analytics teams moved to Apache Airflow for scheduling tasks. Airflow was a major improvement over our previous solution (running Windows Task Scheduler on an analyst's laptop and hoping it worked), but we've had to work through a few hurdles to get everything running smoothly. One interesting hurdle has been getting Airflow's provided DockerOperator to work with images on AWS's hosted private Elastic Container Registry (ECR). In this post, I will take you through what we did to make Airflow and ECR work together. This is written under the assumption that you know the basics of Airflow and Docker, though not necessarily ECR.

While not all of the jobs we run with Airflow require Docker, a few needed the portability that Docker provides. Most of our analysts and data scientists work on OS X or Windows, while our Airflow cluster runs on Linux. If a job relied on system APIs, we couldn't guarantee it would work the same on the Airflow cluster as it did on the developer's laptop. For example, one analyst wrote a web scraper with the Selenium web driver, and while it worked on his laptop, some of the system calls Selenium used were failing on Linux. Debugging each system call and finding a way to make each step of the scraper work in every environment we support would have required a significant up-front cost and left us with fragile code, requiring the same fixes the next time someone changed it. Instead, we helped the analyst move his scraper into a Docker container, creating something we could easily maintain.

Once we had the image, we then needed to move that image into ECR. First, we needed to give the analysts access to ECR and have them push their container, so we gave the analyst access to ECR in IAM by adding a few policies. At the very least, someone pushing a container to ECR will need the ecr:GetAuthorizationToken and ecr:PutImage permissions. If you want to manage repositories yourself, that's all you need. If you want someone to manage the repository they are pushing to as well, you'll also need to give them the ecr:CreateRepository permission. For more detailed information, AWS provides excellent tutorials: Creating a Repository and Pushing an Image.

Connect Airflow to ECR

Next, we needed to give Airflow permissions to pull the image of the job from ECR. Our Airflow cluster runs on EC2 instances, so we gave those specific permissions to the IAM roles associated with those instances. The permissions Airflow needed were ecr:BatchCheckLayerAvailability, ecr:BatchGetImage, ecr:GetAuthorizationToken, and ecr:GetDownloadUrlForLayer. From there, we set up Airflow to be able to communicate with our account's ECR. Airflow communicates with the Docker repository by looking for connections with the type "docker" in its list of connections.
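As a sketch, the minimal push policy for an analyst could be expressed as an IAM policy document like the one below. The account ID, region, and repository name are placeholders for illustration, not values from our setup:

```python
import json

# Hypothetical minimal IAM policy letting an analyst push to a single
# ECR repository. Account ID, region, and repository name are placeholders.
ANALYST_PUSH_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # GetAuthorizationToken is not resource-scoped, so it needs "*".
            "Effect": "Allow",
            "Action": "ecr:GetAuthorizationToken",
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["ecr:PutImage"],
            "Resource": "arn:aws:ecr:us-east-1:123456789012:repository/my-scraper",
        },
    ],
}

print(json.dumps(ANALYST_PUSH_POLICY, indent=2))
```

Scoping the second statement to a single repository ARN keeps analysts from pushing to repositories they don't own; add ecr:CreateRepository to the first statement if they should manage repositories themselves.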
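When referencing the image from Airflow, the DockerOperator's image parameter needs the fully qualified ECR image name. A small helper (hypothetical, not from our codebase) that assembles the standard account.dkr.ecr.region.amazonaws.com/repo:tag form:

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build the fully qualified image name for a private ECR repository."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

# Example with placeholder values:
print(ecr_image_uri("123456789012", "us-east-1", "my-scraper", "v1"))
# → 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-scraper:v1
```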
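A "docker" connection wants a username and password, while ecr:GetAuthorizationToken returns a base64-encoded token of the form AWS:&lt;password&gt;. A sketch of the decoding step, assuming the standard token format (the boto3 call is shown in a comment so the snippet stays self-contained):

```python
import base64

def split_ecr_token(authorization_token: str) -> tuple[str, str]:
    """Decode an ECR authorization token into (username, password).

    ECR tokens are base64("AWS:<password>"); the username is always "AWS".
    """
    username, _, password = base64.b64decode(authorization_token).decode().partition(":")
    return username, password

# In practice the token would come from boto3, e.g.:
#   resp = boto3.client("ecr").get_authorization_token()
#   token = resp["authorizationData"][0]["authorizationToken"]
# Demo with a synthetic token:
demo = base64.b64encode(b"AWS:s3cr3t").decode()
print(split_ecr_token(demo))  # → ('AWS', 's3cr3t')
```

Note that these tokens expire (after roughly twelve hours), so the connection's credentials have to be refreshed periodically rather than stored once.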