We needed incredibly basic functionality. I'm leading a small team of more junior engineers starting on a new project (none of them are familiar with any orchestration tool and come from a BI background), and I'm already familiar with Airflow's quirks, so decided to just go with it. Made the architecture decision of executing all tasks as AWSBatchOperator tasks in containers, so migrating to another orchestrator in the future would be less painful if needed, and it decouples everything from Airflow as much as possible. It gets the job done and not much else, IMO. The addition of the REST API in version 2 was a HUGE improvement. Airflow wants to sit at the center of your data ecosystem since it's an orchestration engine, and not being able to communicate with it via external services was a huge gap in functionality. If my team's skillset were more software-focused vs a BI background, I would have likely gone with Prefect. Unfortunately the time to value would've been way too long: teaching them what the orchestration layer should do vs not, why Prefect architecturally couples some concepts vs others, how to write Python, how to write good Python, and only then finally start developing code.

We've also been running Airflow for the past 2-3 years at a similar scale (~5,000 DAGs, 100k+ task executions daily) for our data platform. We weren't aware of a great alternative when we started. Our DAGs are all config-driven and populate a few different templates (e.g. ingestion = ingest > validate > publish > scrub PII > publish), so we really don't need all the flexibility that Airflow provides. We have had SO many headaches operating Airflow over the years, and each time we invest in fixing an issue I feel more and more entrenched. We've hit scaling issues at the k8s level, scheduling overhead in Airflow, random race conditions deep in the Airflow code, etc.
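The config-driven setup described above isn't shown, but the pattern is roughly: a config entry names a template, and the template expands into an ordered chain of tasks. A minimal, hypothetical sketch in plain Python (no Airflow; `TEMPLATES`, `build_pipeline`, and the concrete step names are invented for illustration -- the real setup would render these into Airflow DAGs):

```python
# Hypothetical sketch (not the commenter's actual code) of a config-driven
# pipeline factory: a config entry names a template, and the template
# expands into an ordered chain of task ids.

TEMPLATES = {
    # The ingestion template from the comment: ingest > validate > publish >
    # scrub PII > publish. Step names below are invented so ids stay distinct.
    "ingestion": ["ingest", "validate", "publish_raw", "scrub_pii", "publish_clean"],
}

def build_pipeline(config: dict) -> list[str]:
    """Expand one config entry into a concrete, ordered task list."""
    steps = TEMPLATES[config["template"]]
    # Prefix each step with the dataset name so task ids are unique per DAG.
    return [f"{config['dataset']}.{step}" for step in steps]

print(build_pipeline({"template": "ingestion", "dataset": "orders"}))
# ['orders.ingest', 'orders.validate', 'orders.publish_raw',
#  'orders.scrub_pii', 'orders.publish_clean']
```

The appeal of the pattern is that adding a new dataset is a config change, not new DAG code -- which is also why the comment notes they don't need most of Airflow's flexibility.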
I come from a SWE background, so I like that Temporal is 100% code-first. It's easily unit-testable and easily integrates w/ SDKs - especially because you build workers to run workflow actions in Go, Python, Java, or JavaScript, so it's low-effort to integrate a different language ecosystem if needed. We are a "little data" shop - much more focused on data availability and quality than big-data analytics. I have Go Temporal workers that run data transformations in-process via Benthos, and Python Temporal workers that run data validation jobs via Great Expectations. And all of these actions easily emit modern observability data via metrics and traces. Downside is the Temporal core is a bit heavyweight - it runs 4-5 services and needs a backend datastore, though it easily integrates w/ any type of Postgres service. The UI is also not super clear for dashboarding - however, we dashboard the metrics we consume via Grafana. And, not data-focused, but because Temporal is actually just a workflow engine, we can easily utilize the backend w/ a separate set of agents to run other workflows we need, like onboarding users, monthly reporting, etc.

=> Which use case/project made you choose a data orchestrator?

- Python scripts running with the Windows scheduler
- Python functions run with APScheduler as part of a Flask repo

When we switched (3 years ago), Airflow was the top dog. As long as we kept Airflow away from executing workloads and used it only for scheduling, it was fine. People had a lot of trouble understanding the execution/run times and when the next run would happen (especially with non-trivial schedules). I think most companies can get away with keeping their pipeline in the code repo and scheduling with something like APScheduler, Django cron, etc., by keeping their executions in the warehouse and their data pipelines efficient and idempotent. But it gets tough when you need to hit external sources, separate data teams, or processing in Python (vs the warehouse), or if your company has a microservices arch - in those cases having a separate orchestrator will help.

Decided on Airflow because we're heavily invested in AWS and are using MWAA as a result. I prefer Dagster over Airflow, but not by a wide margin, though.
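The "simple scheduler plus idempotent pipelines" approach mentioned in the thread can be sketched with nothing beyond the standard library. This is a hypothetical illustration (the `run_pipeline` function, `completed_runs` store, and hourly run-key scheme are all invented): each (pipeline, period) pair executes at most once, so a duplicate schedule fire or retry is a harmless no-op.

```python
# Hypothetical sketch of the "keep it simple" approach: a plain scheduler
# firing an idempotent pipeline. Idempotency comes from a run key -- one
# (pipeline, period) pair runs at most once, so duplicate fires are no-ops.
from datetime import datetime, timezone

completed_runs: set[tuple[str, str]] = set()  # stands in for warehouse state

def run_pipeline(name: str, now: datetime) -> bool:
    """Run `name` for the hour containing `now`; return False if already done."""
    run_key = (name, now.strftime("%Y-%m-%dT%H"))  # hourly period key
    if run_key in completed_runs:
        return False             # duplicate fire: do nothing
    # ... extract/load work would happen here, ideally inside the warehouse ...
    completed_runs.add(run_key)  # commit the marker together with the results
    return True

now = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)
print(run_pipeline("orders_ingest", now))  # True  - first run this hour
print(run_pipeline("orders_ingest", now))  # False - idempotent re-fire
```

In a real setup the completed-run marker would live in the warehouse alongside the loaded data, and APScheduler or cron would simply call the function on a schedule - the point the comment makes is that correctness lives in the pipeline, not in the scheduler.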