One of the fundamental features of Apache Airflow is the ability to schedule jobs. Historically, Airflow users scheduled their DAGs by specifying a schedule with a cron expression, a timedelta object, or a preset Airflow schedule. Timetables, released in Airflow 2.2, allow users to create their own custom schedules using Python, effectively eliminating the limitations of cron. With timetables, you can now schedule DAGs to run at any time. Datasets, introduced in Airflow 2.4, let you schedule your DAGs on updates to a dataset rather than a time-based schedule. For more information about datasets, see Datasets and Data-Aware Scheduling in Airflow.

In this guide, you'll learn Airflow scheduling concepts and the different ways you can schedule a DAG, with a focus on timetables. All code used in this guide is available in the airflow-scheduling-tutorial repository. For a video overview of these concepts, see the Scheduling in Airflow webinar.

To get the most out of this guide, you should have an existing knowledge of:

To gain a better understanding of DAG scheduling, it's important that you become familiar with the following terms and parameters. See the Python documentation on the datetime package.

Data Interval: A property of each DAG run that represents the period of data that each task should operate on.

DAG size (task count, really) will impact the scheduler and the UI. We optimized for many smaller DAGs rather than really big DAGs. There's a PR pending that should improve scheduler performance dramatically with larger task counts. We are thinking about cleanup functions for the SQL database and the redis database. We use celery for the airflow executors, but we also use celery outside of (pre-dating) airflow. I wired the airflow "execute_command" task into our celery deployment so that we don't have to run two separate celery deployments. We have thousands of workers on our prod celery deployment, so we're not too worried about workers getting busy.
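Returning to the Data Interval term defined in the guide text above, here is a minimal stdlib-only sketch of the idea. This is not an Airflow API; the helper name `data_interval_for` is invented for illustration. It mirrors Airflow's convention that a run with a given logical date covers the interval starting at that date and only becomes eligible to run once the interval has ended.

```python
from datetime import datetime, timedelta

def data_interval_for(logical_date: datetime, interval: timedelta):
    """Return (start, end) of the data interval for a DAG run.

    Illustrative only: a run with logical date L covers
    [L, L + interval) and is scheduled after the interval ends.
    """
    start = logical_date
    end = logical_date + interval
    return start, end

# A daily run with logical date 2023-01-01 covers Jan 1's data and
# first becomes eligible to run at midnight on Jan 2.
start, end = data_interval_for(datetime(2023, 1, 1), timedelta(days=1))
print(start, end)  # 2023-01-01 00:00:00 2023-01-02 00:00:00
```

The key design point this sketch captures is that a task operates on a *period* of data, not a single instant, which is why scheduling decisions key off the interval's end rather than its start.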
We're testing a DAG right now with schedule_interval = "* * * * *", aka every minute. It's a "worker" DAG that pops a batch of work off a redis queue and then processes it in multiple steps. We set max_active_runs = 20 in the DAG args, which limits the concurrency; we'll probably test up to 50-60 concurrent DAG runs and see what breaks. Yes, it will create more logs, database I/O, etc. I would like to know your thoughts on that too.
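To make the `max_active_runs` cap from the comment above concrete, here is a small stand-alone sketch (plain Python, no Airflow; the function and variable names are invented for illustration) of how such a limit throttles an every-minute schedule: due runs are only started while fewer than the cap are active, and the rest stay queued.

```python
from datetime import datetime, timedelta

MAX_ACTIVE_RUNS = 20  # mirrors max_active_runs = 20 in the dag args

def start_eligible_runs(due_times, active_runs, max_active=MAX_ACTIVE_RUNS):
    """Start queued runs until the active-run cap is reached.

    due_times:   scheduled times whose run has not started yet (FIFO)
    active_runs: times of runs currently executing
    Returns the list of runs started this scheduling cycle.
    """
    started = []
    while due_times and len(active_runs) < max_active:
        run_time = due_times.pop(0)   # oldest due run first
        active_runs.append(run_time)
        started.append(run_time)
    return started

# 30 one-minute ticks are due, but only 20 may run at once.
base = datetime(2023, 1, 1)
due = [base + timedelta(minutes=i) for i in range(30)]
active = []
started = start_eligible_runs(due, active)
print(len(started), len(due))  # 20 10
```

Under this model a backlog simply waits in the queue, which is why a one-minute schedule with a cap of 20 degrades gracefully rather than fanning out unboundedly when runs take longer than a minute.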