In this lesson, you will schedule the two DAGs that you've written previously to run automatically. One of the two will run on a time-based schedule, once every hour. The other DAG will follow a data-aware schedule and run as soon as new embeddings are ingested into the vector database. Let's go.

Running DAGs manually from the Airflow UI is great when developing them and for ad hoc workflows, but most of the time you want your pipelines to run automatically. Imagine you work at a bookstore: new books will be added all the time, and you'd want the pipeline to run to create embeddings from the descriptions of newly added books so your customers can find them. Scheduling pipelines is one of the core functions of Airflow, and there are many different options available to you. In this lesson, we'll go over the two most common types of schedules: cron-based and data-aware. You can find a link to a guide covering all scheduling options in the resources section at the end of this notebook.

In Airflow, you schedule each DAG by adding DAG parameters to the @dag decorator. Let's schedule the DAG that fetches the book data to run automatically at the top of every hour. To do so, add the schedule argument to the @dag decorator and set it to the cron string for every hour, which is a zero followed by four stars ("0 * * * *"), or use the Airflow shorthand for this cron string, which is "@hourly". You also need to give your DAG a start date; after this date, the DAG can run based on its schedule. Always make sure your DAG's start date is in the past. After running the cell to save the change to the dags folder, you can see the schedule displayed in the Airflow UI, both in the DAGs list and on the individual DAG. Note that a change to the DAG's schedule is a structural change and therefore creates a new DAG version. Now, as long as the DAG is unpaused, that is, the blue toggle is set to on, it will automatically run at the top of every hour.

Time-based schedules are very useful, but often you don't know exactly when you want a DAG to run, just that it should run once the right data is ready. In our case, let's make sure the query data DAG runs as soon as the embeddings are loaded into the vector database. This is where data-aware schedules based on Airflow assets come in handy. An asset in Airflow is an object that represents a real or abstract data object anywhere in your pipeline. It can represent a file in cloud storage, a table in a relational database, or, in our case, a whole collection in Weaviate. You, as the DAG author, decide which tasks update which assets by assigning them to the task's outlets parameter. The task does not actually need to interact with the underlying data object, but in this example it does. Whenever this task completes successfully, an asset event will be created for the asset, signaling that the asset has been updated. A downstream DAG, in your case query data, can be scheduled on one or more assets having received a new asset event. You can think of assets like little flags that you give your tasks to wave once they finish successfully, and of a data-aware schedule as telling a DAG to wait for the right combination of flags to have been raised.

Let's give the last task in the fetch data DAG an asset to update. You can add assets using the outlets task parameter. This parameter is available in any task, including traditional Airflow operators. Set the outlets parameter to a list of assets that this task updates. In your case, that task loads embeddings to Weaviate; let's call the asset my_book_vector_data.
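Put together, the fetch data DAG now looks roughly like the sketch below. This is a minimal sketch assuming Airflow 3's TaskFlow API; the task bodies are simplified placeholders, and only the schedule, start_date, and outlets arguments reflect the changes described above.

```python
# Sketch of the scheduled fetch data DAG (Airflow 3 TaskFlow API).
# DAG, task, and asset names follow the lesson; the task bodies are placeholders.
from datetime import datetime

from airflow.sdk import Asset, dag, task


@dag(
    schedule="@hourly",               # shorthand for the cron string "0 * * * *"
    start_date=datetime(2025, 1, 1),  # any date in the past works
)
def fetch_data():
    @task
    def create_embeddings():
        # placeholder: fetch new book descriptions and embed them
        return [{"title": "Example Book", "vector": [0.1, 0.2, 0.3]}]

    # outlets tells Airflow this task updates the my_book_vector_data asset
    @task(outlets=[Asset("my_book_vector_data")])
    def load_embeddings_to_vector_db(embeddings):
        # placeholder: write the embeddings to the Weaviate collection
        print(f"Loaded {len(embeddings)} embeddings")

    load_embeddings_to_vector_db(create_embeddings())


fetch_data()
```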
Looking at the DAG graph in the Airflow UI, you can see the assets your DAG is connected to by opening the options menu and selecting external conditions in the dependencies field. The load_embeddings_to_vector_db task in the fetch data DAG now updates the my_book_vector_data asset. Great.

Now let's schedule the query data DAG based on updates to this asset. In the @dag decorator of the query data DAG, add the schedule argument and set it to the asset this DAG should be waiting for. Make sure that you use the exact same name, my_book_vector_data, as in the asset added to the fetch data DAG. The name is used to identify the asset and connects the upstream DAG, which updates it, to the downstream DAG scheduled on it.

This connection is visible in the Airflow UI. If you click on the Assets button, you can see the my_book_vector_data asset in the assets list. Click on it to open the asset graph, which shows all the DAGs this asset is connected to. From this view, you can pause and unpause the DAGs and navigate directly to them by clicking on their names. You can also manually create an asset event using the blue button in the top right corner of the screen. The two options available are Materialize, which triggers the DAG upstream of the asset to run and, if the load_embeddings_to_vector_db task is successful, creates an asset event, and Manual, which just creates the asset event that causes the downstream DAG to run. The latter option is very useful during development to test your asset schedules. Note that asset events can also be created from outside of Airflow using the Airflow REST API. Nice, your two DAGs are now linked together through a data-aware dependency. When writing complex Airflow pipelines, it is common to have many DAGs linked by many assets so that they all run as soon as the relevant data has been updated, like dominoes.

There are many more DAG parameters besides schedule that you can use to modify the behavior of your DAGs; you can find a link to a full list in the resources section. One more DAG parameter that is useful for your query data DAG is the params parameter. It allows you to modify the form shown when manually triggering a DAG to ask for a specific input. Let's use this parameter to allow our users to input a different value for the query_string parameter whenever they manually run the DAG in the Airflow UI. In the @dag decorator, you can add params and set it to a dictionary of params, where you include a default value for each input. When running the DAG manually, your users will be able to override this default value.

The values from your params dictionary are stored in the Airflow context, a dictionary containing information about your Airflow DAG run that can be accessed inside of Airflow tasks. To access the context, put **context in the definition of your Airflow task. In your case, this will replace the query_string argument, but the Airflow context can also be used alongside any number of task arguments inside the task function. Use the params key on the context variable and index into it with your param name, in this case query_string. The value for query_string is then used by the rest of the task to perform the near-vector search. Make sure to also remove the hardcoded value for query_string from the task call. After saving the changes, the Airflow UI now displays a field for the query string when you manually run the DAG.
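Here is a minimal sketch of how the query data DAG could look with the asset schedule and the params dictionary, again assuming Airflow 3's TaskFlow API; the default query string value and the search body are simplified placeholders.

```python
# Sketch of the asset-scheduled query data DAG with a user-facing param.
from airflow.sdk import Asset, dag, task


@dag(
    # runs whenever load_embeddings_to_vector_db emits an event for this asset
    schedule=[Asset("my_book_vector_data")],
    # default value shown in the manual-trigger form; users can override it
    params={"query_string": "a book about space exploration"},
)
def query_data():
    @task
    def search_vector_db(**context):
        # read the (possibly overridden) param from the Airflow context
        query_string = context["params"]["query_string"]
        # placeholder: embed query_string and run a near-vector search in Weaviate
        print(f"Searching for books matching: {query_string}")

    search_vector_db()  # note: no hardcoded query_string argument anymore


query_data()
```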
I think I want to read a book about sleep, so let's use that as an input. Checking the logs of the DAG run, you can see that the task now recommends a different book, one about lucid dreaming. Perfect. If you want to add your own file with book descriptions to query against for this lesson, run the helper cell. Awesome! Your DAGs are now scheduled, and your query data DAG includes a handy option for users to query for different books from the Airflow UI. The DAGs are almost ready for production. In the next lesson, we'll focus on how to parallelize this pipeline even more to make it more efficient and easier to troubleshoot.