Development

Writing ETL tasks is repetitive. tasks.util contains a number of functions and classes meant to make this easier through reuse.

Utility Functions

These functions are very frequently used within the methods of a new ETL task.

tasks.meta.current_session()

Returns the session relevant to the currently operating Task, if any. Outside the context of a Task, this can still be used for manual session management.

Abstract classes

These are the building blocks of the ETL, and should almost always be subclassed when writing a new process.

Batteries included

Data comes in many flavors, but sometimes it comes in the same flavor over and over again. These tasks are meant to take care of the most repetitive aspects.

Running and Re-Running Pieces of the ETL

When doing local development, it’s advisable to run small pieces of the ETL locally to make sure everything works correctly. You can use the make -- run helper, documented in Run any task. There are several ways to re-run pieces of the ETL, depending on the task. They are described below.

Using --force during development

When developing with abstract classes that offer a force parameter, you can use it to re-run a task that has already been run, ignoring and overwriting all output it has already created. For example, suppose you have a tasks.base_tasks.TempTableTask that you’ve modified in the course of development and need to re-run:

from tasks.base_tasks import TempTableTask
from tasks.meta import current_session

class MyTempTable(TempTableTask):

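    def run(self):
        session = current_session()
        # Fill the placeholder with the name of this task's output table.
        # Assumes the output target exposes that name as a `table` attribute.
        session.execute('''
           CREATE TABLE {} AS SELECT 'foo' AS mycol;
        '''.format(self.output().table))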

Running make -- run path.to.module MyTempTable will only run the task once. Because its output already exists, later invocations do nothing, even after you change the run method.

However, running make -- run path.to.module MyTempTable --force will force the task to be run again, dropping and re-creating the output table.

Deleting byproducts to force a re-run of parts of ETL

In some cases, you may want to re-run a luigi.Task that does not have a force parameter. In such cases, look at its output method and delete whatever files or database tables it created.

Utility classes put their file byproducts in the tmp folder, inside a folder named after the module. They put database byproducts into a schema that is also named after the module.
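As a rough sketch of cleaning up file byproducts by hand, the helper below deletes one module's folder under tmp. The function name remove_file_byproducts and its tmp_root parameter are hypothetical and not part of the ETL, and the exact folder name should be checked against the task's output method before deleting anything.

```python
import shutil
from pathlib import Path


def remove_file_byproducts(module_name, tmp_root='tmp'):
    """Delete the folder of file byproducts for one module.

    Hypothetical helper: assumes byproducts live in <tmp_root>/<module_name>.
    Returns True if a folder was removed, False if there was nothing to delete.
    """
    target = Path(tmp_root) / module_name
    if not target.is_dir():
        return False
    # Remove the folder and everything inside it.
    shutil.rmtree(target)
    return True
```

Calling remove_file_byproducts('path.to.module') from the repository root would remove tmp/path.to.module if it exists. Database byproducts are not touched by this sketch and still need to be dropped separately.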

Update the ETL & metadata through version

When you make changes and improvements, you can increment the value returned by the version method of tasks.base_tasks.TableTask, tasks.base_tasks.ColumnsTask and tasks.base_tasks.TagsTask to force the task to run again.
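A minimal sketch of a version bump follows. TableTaskStandIn is a hypothetical stand-in so that the snippet runs on its own. In the ETL you would subclass tasks.base_tasks.TableTask instead. The default of 0 and the integer return value are assumptions.

```python
class TableTaskStandIn:
    """Hypothetical stand-in for tasks.base_tasks.TableTask."""

    def version(self):
        # Assumed default; the real base class may differ.
        return 0


class MyTable(TableTaskStandIn):

    def version(self):
        # Previously returned 1; raised to 2 after changing the task's
        # logic so that the task is run again.
        return 2
```

Only the returned number changes between revisions of the task. The rest of the class stays as it was.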