Validating your code
====================

.. contents::
   :local:
   :depth: 2

Best practices
--------------

Writing ETL code is meant to be open-ended, but there are some standards that should be followed to keep the code clean and clear.

Proper use of utility classes
*****************************

There are extensive :ref:`abstract-classes` available for development.  These can do things like download and unzip a file to disk (:ref:`tasks.base_tasks.DownloadUnzipTask`) and import a CSV on disk to a temporary table (:ref:`tasks.base_tasks.CSV2TempTableTask`).

These classes should be used whenever possible to minimize specialized ETL code.  In particular, these tasks save their output to well-known locations, which avoids redundantly running the same tasks.

Clearly documented command that runs a WrapperTask to create everything
*************************************************************************

Oftentimes an ETL will have to loop over a parameter to get all the data -- for example, if a dataset is available online year-by-year, it may make sense to write a single task that downloads one year's file, with a parameter specifying which year.

:ref:`luigi.WrapperTask` is a way to make sure such tasks are executed for every relevant parameter programmatically.  A powerful example of this can be found in :ref:`tasks.us.AllZillow`, which executes a :ref:`tasks.us.zillow.Zillow` task once for each geography level, year, and month in that year.

A generic example of using a :ref:`luigi.WrapperTask`:

.. code:: python

    from luigi import WrapperTask, Task, Parameter


    class MyTask(Task):
        '''
        This task needs to be run for each possible `geog`.
        '''

        geog = Parameter()

        def run(self):
            pass

        def output(self):
            pass


    class MyWrapperTask(WrapperTask):
        '''
        Execute `MyTask` once for each possible `geog`.
        '''

        def requires(self):
            for geog in ('state', 'county', 'city'):
                yield MyTask(geog=geog)

Use parameters only when necessary
**********************************

Tasks are unique to their parameters.  In other words, if a task is run once with a certain set of parameters, it will not be run again unless the output it generated is deleted.

Therefore it's very important not to have parameters in a task's definition that do not affect its result.  If you have such extraneous parameters, it becomes possible to run the same task redundantly.

An example of this:

.. code:: python

    from luigi import Parameter

    from tasks.base_tasks import DownloadUnzipTask


    class MyBadTask(DownloadUnzipTask):

        goodparam = Parameter()
        badparam = Parameter()

        def url(self):
            return 'http://somesite/with/data/{}'.format(self.goodparam)

:ref:`tasks.base_tasks.DownloadUnzipTask` generates the location for a unique output file automatically based on all of the task's parameters, but ``badparam`` above doesn't actually affect the file being downloaded.  That means that if we change ``badparam``, we'll download the same file twice.

Use default parameter values sparingly
**************************************

The above bad practice is easily compounded by setting default values for parameters.  For example:

.. code:: python

    from luigi import Parameter

    from tasks.base_tasks import DownloadUnzipTask


    class MyBadTask(DownloadUnzipTask):
        '''
        My URL doesn't depend on `badparam`!
        '''

        goodparam = Parameter()
        badparam = Parameter(default='foo')

        def url(self):
            return 'http://somesite/with/data/{}'.format(self.goodparam)

Now it's easy to forget that ``badparam`` even exists!  But it still affects the output filename, making it noisy and less clear which parameters actually matter.
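For contrast, here is a minimal sketch of the same task with the extraneous parameter removed, so that every declared parameter actually affects the downloaded file (the URL pattern and parameter name are placeholders carried over from the example above):

.. code:: python

    from luigi import Parameter

    from tasks.base_tasks import DownloadUnzipTask


    class MyGoodTask(DownloadUnzipTask):
        '''
        Every declared parameter affects the URL, so the automatically
        generated output location is unique per distinct download.
        '''

        goodparam = Parameter()

        def url(self):
            return 'http://somesite/with/data/{}'.format(self.goodparam)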
Keep fewer than 1000 columns per table
**************************************

Postgres has a hard limit on the number of columns in a table (1,600).  If you create a :ref:`tasks.base_tasks.TableTask` whose ``columns`` method returns an ``OrderedDict`` with many more than 1000 columns, the task will fail.

In such cases, you'll want to split your :ref:`tasks.base_tasks.TableTask` into several pieces, likely pulling columns from the same :ref:`tasks.base_tasks.ColumnsTask`.  There is no limit on the number of columns in a :ref:`tasks.base_tasks.ColumnsTask`.

Each geometry column should have a unique ``geom_ref`` column with it
**********************************************************************

When setting up a :ref:`tasks.base_tasks.ColumnsTask` for geometries, make sure that you store a meaningful and unique ``geom_ref`` from the same table.

* It is meaningful if data sources elsewhere use it to refer to that geometry -- for example, FIPS codes are meaningful references to county geometries in the USA.  However, the automatically generated serial ``ogc_fid`` column from a Shapefile is not meaningful.
* It is unique if the ``geom_ref`` column has an ID that is not duplicated by any other column.

For example:

.. code:: python

    from collections import OrderedDict

    from luigi import Parameter

    from tasks.base_tasks import ColumnsTask
    from tasks.meta import OBSColumn, GEOM_REF


    class MyGeoColumnsTask(ColumnsTask):

        resolution = Parameter()

        def columns(self):
            geom = OBSColumn(
                id=self.resolution,
                type='Geometry')

            geomref = OBSColumn(
                id=self.resolution + '_id',  # Make sure we have "+ '_id'"!
                type='Text',
                targets={geom: GEOM_REF})

            return OrderedDict([
                ('geom', geom),
                ('geomref', geomref)
            ])

No matter what ``resolution`` this task is passed, it will generate a unique ID for both the ``geom`` and the ``geomref``.  If the ``+ '_id'`` concatenation were missing, the metadata model would not properly link geomrefs to the geometries they refer to.

Specify section, subsection, source, and license tags for all columns
***********************************************************************

When defining your :ref:`tasks.meta.OBSColumn` objects in a :ref:`tasks.base_tasks.ColumnsTask` class, make sure each column is assigned an :ref:`tasks.meta.OBSTag` of each type: ``section``, ``subsection``, ``source``, and ``license``.

Use shared tags from :ref:`tasks.tags` when possible, in particular for ``section`` and ``subsection``.

Specify unit tags for all measure columns
*****************************************

When defining a :ref:`tasks.meta.OBSColumn` that will hold a measurement, make sure to define a ``unit`` using a tag.  This could be something like ``people``, ``money``, etc.  There are standard units accessible in :ref:`tasks.tags`.

Making sure ETL code works right
--------------------------------

After having written an ETL, you'll want to double check all of the following to make sure the code is usable.

Results and byproducts are being generated
******************************************

When you use :ref:`run-any-task` to run individual components (see the example command after this list):

* Were any exceptions thrown?  On what task were they thrown?  With which arguments?
* Are appropriate files being generated in the ``tmp`` folder?
* Are tables being created in the relevant ``tmp`` schema?
* Are tables and columns being added to the ``observatory.obs_table`` and ``observatory.obs_column`` metadata tables?
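For example, a single task from the examples above could be run with a command along these lines -- the module path, task name, and parameter are hypothetical, and parameters are assumed to be passed as ``--param value`` flags:

.. code:: shell

    make -- run path.to.module MyTask --geog state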
Provided that the :ref:`tasks.base_tasks.TableTask` and :ref:`tasks.base_tasks.ColumnsTask` classes were executed, it's wise to jump into the database and check that entries were made in those tables.

.. code:: shell

    make psql

.. code:: sql

    SELECT COUNT(*) FROM observatory.obs_column WHERE id LIKE 'path.to.module.%';
    SELECT COUNT(*) FROM observatory.obs_table WHERE id LIKE 'path.to.module.%';
    SELECT COUNT(*) FROM observatory.obs_column_table
    WHERE column_id LIKE 'path.to.module.%'
      AND table_id LIKE 'path.to.module.%';

Delete old data to start from scratch to make sure everything works
*********************************************************************

When using the proper utility classes, your data on disk -- for example, from downloads that are part of the ETL -- will be saved to a file or folder ``tmp/module.name/ClassName_Args``.

In order to make sure the ETL is reproducible, it's wise to delete this folder (or move it to another location) after development, and re-run the ETL to make sure the whole process can still run from start to finish.

Making sure metadata works right
--------------------------------

Checking that the metadata works right is one of the more challenging parts of QA'ing new ETL code.

Regenerate the ``obs_meta`` table
*********************************

The ``obs_meta`` table is a denormalized view of the underlying :ref:`metadata` objects that you've created by running tasks.  You can force the regeneration of this table using :ref:`tasks.carto.OBSMetaToLocal`:

.. code:: shell

    make -- run carto OBSMetaToLocal --force

Once the table is generated, you can take a look at it in SQL:

.. code:: shell

    make psql

If the metadata is working correctly, you should have more entries in ``obs_meta`` than before.  If you were starting from nothing, there should be more than 0 rows in the table.

.. code:: sql

    SELECT COUNT(*) FROM observatory.obs_meta;

If you already had data, you can filter ``obs_meta`` to look for new rows with a schema corresponding to what you added.  For example, if you added metadata columns and tables in ``tasks/mx/inegi``, you should look for columns with that schema:

.. code:: sql

    SELECT COUNT(*) FROM observatory.obs_meta WHERE numer_id LIKE 'mx.inegi.%';

If nothing is appearing in ``obs_meta``, chances are you are missing some metadata:

Have you defined and executed a proper :ref:`tasks.base_tasks.TableTask`?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can check whether the links between your columns and tables exist by checking ``obs_column_table``:

.. code:: shell

    make psql

.. code:: sql

    SELECT COUNT(*) FROM observatory.obs_column_table
    WHERE column_id LIKE 'my.schema.%'
      AND table_id LIKE 'my.schema.%';

If they don't exist, make sure that your Python code roughly corresponds to:

.. code:: python

    from tasks.base_tasks import ColumnsTask, TableTask


    class MyColumnsTask(ColumnsTask):

        def columns(self):
            # Return an OrderedDict of columns here
            pass


    class MyTableTask(TableTask):

        def table_timespan(self):
            # Return the timespan here
            pass

        def requires(self):
            return {
                'columns': MyColumnsTask()
            }

        def columns(self):
            return self.input()['columns']

        def populate(self):
            # Populate the output table here
            pass

Unless the :ref:`TableTask` returns some of the columns from the :ref:`ColumnsTask` in its own ``columns`` method, the links will not be initialized properly.

Finally, double check that you actually ran the :ref:`TableTask` using ``make -- run my.schema MyTableTask``.
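If the ``columns`` method of your :ref:`ColumnsTask` is still just a stub, here is a minimal sketch of what it might return -- the column ID and type are hypothetical, and real columns would also carry the tags described under the best practices above:

.. code:: python

    from collections import OrderedDict

    from tasks.base_tasks import ColumnsTask
    from tasks.meta import OBSColumn


    class MyColumnsTask(ColumnsTask):

        def columns(self):
            # A single hypothetical measure column; real tasks would also
            # attach section, subsection, source, license, and unit tags.
            total_pop = OBSColumn(
                id='total_pop',
                type='Numeric')
            return OrderedDict([
                ('total_pop', total_pop)
            ])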
Are you defining ``geom_ref`` relationships properly?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In cases where a :ref:`TableTask` does not have its own geometries, at least one of the columns returned from its ``columns`` method needs to be in a ``geom_ref`` relationship.  Here's an example:

.. code:: python

    from collections import OrderedDict

    from tasks.base_tasks import ColumnsTask, TableTask
    from tasks.meta import OBSColumn, GEOM_REF


    class MyGeoColumnsTask(ColumnsTask):

        def columns(self):
            geom = OBSColumn(
                type='Geometry')

            geomref = OBSColumn(
                type='Text',
                targets={geom: GEOM_REF})

            return OrderedDict([
                ('geom', geom),
                ('geomref', geomref)
            ])


    class MyColumnsTask(ColumnsTask):

        def columns(self):
            # Return an OrderedDict of columns here
            pass


    class MyTableTask(TableTask):

        def table_timespan(self):
            # Return the timespan here
            pass

        def requires(self):
            return {
                'geom_columns': MyGeoColumnsTask(),
                'data_columns': MyColumnsTask()
            }

        def columns(self):
            cols = OrderedDict()
            cols['geomref'] = self.input()['geom_columns']['geomref']
            cols.update(self.input()['data_columns'])
            return cols

        def populate(self):
            # Populate the output table here
            pass

The above code ensures that all columns existing inside ``MyTableTask`` are appropriately linked to any geometries that connect to ``geomref``.

Do you have both the data and geometries in your table?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can check by running:

.. code:: sql

    SELECT * FROM observatory.obs_table WHERE id LIKE 'my.schema.%';

If there is only one table and it has a null ``the_geom`` boundary, then you are missing a geometry table.  For example:

.. code:: sql

    SELECT * FROM observatory.obs_table WHERE id LIKE 'es.ine.five_year_population%';

.. code:: shell

                        id                   |                  tablename                   | timespan | the_geom | description | version
    -----------------------------------------+----------------------------------------------+----------+----------+-------------+---------
     es.ine.five_year_population_99914b932b  | obs_24b656e9e23d1dac2c8ab5786a388f9bf0f4e5ae | 2015     |          |             |       5
    (1 row)

Notice that ``the_geom`` is empty.  You will need to write a second :ref:`TableTask` with the following structure:

.. code:: python

    class Geometry(TableTask):

        def table_timespan(self):
            # Return the timespan here
            pass

        def requires(self):
            return {
                'meta': MyGeoColumnsTask(),
                'data': RawGeometry()
            }

        def columns(self):
            return self.input()['meta']

        def populate(self):
            # Populate the output table here
            pass

Regenerate and look at the Catalog
----------------------------------

Once :ref:`tasks.carto.OBSMetaToLocal` has been run, you can generate the catalog:

.. code:: shell

    make catalog

You can view the generated catalog in a browser window by going to the IP address and port of the nginx process.  The current processes are shown with ``docker-compose ps`` or ``make ps``.

1. Are there any nasty typos or missing data?

   * Variable names should be unique, human-readable, and concise.  If the variable needs a more in-depth definition, that should go in the "description" of the variable.

2. Does the nesting look right?  Are there columns that are not nested?

   * Variables that are denominators should also have subcolumns of directly nested variables.
   * There may be repetitive nesting if a variable is nested under two denominators, which is fine.

3. Are sources and licenses populated for all measures?

   * A source and license :ref:`tasks.meta.OBSTag` must be written for new sources and licenses (see the sketch after this list).

4. Is a table with a boundary/timespan matrix appearing beneath each measure?

   * If not, hardcode the sample latitude and longitude in :ref:`tasks.meta.catalog_lonlat`.
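As a rough sketch of what writing new source and license tags might look like -- this assumes a ``TagsTask`` base class is available from ``tasks.base_tasks`` alongside ``ColumnsTask`` and ``TableTask``, and that ``OBSTag`` accepts ``id``, ``type``, ``name``, and ``description``; the tag IDs and values here are hypothetical:

.. code:: python

    from tasks.base_tasks import TagsTask  # assumed base class, see note above
    from tasks.meta import OBSTag


    class MySourceLicenseTags(TagsTask):

        def tags(self):
            # Hypothetical source and license tags; real values should describe
            # where the data comes from and how it may be used.
            return [
                OBSTag(id='my-source',
                       type='source',
                       name='My data source',
                       description='Where the data was obtained from'),
                OBSTag(id='my-license',
                       type='license',
                       name='My data license',
                       description='Terms under which the data may be used'),
            ]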
Upload to a test CARTO server
-----------------------------

If you set ``CARTODB_API_KEY`` and ``CARTODB_URL`` in your ``.env`` file, in the format:

.. code:: shell

    CARTODB_API_KEY=your_api_key
    CARTODB_URL=https://username.carto.com

you will be able to upload your data and metadata to CARTO for previewing:

.. code:: shell

    make sync