Welcome to the MovingPandas & DVC Session @ OpenGeoHub 2023!¶

This tutorial relies heavily on DVC and MovingPandas.
It consists of three main parts:
- Tracking a dataset with DVC
- Implementing a MovingPandas analysis
- Tracking our analysis workflow with a DVC pipeline
Setup¶
Make sure to follow the instructions in the README.md to set up your Python environment.
Tracking datasets¶
In this first part, we will initialize DVC and configure it to keep track of our mobility dataset.
Initializing DVC¶
To initialize DVC in the 0-opengeohub-session\start
directory, run:
dvc init --subdir
This creates a .dvc directory and a .dvcignore file, and you should see the output:
Initialized DVC repository.
You can now commit the changes to git.
Downloading a dataset¶
Next, we will download our tutorial dataset, a CSV file containing boat locations:
dvc get https://github.com/movingpandas/movingpandas-examples data/boat-positions.csv -o data\boat-positions.csv
Start tracking the dataset¶
Let's track the data\boat-positions.csv
file:
dvc add .\data\boat-positions.csv
To enable auto staging, run:
dvc config core.autostage true
To track the changes with GIT, run:
git add 'data\.gitignore' 'data\boat-positions.csv.dvc' '.dvc\config'
Let's check the status:
dvc status
This should output:
Data and pipelines are up to date.
Finally, let's commit the initialized DVC setup to GIT:
git commit -m "Add dvc"
As confirmation, GIT lists the new DVC configuration files, including boat-positions.csv.dvc, which is the placeholder for our dataset. Instead of pushing the whole dataset to the GIT repo, only the placeholder is committed. This ensures that the GIT repo is not flooded with (potentially huge) datasets:
[opengeohub2023 a59662e] Add dvc
5 files changed, 12 insertions(+)
create mode 100644 0-opengeohub-session/start/.dvc/.gitignore
create mode 100644 0-opengeohub-session/start/.dvc/config
create mode 100644 0-opengeohub-session/start/.dvcignore
create mode 100644 0-opengeohub-session/start/data/.gitignore
create mode 100644 0-opengeohub-session/start/data/boat-positions.csv.dvc
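To make the placeholder idea more concrete, the sketch below computes the kind of information a .dvc file records for a tracked file: an MD5 content hash and the file size. This is a simplified illustration and not DVC's own implementation. The helper name describe_file and the sample rows are hypothetical, and the exact hashing details may differ between DVC versions.

```python
import hashlib
import os
import tempfile


def describe_file(path):
    """Return an MD5 hex digest and the byte size of a file (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that large datasets do not have to fit in memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return {"md5": digest.hexdigest(), "size": os.path.getsize(path)}


# Demo on a throwaway file with made-up content (not the tutorial dataset).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "w", newline="") as f:
        f.write("id,t,lon,lat\n1,01/01/2021 00:00,32.0,31.0\n")
    before = describe_file(sample)

    # Appending a row changes the content, so the hash changes as well.
    with open(sample, "a", newline="") as f:
        f.write("1,01/01/2021 00:10,32.1,31.1\n")
    after = describe_file(sample)

print("hash changed:", before["md5"] != after["md5"])
```

Because the placeholder stores only a small fingerprint like this, GIT can version it cheaply while the data itself stays outside the repo.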
Handling dataset modifications¶
Let's clean up the column names. Change the header in data\boat-positions.csv
to
id,t,lon,lat
Save the changes and let's check the DVC status again:
dvc status
This will output:
data\boat-positions.csv.dvc:
changed outs:
modified: data\boat-positions.csv
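The header edit described above can be made in any text editor. As an alternative, here is a minimal sketch of how it could be scripted. The helper replace_header is hypothetical, it reads the whole file into memory, and the demo uses a throwaway file with placeholder column names rather than the real dataset.

```python
import os
import tempfile


def replace_header(path, new_header):
    """Replace the first line of a text file with new_header (hypothetical helper)."""
    with open(path, "r", newline="") as f:
        lines = f.readlines()
    if not lines:
        raise ValueError(f"{path} is empty")
    # Preserve the line ending style that the file already uses.
    ending = "\r\n" if lines[0].endswith("\r\n") else "\n"
    lines[0] = new_header + ending
    with open(path, "w", newline="") as f:
        f.writelines(lines)


# Demo on a temporary file; the original column names here are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.csv")
    with open(sample, "w", newline="") as f:
        f.write("col_a,col_b,col_c,col_d\n1,01/01/2021 00:00,32.0,31.0\n")
    replace_header(sample, "id,t,lon,lat")
    with open(sample, "r", newline="") as f:
        result_lines = f.readlines()

print(result_lines[0].strip())
```

For the tutorial file, the call would be replace_header("data/boat-positions.csv", "id,t,lon,lat"), after which dvc status reports the file as modified, just as it does after a manual edit.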
Let's commit our changes to DVC:
dvc commit
This will ask for our confirmation:
outputs ['data\\boat-positions.csv'] of stage: 'data\boat-positions.csv.dvc' changed. Are you sure you want to commit it? [y/n]
When we check the DVC status again:
dvc status
We now get:
Data and pipelines are up to date.
Next, let's check the GIT status:
git status
Since the status shows that data\boat-positions.csv.dvc
has changed, we should git add it:
git add .\data\boat-positions.csv.dvc
Then we can commit the new data\boat-positions.csv.dvc
to GIT:
git commit -m "Update header"
Undoing changes¶
To revert our changes and go back to the previous file version, run:
git checkout HEAD~1 .\data\boat-positions.csv.dvc
dvc checkout
Which will show that the CSV file has been modified:
M data\boat-positions.csv
Checking the DVC status:
dvc status
Shows:
Data and pipelines are up to date.
When we look at the CSV file now, the header has reverted back to the original.
To return to the latest version with our nice short column names, change HEAD~1
to HEAD
and run:
git checkout HEAD .\data\boat-positions.csv.dvc
dvc checkout
Which will again confirm that the CSV has been changed:
M data\boat-positions.csv
To find the correct version of a file, we can have a look at the GIT commit log:
git log --oneline
This outputs a log similar to:
ab17c8f (HEAD -> opengeohub2023) Update header
207a496 Add dvc
You may also use the commit hash (e.g. 207a496 instead of HEAD~1) to access a specific commit directly.
Setting up a data pipeline¶
For this tutorial, we will implement a stop extraction analysis using MovingPandas. For the development of this analysis from scratch, head over to solution/notebook.ipynb.
After we have decided how our analysis should work, we can automate it and track it using a DVC data pipeline.
To do so, we first need a script that implements the data processing.
Creating a first analysis script¶
Let's create a small analysis script called extract-stops.py
that extracts stops from the boat trajectories:
import pandas as pd
import movingpandas as mpd
from datetime import timedelta
import warnings

# Silence library warnings to keep the console output readable.
warnings.filterwarnings('ignore')


def run():
    print("Reading data ...")
    df = pd.read_csv("./data/boat-positions.csv", sep=",")
    # Timestamps are stored as day/month/year hour:minute.
    df['t'] = pd.to_datetime(df['t'], format='%d/%m/%Y %H:%M')

    print("Creating trajectories ...")
    tc = mpd.TrajectoryCollection(df, traj_id_col="id", t="t", x="lon", y="lat")

    print("Extracting stops ...")
    stop_detector = mpd.TrajectoryStopDetector(tc, n_threads=3)
    # A stop: the boat stays within max_diameter for at least min_duration.
    stops = stop_detector.get_stop_points(max_diameter=1000, min_duration=timedelta(hours=1))
    print(stops)

    print("Saving results ...")
    stops.to_file('stops.geojson', driver='GeoJSON')
    print("SUCCESS! Created output stops.geojson")


if __name__ == '__main__':
    run()
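To give an intuition for what the two parameters max_diameter and min_duration control, here is a deliberately simplified stop detector in plain Python. It is not MovingPandas' implementation: it assumes planar coordinates in meters and timestamps in seconds, checks all point pairs in a window (which is slow for long tracks), and the function names and the sample track are hypothetical.

```python
from math import hypot


def fits(window, max_diameter):
    """True if every pair of points in the window is at most max_diameter apart."""
    return all(
        hypot(a[1] - b[1], a[2] - b[2]) <= max_diameter
        for a in window
        for b in window
    )


def detect_stops(points, max_diameter, min_duration):
    """Return (start_time, end_time) pairs for stops in a time-ordered track.

    points: list of (t_seconds, x_m, y_m) tuples in a planar coordinate system.
    """
    stops = []
    start = 0
    while start < len(points):
        end = start
        # Grow the window while all of its points stay close together.
        while end + 1 < len(points) and fits(points[start:end + 2], max_diameter):
            end += 1
        duration = points[end][0] - points[start][0]
        if duration >= min_duration:
            stops.append((points[start][0], points[end][0]))
            start = end + 1  # continue after the detected stop
        else:
            start += 1  # no stop starts here; try the next point
    return stops


# Hypothetical track: one hour of near-standstill, then fast movement.
track = [
    (0, 0.0, 0.0),
    (1800, 10.0, 0.0),
    (3600, 20.0, 5.0),
    (5400, 5000.0, 0.0),
    (7200, 10000.0, 0.0),
]
stops = detect_stops(track, max_diameter=1000, min_duration=3600)
print(stops)
```

With these values, only the first three points form a stop: they stay within 1000 m of each other for a full hour, while the remaining points are too far apart.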
Let's run the script:
python .\extract-stops.py
This should output:
Reading data ...
Creating trajectories ...
Extracting stops ...
geometry start_time end_time traj_id duration_s
stop_id
2_2021-03-21 08:29:00 POINT (32.35567 31.21248) 2021-03-21 08:29:00 2021-03-21 23:11:00 2 52920.0
3_2021-03-24 04:15:00 POINT (32.33328 31.42789) 2021-03-24 04:15:00 2021-03-24 06:24:00 3 7740.0
4_2021-03-23 22:23:00 POINT (32.32796 31.39507) 2021-03-23 22:23:00 2021-03-24 12:46:00 4 51780.0
5_2021-03-20 10:15:00 POINT (32.35727 31.21790) 2021-03-20 10:15:00 2021-03-21 03:55:00 5 63600.0
6_2021-03-23 23:59:00 POINT (32.39495 30.34766) 2021-03-23 23:59:00 2021-03-24 02:26:00 6 8820.0
... ... ... ... ... ...
250_2021-03-23 17:17:00 POINT (32.53383 29.83297) 2021-03-23 17:17:00 2021-03-24 12:21:00 250 68640.0
251_2021-03-23 22:21:00 POINT (32.35181 31.45093) 2021-03-23 22:21:00 2021-03-24 12:48:00 251 52020.0
252_2021-03-20 00:24:00 POINT (32.32610 31.45075) 2021-03-20 00:24:00 2021-03-20 02:54:00 252 9000.0
255_2021-03-23 08:52:00 POINT (32.57538 29.85072) 2021-03-23 08:52:00 2021-03-24 11:14:00 255 94920.0
256_2021-03-20 00:25:00 POINT (32.32908 31.19485) 2021-03-20 00:25:00 2021-03-24 12:35:00 256 389400.0
[318 rows x 5 columns]
Saving results ...
SUCCESS! Created output stops.geojson
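As a quick sanity check of the duration_s column, the following sketch recomputes the duration for the first two rows of the output above and compares it against the one-hour min_duration used in the script. Only the start and end times are copied from the output; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Start and end times copied from the first two rows of the output above.
rows = [
    ("2021-03-21 08:29:00", "2021-03-21 23:11:00"),
    ("2021-03-24 04:15:00", "2021-03-24 06:24:00"),
]
min_duration = timedelta(hours=1)  # same threshold as in the script

durations = []
for start, end in rows:
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    durations.append(delta.total_seconds())
    # Every reported stop should last at least min_duration.
    print(delta.total_seconds(), delta >= min_duration)
```

The results, 52920.0 and 7740.0 seconds, match the duration_s values in the table, and both exceed the one-hour threshold.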
Configuring our first pipeline stage¶
Now, we can configure DVC to run our script. To do that, we create a DVC stage with the name stop-extraction
that uses our Python script and the CSV data to create stops.geojson:
dvc stage add -n stop-extraction -d extract-stops.py -d data/boat-positions.csv -o stops.geojson python extract-stops.py
DVC confirms this with:
Added stage 'stop-extraction' in 'dvc.yaml'
For more info check https://dvc.org/doc/start/data-management/data-pipelines#pipeline-stages
The dvc.yaml
should now look like this:
stages:
stop-extraction:
cmd: python extract-stops.py
deps:
- data/boat-positions.csv
- extract-stops.py
outs:
- stops.geojson
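The fields in dvc.yaml correspond directly to the flags of the dvc stage add command used above: -n sets the stage name, each -d adds a dependency, and each -o adds an output. The following sketch makes this mapping explicit by assembling the command from a stage description. The helper stage_add_command and the dictionary layout are illustrative and not part of DVC.

```python
import shlex


def stage_add_command(name, deps, outs, cmd):
    """Assemble a `dvc stage add` command line from a stage description."""
    args = ["dvc", "stage", "add", "-n", name]
    for dep in deps:
        args += ["-d", dep]  # one -d flag per dependency
    for out in outs:
        args += ["-o", out]  # one -o flag per output
    args += shlex.split(cmd)  # the command to run comes last
    return shlex.join(args)


# Stage description mirroring the stop-extraction stage from this tutorial.
stage = {
    "name": "stop-extraction",
    "deps": ["extract-stops.py", "data/boat-positions.csv"],
    "outs": ["stops.geojson"],
    "cmd": "python extract-stops.py",
}
command = stage_add_command(**stage)
print(command)
```

This prints the same command that we ran to create the stage. Note that DVC lists the dependencies in sorted order in dvc.yaml, regardless of the order of the -d flags.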
To run our new pipeline, we use the repro command:
dvc repro
Which will execute our one-stage pipeline:
'data\boat-positions.csv.dvc' didn't change, skipping
Running stage 'stop-extraction':
> python extract-stops.py
Reading data ...
Creating trajectories ...
Extracting stops ...
geometry start_time end_time traj_id duration_s
stop_id
2_2021-03-21 08:29:00 POINT (32.35567 31.21248) 2021-03-21 08:29:00 2021-03-21 23:11:00 2 52920.0
3_2021-03-24 04:15:00 POINT (32.33328 31.42789) 2021-03-24 04:15:00 2021-03-24 06:24:00 3 7740.0
4_2021-03-23 22:23:00 POINT (32.32796 31.39507) 2021-03-23 22:23:00 2021-03-24 12:46:00 4 51780.0
5_2021-03-20 10:15:00 POINT (32.35727 31.21790) 2021-03-20 10:15:00 2021-03-21 03:55:00 5 63600.0
6_2021-03-23 23:59:00 POINT (32.39495 30.34766) 2021-03-23 23:59:00 2021-03-24 02:26:00 6 8820.0
... ... ... ... ... ...
250_2021-03-23 17:17:00 POINT (32.53383 29.83297) 2021-03-23 17:17:00 2021-03-24 12:21:00 250 68640.0
251_2021-03-23 22:21:00 POINT (32.35181 31.45093) 2021-03-23 22:21:00 2021-03-24 12:48:00 251 52020.0
252_2021-03-20 00:24:00 POINT (32.32610 31.45075) 2021-03-20 00:24:00 2021-03-20 02:54:00 252 9000.0
255_2021-03-23 08:52:00 POINT (32.57538 29.85072) 2021-03-23 08:52:00 2021-03-24 11:14:00 255 94920.0
256_2021-03-20 00:25:00 POINT (32.32908 31.19485) 2021-03-20 00:25:00 2021-03-24 12:35:00 256 389400.0
[318 rows x 5 columns]
Saving results ...
SUCCESS! Created output stops.geojson
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
Note that if we run dvc repro
a second time, we get:
'data\boat-positions.csv.dvc' didn't change, skipping
Stage 'stop-extraction' didn't change, skipping
Data and pipelines are up to date.
This means that DVC knows that neither the input data nor the analysis script changed and, therefore, it is not necessary to re-run the stage.
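The idea behind this check can be sketched in a few lines: record a content hash for every dependency after a run, and compare the current hashes against the recorded ones before the next run. This is a conceptual illustration only and not DVC's code. The helper names and the temporary files are hypothetical, and the recorded dictionary merely plays the role of dvc.lock.

```python
import hashlib
import os
import tempfile


def file_md5(path):
    """Return the MD5 hex digest of a file's content."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def snapshot(deps):
    """Record the current hash of each dependency (stand-in for a lock file)."""
    return {dep: file_md5(dep) for dep in deps}


def changed_deps(deps, recorded):
    """List the dependencies whose content differs from the recorded hashes."""
    return [dep for dep in deps if recorded.get(dep) != file_md5(dep)]


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "script.py")
    data = os.path.join(tmp, "data.csv")
    with open(script, "w") as f:
        f.write("print('v1')\n")
    with open(data, "w") as f:
        f.write("id,t,lon,lat\n")

    recorded = snapshot([script, data])
    # Nothing has changed since the snapshot, so the stage could be skipped.
    unchanged = changed_deps([script, data], recorded)

    # Editing the script makes it show up as a changed dependency.
    with open(script, "w") as f:
        f.write("print('v2')\n")
    after_edit = [os.path.basename(p) for p in changed_deps([script, data], recorded)]

print(unchanged, after_edit)
```

An empty list corresponds to the "didn't change, skipping" message, while a non-empty list corresponds to a stage that has to be re-run.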
When we run git status, we see changes to DVC files and we are notified that we still need to add extract-stops.py to GIT:
On branch opengeohub2023-wip
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: .gitignore
new file: dvc.lock
new file: dvc.yaml
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
Untracked files:
(use "git add <file>..." to include in what will be committed)
extract-stops.py
So let's add extract-stops.py
and commit our changes.
git add extract-stops.py
git commit -m "First stage"
Making changes to the analysis script¶
Next, let's remove the noisy print(stops) statement from our script. When we save the changes and run dvc status, we see:
stop-extraction:
changed deps:
modified: extract-stops.py
Now, if we run dvc repro, the stop-extraction stage is executed again:
'data\boat-positions.csv.dvc' didn't change, skipping
Running stage 'stop-extraction':
> python extract-stops.py
Reading data ...
Creating trajectories ...
Extracting stops ...
Saving results ...
SUCCESS! Created output stops.geojson
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
Again, let's commit our changes to GIT:
git add extract-stops.py
git commit -m "Remove print"
Making changes to the input data¶
If we make a change to the CSV file, dvc status
will tell us:
stop-extraction:
changed deps:
modified: data\boat-positions.csv
data\boat-positions.csv.dvc:
changed outs:
modified: data\boat-positions.csv
Since DVC recognizes this change, dvc repro
will know that it has to run the stage with the changed input data:
Verifying data sources in stage: 'data\boat-positions.csv.dvc'
Running stage 'stop-extraction':
> python extract-stops.py
Reading data ...
Creating trajectories ...
Extracting stops ...
Saving results ...
SUCCESS! Created output stops.geojson
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
And git status
will show us that the lock file and the CSV placeholder have been changed:
On branch opengeohub2023-wip
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: data/boat-positions.csv.dvc
modified: dvc.lock
Let's commit this:
git commit -m "Data change"
Our GIT log now looks something like this:
git log --oneline
98a9db2 (HEAD -> opengeohub2023-wip) Data change
b04a4cc Remove print
96b89cb First stage
e3148a2 Update header
6ff13e0 Add dvc
Reverting changes¶
If we now decide to revert the changes in our CSV file and run dvc repro
again, we get:
Verifying data sources in stage: 'data\boat-positions.csv.dvc'
Stage 'stop-extraction' is cached - skipping run, checking out outputs
Updating lock file 'dvc.lock'
Use `dvc push` to send your updates to remote storage.
This means that DVC recognized that these results had already been computed and restored them from its cache instead of re-running the stage. This can save a lot of time when we work with long-running analyses.
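The caching behaviour can be illustrated with a small memoization sketch: results are stored under a key derived from the command and the content of its dependencies, so that returning to a previously seen input finds the stored result. This is a conceptual model under simplifying assumptions and not how DVC stores its cache. The RunCache class, the in-memory dictionary, and the row-counting function are all hypothetical.

```python
import hashlib


class RunCache:
    """Toy run cache keyed by the command and the content of its dependencies."""

    def __init__(self):
        self._store = {}
        self.executions = 0  # counts how often the stage actually ran

    def _key(self, cmd, dep_contents):
        h = hashlib.sha256(cmd.encode())
        for name in sorted(dep_contents):
            h.update(name.encode())
            h.update(hashlib.sha256(dep_contents[name]).digest())
        return h.hexdigest()

    def run(self, cmd, dep_contents, func):
        key = self._key(cmd, dep_contents)
        if key not in self._store:
            # Unknown combination of command and inputs: execute the stage.
            self.executions += 1
            self._store[key] = func(dep_contents)
        return self._store[key]


def count_lines(deps):
    """Stand-in for an expensive analysis: count the lines of the CSV content."""
    return deps["data.csv"].count(b"\n")


cache = RunCache()
original = {"data.csv": b"id,t\n1,a\n"}
modified = {"data.csv": b"id,t\n1,a\n2,b\n"}

results = [
    cache.run("python extract-stops.py", original, count_lines),
    cache.run("python extract-stops.py", modified, count_lines),
    cache.run("python extract-stops.py", original, count_lines),  # reverted input
]
print(results, cache.executions)
```

The third call uses the reverted input and is served from the cache, so the stage function runs only twice for three requests.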
git status
shows that only the lock and placeholder files have changed:
On branch opengeohub2023-wip
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: data/boat-positions.csv.dvc
modified: dvc.lock