/home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/geopandas/_compat.py:111: UserWarning: The Shapely GEOS version (3.10.3-CAPI-1.16.1) is incompatible with the GEOS version PyGEOS was compiled with (3.10.1-CAPI-1.16.0). Conversions between both will be slow.
warnings.warn(
/home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/env/lib/python3.9/site-packages/tqdm/auto.py:22: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
%reload_ext autoreload
%autoreload 2
Load Target Country From DHS data
# Set country-specific variables
country_osm = "east-timor"
ookla_year = 2019
nightlights_year = 2016
dhs_gdf.head()
   DHSCLUST   Wealth Index           DHSID  DHSCC  DHSYEAR  CCFIPS  ADM1FIPS  ADM1FIPSNA  ADM1SALBNA  ADM1SALBCO  ...
0         1   32166.600000  TL201600000001     TL   2016.0      TT      NULL        NULL        NULL        NULL  ...
1         2  -34063.923077  TL201600000002     TL   2016.0      TT      NULL        NULL        NULL        NULL  ...
2         3   39230.590909  TL201600000003     TL   2016.0      TT      NULL        NULL        NULL        NULL  ...
3         4  -82140.227273  TL201600000004     TL   2016.0      TT      NULL        NULL        NULL        NULL  ...
4         5  -56203.423077  TL201600000005     TL   2016.0      TT      NULL        NULL        NULL        NULL  ...

   URBAN_RURA     LATNUM     LONGNUM  ALT_GPS  ALT_DEM  DATUM   F21   F22   F23  geometry
0           R  -8.712016  125.567381   9999.0   1005.0  WGS84  None  None  None  POLYGON ((125.55646 -8.70122, 125.57830 -8.701...
1           R  -8.730226  125.590219   9999.0   1342.0  WGS84  None  None  None  POLYGON ((125.57930 -8.71943, 125.60114 -8.719...
2           R  -8.741340  125.556399   9999.0   1060.0  WGS84  None  None  None  POLYGON ((125.54548 -8.73055, 125.56732 -8.730...
3           R  -8.811291  125.535161   9999.0   1986.0  WGS84  None  None  None  POLYGON ((125.52424 -8.80050, 125.54608 -8.800...
4           R  -8.791590  125.473219   9999.0   1491.0  WGS84  None  None  None  POLYGON ((125.46230 -8.78080, 125.48414 -8.780...
5 rows × 25 columns
Set up Data Access
# Instantiate data managers for Ookla and OSM
# This auto-caches requested data in RAM, so next fetches of the data are faster.
osm_data_manager = OsmDataManager(cache_dir=settings.ROOT_DIR/"data/data_cache")
ookla_data_manager = OoklaDataManager(cache_dir=settings.ROOT_DIR/"data/data_cache")
# Log-in using EOG credentials
username = os.environ.get('EOG_USER', None)
username = username if username is not None else input('Username?')
password = os.environ.get('EOG_PASSWORD', None)
password = password if password is not None else getpass.getpass('Password?')

# set save_token to True so that access token gets stored in ~/.eog_creds/eog_access_token
access_token = nightlights.get_eog_access_token(username, password, save_token=True)
2023-01-31 14:00:38.706 | INFO | povertymapping.nightlights:get_eog_access_token:48 - Saving access_token to ~/.eog_creds/eog_access_token
2023-01-31 14:00:38.707 | INFO | povertymapping.nightlights:get_eog_access_token:56 - Adding access token to environmentt var EOG_ACCESS_TOKEN
Generate Base Features
If this is your first time running this notebook for this specific area, expect a long runtime for the following cell, as it will download and cache the following datasets from the internet:
OpenStreetMap Data from Geofabrik
Ookla Internet Speed Data
VIIRS nighttime lights data from the Earth Observation Group (EOG)
On subsequent runs, the runtime will be much faster as the data is already stored in your filesystem.
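The download-then-cache behaviour described above can be sketched as follows. This is a minimal illustration only, not the `povertymapping` implementation: `cached_fetch`, `fake_download`, the hashed file name, and the sample CSV contents are all hypothetical names and values introduced for this example.

```python
import hashlib
import tempfile
from pathlib import Path


def cached_fetch(key: str, cache_dir: Path, download) -> Path:
    """Return the cached file for `key`, calling `download` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the key so arbitrary strings map to safe, fixed-length file names.
    path = cache_dir / (hashlib.sha256(key.encode()).hexdigest() + ".csv")
    if not path.exists():
        # Slow path: only taken the first time this key is requested.
        path.write_bytes(download(key))
    return path


calls = []


def fake_download(key: str) -> bytes:
    # Hypothetical stand-in for a real network request; records each call.
    calls.append(key)
    return b"tile,avg_d_kbps\n1,1000\n"


cache_dir = Path(tempfile.mkdtemp())
first = cached_fetch("fixed-2019", cache_dir, fake_download)   # downloads
second = cached_fetch("fixed-2019", cache_dir, fake_download)  # served from disk
print(first == second, len(calls))
```

The second call returns the same path without invoking the download function again, which is why repeated runs of the feature-generation cell are expected to be much faster than the first.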
%%time
country_data = dhs_gdf.copy()

# Add in OSM features
country_data = osm.add_osm_poi_features(country_data, country_osm, osm_data_manager)
country_data = osm.add_osm_road_features(country_data, country_osm, osm_data_manager)

# Add in Ookla features
country_data = ookla.add_ookla_features(country_data, 'fixed', ookla_year, ookla_data_manager)
country_data = ookla.add_ookla_features(country_data, 'mobile', ookla_year, ookla_data_manager)

# Add in the nighttime lights features
country_data = nightlights.generate_nightlights_feature(country_data, str(nightlights_year))
2023-01-31 14:00:38.851 | INFO | povertymapping.osm:download_osm_country_data:187 - OSM Data: Cached data available for east-timor at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/east-timor? True
2023-01-31 14:00:38.852 | DEBUG | povertymapping.osm:load_pois:149 - OSM POIs for east-timor being loaded from /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/east-timor/gis_osm_pois_free_1.shp
2023-01-31 14:00:42.117 | INFO | povertymapping.osm:download_osm_country_data:187 - OSM Data: Cached data available for east-timor at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/east-timor? True
2023-01-31 14:00:42.118 | DEBUG | povertymapping.osm:load_roads:168 - OSM Roads for east-timor being loaded from /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/east-timor/gis_osm_roads_free_1.shp
2023-01-31 14:00:43.655 | DEBUG | povertymapping.ookla:load_type_year_data:68 - Contents of data cache: []
2023-01-31 14:00:43.657 | INFO | povertymapping.ookla:load_type_year_data:83 - Cached data available at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/ookla/processed/206a0323fa0e80f82339b66d0c859b4a.csv? True
2023-01-31 14:00:43.658 | DEBUG | povertymapping.ookla:load_type_year_data:88 - Processed Ookla data for aoi, fixed 2019 (key: 206a0323fa0e80f82339b66d0c859b4a) found in filesystem. Loading in cache.
2023-01-31 14:00:43.871 | DEBUG | povertymapping.ookla:load_type_year_data:68 - Contents of data cache: ['206a0323fa0e80f82339b66d0c859b4a']
2023-01-31 14:00:43.873 | INFO | povertymapping.ookla:load_type_year_data:83 - Cached data available at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/ookla/processed/209c2544788b8e2bdf4db4685c50e26d.csv? True
2023-01-31 14:00:43.874 | DEBUG | povertymapping.ookla:load_type_year_data:88 - Processed Ookla data for aoi, mobile 2019 (key: 209c2544788b8e2bdf4db4685c50e26d) found in filesystem. Loading in cache.
2023-01-31 14:00:44.047 | INFO | povertymapping.nightlights:get_clipped_raster:414 - Retrieving clipped raster file /home/jc_tm/.geowrangler/nightlights/clip/b0d0551dd5a67c8eada595334f2655ed.tif
CPU times: user 8.07 s, sys: 127 ms, total: 8.2 s
Wall time: 8.27 s
# Split train/test data into features and labels
# For labels, we just select the target label column
labels = country_data[[label_col]]

# For features, drop all columns from the input country geometries
# If you need the cluster data, refer to country_data / country_test
input_dhs_cols = dhs_gdf.columns
features = country_data.drop(input_dhs_cols, axis=1)
features.shape, labels.shape
((455, 61), (455, 1))
# Clean features
# For now, just impute nans with 0
# TODO: Implement other cleaning steps
features = features.fillna(0)
Base Features List
The features can be grouped by source dataset:
OSM
<poi_type>_count: number of points of interest (POIs) of the specified type in the cluster
e.g. atm_count: number of ATMs in the cluster
poi_count: total number of POIs of all types in the cluster
<poi_type>_nearest: distance to the nearest POI of the specified type
e.g. atm_nearest: distance from the cluster to the nearest ATM
OSM POI types included: atm, bank, bus_stations, cafe, charging_station, courthouse, dentist (clinic), fast_food, fire_station, food_court, fuel (gas station), hospital, library, marketplace, pharmacy, police, post_box, post_office, restaurant, social_facility, supermarket, townhall, road
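To make the count and nearest-distance features concrete, here is a rough pure-Python sketch. It is not how `osm.add_osm_poi_features` computes them: the bounding-box cluster, the centroid-based haversine distance in metres, the sample coordinates, and the helper names `haversine_m` and `poi_features` are all assumptions made for illustration.

```python
import math


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points."""
    r = 6_371_000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def poi_features(bbox, pois, poi_type):
    """Compute <poi_type>_count and <poi_type>_nearest for one cluster.

    bbox: (min_lon, min_lat, max_lon, max_lat) of the cluster (assumed shape).
    pois: iterable of (type, lon, lat) tuples.
    """
    min_lon, min_lat, max_lon, max_lat = bbox
    c_lon, c_lat = (min_lon + max_lon) / 2, (min_lat + max_lat) / 2
    same_type = [(lon, lat) for t, lon, lat in pois if t == poi_type]
    # Count only the POIs that fall inside the cluster's bounding box.
    count = sum(
        min_lon <= lon <= max_lon and min_lat <= lat <= max_lat
        for lon, lat in same_type
    )
    # Nearest distance is measured from the centroid; NaN if no such POI exists.
    nearest = min(
        (haversine_m(c_lon, c_lat, lon, lat) for lon, lat in same_type),
        default=float("nan"),
    )
    return {f"{poi_type}_count": count, f"{poi_type}_nearest": nearest}


bbox = (125.55, -8.72, 125.58, -8.70)
pois = [("atm", 125.56, -8.71), ("atm", 125.70, -8.80), ("bank", 125.57, -8.71)]
atm = poi_features(bbox, pois, "atm")
hospital = poi_features(bbox, pois, "hospital")
print(atm, hospital)
```

A type with no POIs anywhere yields a count of 0 and a NaN distance, which is the kind of missing value that the notebook's cleaning step replaces with 0.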
Ookla
The network metric features follow this naming convention:
Performing 5-fold CV...
<generator object _RepeatedSplits.split at 0x7fc72dca7cf0>
Instantiate model
For now, we train a simple random forest model.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=train_test_seed, verbose=0)
model
RandomForestRegressor(random_state=42)
Evaluate model training using cross-validation
We evaluate the model's generalizability by training it over different train/test splits.
Ideally, for R^2:
- We want a high mean: this means that we achieve high model performance across the different train/test splits.
- We want a low standard deviation (std): this means that the model performance is stable over multiple training repetitions.
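As a rough illustration of how the mean and standard deviation of R^2 can be obtained, the sketch below runs repeated 5-fold cross-validation with scikit-learn on synthetic data. The data, the number of repeats, and the reduced `n_estimators` are assumptions chosen to keep the example small; they are not the notebook's actual settings or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the features/labels used in the notebook.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 5))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=120)

# 5 folds repeated twice gives 10 train/test splits in total.
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# One R^2 score per split; the model is refit from scratch on each split.
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")
mean_r2, std_r2 = scores.mean(), scores.std()
print(f"R^2 over {len(scores)} splits: mean={mean_r2:.3f}, std={std_r2:.3f}")
```

A high mean combined with a small standard deviation suggests that the score does not depend heavily on which clusters happen to land in the test fold.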
For training the final model, we train on all the available data.
model.fit(features.values, labels.values.ravel())
RandomForestRegressor(random_state=42)