%reload_ext autoreload
%autoreload 2
Load Target Country From DHS data
# Set country-specific variables
country_osm = "cambodia"
ookla_year = 2019
nightlights_year = 2014
dhs_gdf.head()
| | DHSCLUST | Wealth Index | DHSID | DHSCC | DHSYEAR | CCFIPS | ADM1FIPS | ADM1FIPSNA | ADM1SALBNA | ADM1SALBCO | ... | DHSREGCO | DHSREGNA | SOURCE | URBAN_RURA | LATNUM | LONGNUM | ALT_GPS | ALT_DEM | DATUM | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -7443.192308 | KH201400000001 | KH | 2014.0 | CB | NULL | NULL | NULL | NULL | ... | 1.0 | banteay mean chey | CEN | R | 13.518676 | 103.028394 | 9999.0 | 11.0 | WGS84 | POLYGON ((103.01729 13.52947, 103.03949 13.529... |
| 1 | 2 | 2622.678571 | KH201400000002 | KH | 2014.0 | CB | NULL | NULL | NULL | NULL | ... | 1.0 | banteay mean chey | CEN | R | 13.398398 | 102.953852 | 9999.0 | 23.0 | WGS84 | POLYGON ((102.94276 13.40919, 102.96495 13.409... |
| 2 | 3 | 22167.920000 | KH201400000003 | KH | 2014.0 | CB | NULL | NULL | NULL | NULL | ... | 1.0 | banteay mean chey | CEN | R | 13.503451 | 102.996001 | 9999.0 | 13.0 | WGS84 | POLYGON ((102.98490 13.51424, 103.00710 13.514... |
| 3 | 4 | 32241.826087 | KH201400000004 | KH | 2014.0 | CB | NULL | NULL | NULL | NULL | ... | 1.0 | banteay mean chey | CEN | U | 13.549399 | 103.071416 | 9999.0 | 14.0 | WGS84 | POLYGON ((103.06032 13.56019, 103.08252 13.560... |
| 4 | 5 | 154111.500000 | KH201400000005 | KH | 2014.0 | CB | NULL | NULL | NULL | NULL | ... | 1.0 | banteay mean chey | CEN | U | 13.538865 | 103.028993 | 9999.0 | 15.0 | WGS84 | POLYGON ((103.01789 13.54966, 103.04009 13.549... |
5 rows × 22 columns
Set up Data Access
# Instantiate data managers for Ookla and OSM
# This auto-caches requested data in RAM, so next fetches of the data are faster.
osm_data_manager = OsmDataManager(cache_dir=settings.ROOT_DIR/"data/data_cache")
ookla_data_manager = OoklaDataManager(cache_dir=settings.ROOT_DIR/"data/data_cache")

# Log-in using EOG credentials
username = os.environ.get('EOG_USER', None)
username = username if username is not None else input('Username?')
password = os.environ.get('EOG_PASSWORD', None)
password = password if password is not None else getpass.getpass('Password?')

# set save_token to True so that access token gets stored in ~/.eog_creds/eog_access_token
access_token = nightlights.get_eog_access_token(username, password, save_token=True)
2023-01-31 14:00:53.085 | INFO | povertymapping.nightlights:get_eog_access_token:48 - Saving access_token to ~/.eog_creds/eog_access_token
2023-01-31 14:00:53.086 | INFO | povertymapping.nightlights:get_eog_access_token:56 - Adding access token to environmentt var EOG_ACCESS_TOKEN
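The login cell above reads each credential from an environment variable and only prompts when the variable is unset. A minimal sketch of that pattern as a reusable helper follows; `get_credential` and the `EXAMPLE_USER` variable are hypothetical names used for illustration and are not part of the `povertymapping` package.

```python
import getpass
import os


def get_credential(env_var, prompt, secret=False):
    """Return the value of env_var, prompting interactively only if it is unset."""
    value = os.environ.get(env_var)
    if value is not None:
        return value
    # getpass hides typed input, which suits passwords; input() echoes it.
    return getpass.getpass(prompt) if secret else input(prompt)


# Demonstration only: set a hypothetical variable so that no prompt appears.
os.environ["EXAMPLE_USER"] = "demo-user"
username = get_credential("EXAMPLE_USER", "Username? ")
```

Setting the variables in the shell before starting Jupyter lets the notebook run unattended, while the prompt remains available as a fallback.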
Generate Base Features
If this is your first time running this notebook for this specific area, expect a long runtime for the following cell, as it will download and cache the following datasets from the internet:
OpenStreetMap Data from Geofabrik
Ookla Internet Speed Data
VIIRS nighttime lights data from NASA EOG
On subsequent runs, the runtime will be much faster as the data is already stored in your filesystem.
%%time
country_data = dhs_gdf.copy()

# Add in OSM features
country_data = osm.add_osm_poi_features(country_data, country_osm, osm_data_manager)
country_data = osm.add_osm_road_features(country_data, country_osm, osm_data_manager)

# Add in Ookla features
country_data = ookla.add_ookla_features(country_data, 'fixed', ookla_year, ookla_data_manager)
country_data = ookla.add_ookla_features(country_data, 'mobile', ookla_year, ookla_data_manager)

# Add in the nighttime lights features
country_data = nightlights.generate_nightlights_feature(country_data, str(nightlights_year))
2023-01-31 14:00:53.415 | INFO | povertymapping.osm:download_osm_country_data:187 - OSM Data: Cached data available for cambodia at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/cambodia? True
2023-01-31 14:00:53.421 | DEBUG | povertymapping.osm:load_pois:149 - OSM POIs for cambodia being loaded from /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/cambodia/gis_osm_pois_free_1.shp
2023-01-31 14:01:00.722 | INFO | povertymapping.osm:download_osm_country_data:187 - OSM Data: Cached data available for cambodia at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/cambodia? True
2023-01-31 14:01:00.723 | DEBUG | povertymapping.osm:load_roads:168 - OSM Roads for cambodia being loaded from /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/cambodia/gis_osm_roads_free_1.shp
2023-01-31 14:01:31.359 | DEBUG | povertymapping.ookla:load_type_year_data:68 - Contents of data cache: []
2023-01-31 14:01:31.365 | INFO | povertymapping.ookla:load_type_year_data:83 - Cached data available at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/ookla/processed/37f570ebc130cb44f9dba877fbda74e2.csv? True
2023-01-31 14:01:31.368 | DEBUG | povertymapping.ookla:load_type_year_data:88 - Processed Ookla data for aoi, fixed 2019 (key: 37f570ebc130cb44f9dba877fbda74e2) found in filesystem. Loading in cache.
2023-01-31 14:01:32.434 | DEBUG | povertymapping.ookla:load_type_year_data:68 - Contents of data cache: ['37f570ebc130cb44f9dba877fbda74e2']
2023-01-31 14:01:32.437 | INFO | povertymapping.ookla:load_type_year_data:83 - Cached data available at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/ookla/processed/1128a917060f7bb88c0a6260ed457091.csv? True
2023-01-31 14:01:32.440 | DEBUG | povertymapping.ookla:load_type_year_data:88 - Processed Ookla data for aoi, mobile 2019 (key: 1128a917060f7bb88c0a6260ed457091) found in filesystem. Loading in cache.
2023-01-31 14:01:33.358 | INFO | povertymapping.nightlights:get_clipped_raster:414 - Retrieving clipped raster file /home/jc_tm/.geowrangler/nightlights/clip/4791e78094ba7e323fd5814b3f094a84.tif
CPU times: user 48 s, sys: 1.64 s, total: 49.7 s
Wall time: 50.6 s
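The speed-up on later runs comes from reusing files already stored on disk, as the log messages above suggest. The sketch below illustrates a generic filesystem cache keyed by a hash of the request parameters. It is an assumption about the general pattern, not the actual implementation of `OsmDataManager` or `OoklaDataManager`; `cached_fetch` and `slow_download` are hypothetical names.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_fetch(params, fetch_fn, cache_dir):
    """Return (data, cache_hit), calling fetch_fn only when no cached file exists."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Derive a stable key from the request parameters.
    key = hashlib.md5(json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text()), True
    data = fetch_fn(params)
    cache_file.write_text(json.dumps(data))
    return data, False


calls = []


def slow_download(params):
    """Hypothetical stand-in for an expensive download."""
    calls.append(params)
    return {"type": params["type"], "year": params["year"], "rows": [1, 2, 3]}


with tempfile.TemporaryDirectory() as tmp:
    request = {"type": "fixed", "year": 2019}
    first, first_hit = cached_fetch(request, slow_download, tmp)
    second, second_hit = cached_fetch(request, slow_download, tmp)
```

The second call returns the stored result without invoking the download function again, which is why only the first run for a given area is slow.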
# Split train/test data into features and labels
# For labels, we just select the target label column
labels = country_data[[label_col]]

# For features, drop all columns from the input country geometries
# If you need the cluster data, refer to country_data / country_test
input_dhs_cols = dhs_gdf.columns
features = country_data.drop(input_dhs_cols, axis=1)
features.shape, labels.shape
((611, 61), (611, 1))
# Clean features
# For now, just impute nans with 0
# TODO: Implement other cleaning steps
features = features.fillna(0)
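To make the split and the zero-imputation step concrete, here is a small self-contained sketch on toy data. The frames `dhs_df` and `country_df` are hypothetical stand-ins for `dhs_gdf` and `country_data`, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dhs_gdf: the columns that came with the survey data.
dhs_df = pd.DataFrame(
    {"DHSCLUST": [1, 2, 3], "Wealth Index": [-7443.19, 2622.68, 22167.92]}
)

# Hypothetical stand-in for country_data: input columns plus generated features.
country_df = dhs_df.assign(
    atm_count=[0.0, 2.0, np.nan],
    atm_nearest=[1500.0, np.nan, 320.0],
)

label_col = "Wealth Index"
labels = country_df[[label_col]]

# Dropping every input column leaves only the generated features.
features = country_df.drop(columns=dhs_df.columns)

# Count missing values before imputing, then replace them with 0.
n_missing = int(features.isna().sum().sum())
features = features.fillna(0)
```

Note that imputing with 0 is ambiguous for distance features such as `atm_nearest`, where 0 would otherwise mean "very close"; this is one reason the cell above flags further cleaning as a TODO.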
Base Features List
The features can be grouped by source dataset:
OSM
<poi_type>_count: number of points of interest (POIs) of the specified type in the cluster
e.g. atm_count: number of ATMs in the cluster
poi_count: total number of POIs of all types in the cluster
<poi_type>_nearest: distance to the nearest POI of the specified type
e.g. atm_nearest: distance from the cluster to the nearest ATM
OSM POI types included: atm, bank, bus_stations, cafe, charging_station, courthouse, dentist (clinic), fast_food, fire_station, food_court, fuel (gas station), hospital, library, marketplace, pharmacy, police, post_box, post_office, restaurant, social_facility, supermarket, townhall, road
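As a rough illustration of what the `_count` and `_nearest` features measure, the sketch below computes both for a single rectangular area using plain planar coordinates. It is a simplification under stated assumptions: the real features are computed on the cluster polygons shown earlier, the distance here is taken from the area's center in coordinate units, and `poi_features` is a hypothetical helper rather than a function from `povertymapping`.

```python
import math


def poi_features(area, pois, poi_type):
    """Compute <poi_type>_count and <poi_type>_nearest for one rectangular area.

    area: (min_x, min_y, max_x, max_y) bounding box of the cluster (assumption).
    pois: list of (type, x, y) tuples.
    """
    min_x, min_y, max_x, max_y = area
    center_x, center_y = (min_x + max_x) / 2, (min_y + max_y) / 2
    same_type = [(x, y) for t, x, y in pois if t == poi_type]
    # Count POIs of this type that fall inside the bounding box.
    count = sum(
        1 for x, y in same_type if min_x <= x <= max_x and min_y <= y <= max_y
    )
    # Distance from the area's center to the closest POI of this type,
    # or infinity when no POI of this type exists.
    nearest = min(
        (math.hypot(x - center_x, y - center_y) for x, y in same_type),
        default=math.inf,
    )
    return {f"{poi_type}_count": count, f"{poi_type}_nearest": nearest}


pois = [("atm", 1.0, 1.0), ("atm", 5.0, 1.0), ("bank", 1.5, 1.0)]
atm = poi_features((0.0, 0.0, 2.0, 2.0), pois, "atm")
bank = poi_features((0.0, 0.0, 2.0, 2.0), pois, "bank")
school = poi_features((0.0, 0.0, 2.0, 2.0), pois, "school")
```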
Ookla
The network metric features use the following naming convention:
Performing 5-fold CV...
<generator object _RepeatedSplits.split at 0x7fa2dc7525f0>
Instantiate model
For now, we will train a simple random forest model.
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=train_test_seed, verbose=0)
model
RandomForestRegressor(random_state=42)
Evaluate model training using cross-validation
We evaluate the model’s generalizability when training over different train/test splits.
Ideally, for R^2:
- We want a high mean: this means that we achieve high model performance across the different train/test splits.
- We want a low standard deviation (std): this means that the model performance is stable over multiple training repetitions.
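A minimal sketch of how the mean and standard deviation of R^2 can be obtained with scikit-learn is shown below. It runs on synthetic data from `make_regression` rather than the survey features, and it assumes a repeated 5-fold split with two repetitions; the notebook's own evaluation helper and its settings are not shown in this section and may differ.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic regression data standing in for the real features and labels.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

model = RandomForestRegressor(n_estimators=20, random_state=42)

# 5 folds repeated 2 times gives 10 train/test splits (assumed settings).
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

r2_mean, r2_std = scores.mean(), scores.std()
print(f"R^2 mean: {r2_mean:.3f}, std: {r2_std:.3f}")
```

A high mean with a low standard deviation across the ten scores would indicate a model that performs well and consistently, whichever split it is trained on.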
For training the final model, we train on all the available data.
model.fit(features.values, labels.values.ravel())
RandomForestRegressor(random_state=42)