%reload_ext autoreload
%autoreload 2
Load Target Country From DHS data
# Set country-specific variables
country_osm = 'philippines'
ookla_year = 2019
nightlights_year = 2017
dhs_gdf.head()
|   | DHSCLUST | Wealth Index | DHSID | DHSCC | DHSYEAR | CCFIPS | ADM1FIPS | ADM1FIPSNA | ADM1SALBNA | ADM1SALBCO | ... | DHSREGCO | DHSREGNA | SOURCE | URBAN_RURA | LATNUM | LONGNUM | ALT_GPS | ALT_DEM | DATUM | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -31881.608696 | PH201700000001 | PH | 2017.0 | NULL | NULL | NULL | NULL | NULL | ... | 15.0 | ARMM | GPS | R | 6.674652 | 122.109807 | 9999.0 | 10.0 | WGS84 | POLYGON ((122.09894 6.68544, 122.12067 6.68544... |
| 1 | 2 | -2855.375000 | PH201700000002 | PH | 2017.0 | NULL | NULL | NULL | NULL | NULL | ... | 15.0 | ARMM | GPS | R | 6.662256 | 122.132027 | 9999.0 | 5.0 | WGS84 | POLYGON ((122.12116 6.67305, 122.14289 6.67305... |
| 2 | 3 | -57647.047619 | PH201700000003 | PH | 2017.0 | NULL | NULL | NULL | NULL | NULL | ... | 15.0 | ARMM | GPS | R | 6.621822 | 122.179496 | 9999.0 | 47.0 | WGS84 | POLYGON ((122.16863 6.63261, 122.19036 6.63261... |
| 3 | 4 | -54952.666667 | PH201700000004 | PH | 2017.0 | NULL | NULL | NULL | NULL | NULL | ... | 15.0 | ARMM | GPS | R | 6.485298 | 122.137965 | 9999.0 | 366.0 | WGS84 | POLYGON ((122.12710 6.49609, 122.14883 6.49609... |
| 4 | 6 | -80701.695652 | PH201700000006 | PH | 2017.0 | NULL | NULL | NULL | NULL | NULL | ... | 15.0 | ARMM | GPS | R | 6.629457 | 121.916094 | 9999.0 | 151.0 | WGS84 | POLYGON ((121.90523 6.64025, 121.92696 6.64025... |

5 rows × 22 columns
Set up Data Access
# Instantiate data managers for Ookla and OSM
# This auto-caches requested data in RAM, so next fetches of the data are faster.
osm_data_manager = OsmDataManager(cache_dir=settings.ROOT_DIR/"data/data_cache")
ookla_data_manager = OoklaDataManager(cache_dir=settings.ROOT_DIR/"data/data_cache")
# Log in using EOG credentials
username = os.environ.get('EOG_USER', None)
username = username if username is not None else input('Username?')
password = os.environ.get('EOG_PASSWORD', None)
password = password if password is not None else getpass.getpass('Password?')

# Set save_token to True so that the access token gets stored in ~/.eog_creds/eog_access_token
access_token = nightlights.get_eog_access_token(username, password, save_token=True)
2023-01-31 14:01:05.870 | INFO | povertymapping.nightlights:get_eog_access_token:48 - Saving access_token to ~/.eog_creds/eog_access_token
2023-01-31 14:01:05.873 | INFO | povertymapping.nightlights:get_eog_access_token:56 - Adding access token to environmentt var EOG_ACCESS_TOKEN
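The credential lookup in the cell above follows a common pattern: read the value from an environment variable first, and prompt interactively only when it is unset. A minimal sketch of that pattern is shown here; `read_credential` and the `DEMO_USER` variable are hypothetical names used for illustration and are not part of `povertymapping`.

```python
import os
from typing import Callable


def read_credential(env_var: str, prompt: Callable[[], str]) -> str:
    """Return the value of env_var, or call prompt() when it is unset."""
    value = os.environ.get(env_var)
    return value if value is not None else prompt()


# DEMO_USER is a made-up variable name. In the notebook, the prompts would be
# input('Username?') and getpass.getpass('Password?').
os.environ["DEMO_USER"] = "alice"
print(read_credential("DEMO_USER", lambda: "prompted"))  # prints "alice"

os.environ.pop("DEMO_USER")
print(read_credential("DEMO_USER", lambda: "prompted"))  # prints "prompted"
```

Passing the prompt as a callable means it is only invoked when the environment variable is missing, so non-interactive runs never block on input.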
Generate Base Features
If this is your first time running this notebook for this specific area, expect a long runtime for the following cell, as it will download and cache the following datasets from the internet:
OpenStreetMap Data from Geofabrik
Ookla Internet Speed Data
VIIRS nighttime lights data from the Earth Observation Group (EOG)
On subsequent runs, the runtime will be much faster as the data is already stored in your filesystem.
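The download-once, reuse-later behavior described above can be sketched as a small filesystem cache. This is only an illustration under stated assumptions: `cached_fetch` is a hypothetical helper, and keying files by an MD5 hash of the request is an assumption, not a description of how `OsmDataManager` or `OoklaDataManager` work internally.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List


def cached_fetch(cache_dir: Path, request: str, fetch: Callable[[], bytes]) -> bytes:
    """Return cached bytes for request, calling fetch() only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: the cache file name is the MD5 hex digest of the request string.
    key = hashlib.md5(request.encode("utf-8")).hexdigest()
    path = cache_dir / key
    if path.exists():
        return path.read_bytes()
    data = fetch()
    path.write_bytes(data)
    return data


calls: List[int] = []


def slow_download() -> bytes:
    """Stand-in for an expensive download; records each invocation."""
    calls.append(1)
    return b"payload"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_fetch(Path(tmp), "ookla-fixed-2019", slow_download)
    second = cached_fetch(Path(tmp), "ookla-fixed-2019", slow_download)

print(first == second, len(calls))  # prints "True 1"
```

The second call returns the same bytes without invoking the download again, which is why subsequent notebook runs are much faster.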
%%time
country_data = dhs_gdf.copy()

# Add in OSM features
country_data = osm.add_osm_poi_features(country_data, country_osm, osm_data_manager)
country_data = osm.add_osm_road_features(country_data, country_osm, osm_data_manager)

# Add in Ookla features
country_data = ookla.add_ookla_features(country_data, 'fixed', ookla_year, ookla_data_manager)
country_data = ookla.add_ookla_features(country_data, 'mobile', ookla_year, ookla_data_manager)

# Add in the nighttime lights features
country_data = nightlights.generate_nightlights_feature(country_data, '2017')
2023-01-31 14:01:06.394 | INFO | povertymapping.osm:download_osm_country_data:187 - OSM Data: Cached data available for philippines at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/philippines? True
2023-01-31 14:01:06.395 | DEBUG | povertymapping.osm:load_pois:149 - OSM POIs for philippines being loaded from /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/philippines/gis_osm_pois_free_1.shp
2023-01-31 14:01:42.655 | INFO | povertymapping.osm:download_osm_country_data:187 - OSM Data: Cached data available for philippines at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/philippines? True
2023-01-31 14:01:42.659 | DEBUG | povertymapping.osm:load_roads:168 - OSM Roads for philippines being loaded from /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/philippines/gis_osm_roads_free_1.shp
2023-01-31 14:04:49.021 | DEBUG | povertymapping.ookla:load_type_year_data:68 - Contents of data cache: []
2023-01-31 14:04:50.124 | INFO | povertymapping.ookla:load_type_year_data:83 - Cached data available at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/ookla/processed/2f858b388182d50703550c8ef9d321df.csv? True
2023-01-31 14:04:50.126 | DEBUG | povertymapping.ookla:load_type_year_data:88 - Processed Ookla data for aoi, fixed 2019 (key: 2f858b388182d50703550c8ef9d321df) found in filesystem. Loading in cache.
2023-01-31 14:04:51.440 | DEBUG | povertymapping.ookla:load_type_year_data:68 - Contents of data cache: ['2f858b388182d50703550c8ef9d321df']
2023-01-31 14:04:51.442 | INFO | povertymapping.ookla:load_type_year_data:83 - Cached data available at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/ookla/processed/5a45dc45080a935951e6c2b6c0052b13.csv? True
2023-01-31 14:04:51.443 | DEBUG | povertymapping.ookla:load_type_year_data:88 - Processed Ookla data for aoi, mobile 2019 (key: 5a45dc45080a935951e6c2b6c0052b13) found in filesystem. Loading in cache.
2023-01-31 14:04:52.525 | INFO | povertymapping.nightlights:get_clipped_raster:414 - Retrieving clipped raster file /home/jc_tm/.geowrangler/nightlights/clip/295bf47ce6753c7f06ab79012b769f2a.tif
CPU times: user 3min 19s, sys: 24.5 s, total: 3min 43s
Wall time: 3min 53s
# Split train/test data into features and labels
# For labels, we just select the target label column
labels = country_data[[label_col]]

# For features, drop all columns from the input country geometries
# If you need the cluster data, refer to country_data / country_test
input_dhs_cols = dhs_gdf.columns
features = country_data.drop(input_dhs_cols, axis=1)

features.shape, labels.shape
((1213, 61), (1213, 1))
# Clean features
# For now, just impute NaNs with 0
# TODO: Implement other cleaning steps
features = features.fillna(0)
Base Features List
The features can be grouped by source dataset.
OSM
<poi_type>_count: number of points of interest (POIs) of a specified type in that area
ex. atm_count: number of ATMs in the cluster
poi_count: total number of POIs of all types in the cluster
<poi_type>_nearest: distance to the nearest POI of the specified type
ex. atm_nearest: distance from that cluster to the nearest ATM
OSM POI types included: atm, bank, bus_stations, cafe, charging_station, courthouse, dentist (clinic), fast_food, fire_station, food_court, fuel (gas station), hospital, library, marketplace, pharmacy, police, post_box, post_office, restaurant, social_facility, supermarket, townhall, road
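To make the naming convention concrete, here is a rough sketch of how `<poi_type>_count`, `poi_count`, and `<poi_type>_nearest` could be derived for a single cluster. It is a simplification under stated assumptions: the cluster is approximated as a circle around its centroid rather than the polygon geometry the notebook uses, distances are great-circle distances in metres, and `poi_features` and `haversine_m` are hypothetical helpers, not the `povertymapping.osm` implementation. The POI coordinates are invented for the example.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_m(a: Point, b: Point) -> float:
    """Great-circle distance between two points, in metres."""
    radius = 6_371_000.0  # mean Earth radius
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(h))


def poi_features(center: Point, radius_m: float,
                 pois: List[Tuple[str, Point]]) -> Dict[str, float]:
    """Compute per-type counts, the overall count, and nearest distances."""
    features: Dict[str, float] = {"poi_count": 0}
    for poi_type, location in pois:
        distance = haversine_m(center, location)
        # Nearest distance considers every POI, inside the cluster or not.
        nearest_key = f"{poi_type}_nearest"
        features[nearest_key] = min(features.get(nearest_key, math.inf), distance)
        # Counts only include POIs that fall within the cluster radius.
        count_key = f"{poi_type}_count"
        features.setdefault(count_key, 0)
        if distance <= radius_m:
            features[count_key] += 1
            features["poi_count"] += 1
    return features


# Illustrative inputs: one cluster centroid and three made-up POIs.
cluster_center = (6.674652, 122.109807)
example_pois = [
    ("atm", (6.6750, 122.1100)),   # a few tens of metres away
    ("atm", (6.8000, 122.3000)),   # far outside the cluster
    ("bank", (6.6760, 122.1110)),  # a couple of hundred metres away
]
result = poi_features(cluster_center, 2000.0, example_pois)
print(result["atm_count"], result["bank_count"], result["poi_count"])  # prints "1 1 2"
```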
Ookla
The network metrics features use the following naming convention:
Performing 5-fold CV...
<generator object _RepeatedSplits.split at 0x7f3d8cc795f0>
Instantiate model
For now, we will train a simple random forest model.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=train_test_seed, verbose=0)
model
RandomForestRegressor(random_state=42)
Evaluate model training using cross-validation
We evaluate the model's generalizability by training it over different train/test splits.
Ideally, for R^2:
We want a high mean: This means that we achieve a high model performance over the different train/test splits
We want a low standard deviation (std): This means that the model performance is stable over multiple training repetitions
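The two criteria above can be made concrete with a small sketch: compute R^2 = 1 - SS_res / SS_tot on each held-out fold, then summarize the fold scores by their mean and standard deviation. The fold values below are made-up numbers for illustration only, not results from this notebook, and `r2_score` here is a plain-Python stand-in written for this example.

```python
import statistics
from typing import List, Sequence, Tuple


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Each tuple holds (true labels, predictions) for one held-out fold.
# These values are illustrative only.
fold_results: List[Tuple[List[float], List[float]]] = [
    ([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]),
    ([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.0]),
    ([1.0, 3.0, 5.0, 7.0], [1.0, 3.5, 4.5, 7.5]),
]

scores = [r2_score(y_true, y_pred) for y_true, y_pred in fold_results]
mean_r2 = statistics.fmean(scores)
std_r2 = statistics.stdev(scores)  # sample standard deviation across folds

print(f"R^2 per fold: {[round(s, 3) for s in scores]}")
print(f"mean R^2 = {mean_r2:.3f}, std = {std_r2:.3f}")
```

A high mean with a low standard deviation indicates that performance does not depend strongly on which clusters land in the training split.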
For training the final model, we train on all the available data.
model.fit(features.values, labels.values.ravel())
RandomForestRegressor(random_state=42)