%reload_ext autoreload
%autoreload 2
Load Target Country From DHS data
# Set country-specific variables
country_osm = "myanmar"
ookla_year = 2019
nightlights_year = 2015
dhs_gdf.head()
| | DHSCLUST | Wealth Index | DHSID | DHSCC | DHSYEAR | CCFIPS | ADM1FIPS | ADM1FIPSNA | ADM1SALBNA | ADM1SALBCO | ... | DHSREGCO | DHSREGNA | SOURCE | URBAN_RURA | LATNUM | LONGNUM | ALT_GPS | ALT_DEM | DATUM | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | -52232.000000 | MM201500000001 | MM | 2015.0 | BM | NULL | NULL | NULL | NULL | ... | 8.0 | Magway | GPS | R | 20.058637 | 95.360081 | 107.0 | 105.0 | WGS84 | POLYGON ((95.34859 20.06943, 95.37157 20.06943... |
| 1 | 2 | 130773.724138 | MM201500000002 | MM | 2015.0 | BM | NULL | NULL | NULL | NULL | ... | 12.0 | Yangon | GPS | U | 17.112398 | 96.045616 | 31.0 | 14.0 | WGS84 | POLYGON ((96.03432 17.12319, 96.05691 17.12319... |
| 2 | 3 | -4955.000000 | MM201500000003 | MM | 2015.0 | BM | NULL | NULL | NULL | NULL | ... | 10.0 | Mon | GPS | R | 16.507664 | 97.364236 | 4.0 | 2.0 | WGS84 | POLYGON ((97.35298 16.51846, 97.37549 16.51846... |
| 3 | 4 | 47824.103448 | MM201500000004 | MM | 2015.0 | BM | NULL | NULL | NULL | NULL | ... | 1.0 | Kachin | GPS | U | 26.684519 | 96.283879 | 193.0 | 210.0 | WGS84 | POLYGON ((96.27180 26.69531, 96.29596 26.69531... |
| 4 | 5 | 9434.482759 | MM201500000005 | MM | 2015.0 | BM | NULL | NULL | NULL | NULL | ... | 12.0 | Yangon | GPS | R | 16.866059 | 96.053499 | 12.0 | 5.0 | WGS84 | POLYGON ((96.04222 16.87685, 96.06478 16.87685... |
5 rows × 22 columns
Set up Data Access
# Instantiate data managers for Ookla and OSM
# This auto-caches requested data in RAM, so next fetches of the data are faster.
osm_data_manager = OsmDataManager(cache_dir=settings.ROOT_DIR/"data/data_cache")
ookla_data_manager = OoklaDataManager(cache_dir=settings.ROOT_DIR/"data/data_cache")
# Log-in using EOG credentials
username = os.environ.get('EOG_USER', None)
username = username if username is not None else input('Username?')
password = os.environ.get('EOG_PASSWORD', None)
password = password if password is not None else getpass.getpass('Password?')

# set save_token to True so that access token gets stored in ~/.eog_creds/eog_access_token
access_token = nightlights.get_eog_access_token(username, password, save_token=True)
2023-01-31 13:58:23.564 | INFO | povertymapping.nightlights:get_eog_access_token:48 - Saving access_token to ~/.eog_creds/eog_access_token
2023-01-31 13:58:23.570 | INFO | povertymapping.nightlights:get_eog_access_token:56 - Adding access token to environmentt var EOG_ACCESS_TOKEN
Generate Base Features
If this is your first time running this notebook for this specific area, expect a long runtime for the following cell, as it will download and cache the following datasets from the internet:
OpenStreetMap Data from Geofabrik
Ookla Internet Speed Data
VIIRS nighttime lights data from NASA EOG
On subsequent runs, the runtime will be much faster as the data is already stored in your filesystem.
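The download-once, reuse-afterwards behaviour described above can be sketched as follows. This is a minimal illustration and not the actual logic of `OsmDataManager` or `OoklaDataManager`: the helper `fetch_with_cache`, the stand-in `fake_download`, and the file name `example.bin` are all hypothetical names introduced for this example.

```python
import tempfile
from pathlib import Path


def fetch_with_cache(key, cache_dir, download):
    """Return the cached file for `key`, calling `download` only on a cache miss.

    Hypothetical helper for illustration; the povertymapping data managers
    may organise their caches differently.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached_path = cache_dir / key
    if not cached_path.exists():
        # First run: take the slow path and store the result on disk.
        cached_path.write_bytes(download())
    return cached_path


calls = []


def fake_download():
    # Stand-in for a slow network request; records each invocation.
    calls.append(1)
    return b"example payload"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_with_cache("example.bin", tmp, fake_download)
    second = fetch_with_cache("example.bin", tmp, fake_download)
    cached_bytes = second.read_bytes()

print(f"download called {len(calls)} time(s)")
```

The second call finds the file already on disk and skips the download, which is why later runs of the feature-generation cell are expected to be faster.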
%%time
country_data = dhs_gdf.copy()

# Add in OSM features
country_data = osm.add_osm_poi_features(country_data, country_osm, osm_data_manager)
country_data = osm.add_osm_road_features(country_data, country_osm, osm_data_manager)

# Add in Ookla features
country_data = ookla.add_ookla_features(country_data, 'fixed', ookla_year, ookla_data_manager)
country_data = ookla.add_ookla_features(country_data, 'mobile', ookla_year, ookla_data_manager)

# Add in the nighttime lights features
country_data = nightlights.generate_nightlights_feature(country_data, str(nightlights_year))
2023-01-31 13:58:23.766 | INFO | povertymapping.osm:download_osm_country_data:187 - OSM Data: Cached data available for myanmar at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/myanmar? True
2023-01-31 13:58:23.767 | DEBUG | povertymapping.osm:load_pois:149 - OSM POIs for myanmar being loaded from /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/myanmar/gis_osm_pois_free_1.shp
2023-01-31 13:58:33.290 | INFO | povertymapping.osm:download_osm_country_data:187 - OSM Data: Cached data available for myanmar at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/myanmar? True
2023-01-31 13:58:33.297 | DEBUG | povertymapping.osm:load_roads:168 - OSM Roads for myanmar being loaded from /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/osm/myanmar/gis_osm_roads_free_1.shp
2023-01-31 14:00:36.799 | DEBUG | povertymapping.ookla:load_type_year_data:68 - Contents of data cache: []
2023-01-31 14:00:36.802 | INFO | povertymapping.ookla:load_type_year_data:83 - Cached data available at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/ookla/processed/d72ec7e4d144b750e1c0950ecad081e0.csv? True
2023-01-31 14:00:36.803 | DEBUG | povertymapping.ookla:load_type_year_data:88 - Processed Ookla data for aoi, fixed 2019 (key: d72ec7e4d144b750e1c0950ecad081e0) found in filesystem. Loading in cache.
2023-01-31 14:00:37.049 | DEBUG | povertymapping.ookla:load_type_year_data:68 - Contents of data cache: ['d72ec7e4d144b750e1c0950ecad081e0']
2023-01-31 14:00:37.050 | INFO | povertymapping.ookla:load_type_year_data:83 - Cached data available at /home/jc_tm/project_repos/unicef-ai4d-poverty-mapping/notebooks/2023-01-17-initial-model-ph-mm-tl-kh/../../data/data_cache/ookla/processed/2aff65fdf8072457cba0d42873b7a9c2.csv? True
2023-01-31 14:00:37.051 | DEBUG | povertymapping.ookla:load_type_year_data:88 - Processed Ookla data for aoi, mobile 2019 (key: 2aff65fdf8072457cba0d42873b7a9c2) found in filesystem. Loading in cache.
2023-01-31 14:00:37.265 | INFO | povertymapping.nightlights:get_clipped_raster:414 - Retrieving clipped raster file /home/jc_tm/.geowrangler/nightlights/clip/7a58f067614b6685cd9bb62d4d15a249.tif
CPU times: user 1min 40s, sys: 19.6 s, total: 2min
Wall time: 2min 16s
# Split train/test data into features and labels

# For labels, we just select the target label column
labels = country_data[[label_col]]

# For features, drop all columns from the input country geometries
# If you need the cluster data, refer to country_data / country_test
input_dhs_cols = dhs_gdf.columns
features = country_data.drop(input_dhs_cols, axis=1)

features.shape, labels.shape
((441, 61), (441, 1))
# Clean features
# For now, just impute nans with 0
# TODO: Implement other cleaning steps
features = features.fillna(0)
Base Features List
The features can be grouped by source dataset.
OSM
<poi_type>_count: number of points of interest (POIs) of the specified type in that area
e.g. atm_count: number of ATMs in the cluster
poi_count: total number of POIs of all types in the cluster
<poi_type>_nearest: distance to the nearest POI of the specified type
e.g. atm_nearest: distance from that cluster to the nearest ATM
OSM POI types included: atm, bank, bus_stations, cafe, charging_station, courthouse, dentist (clinic), fast_food, fire_station, food_court, fuel (gas station), hospital, library, marketplace, pharmacy, police, post_box, post_office, restaurant, social_facility, supermarket, townhall, road
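To make the two OSM feature families concrete, here is a rough sketch of how a count and a nearest-distance feature could be computed for one cluster. It is not the implementation behind `osm.add_osm_poi_features`: the function names, the bounding-box test, measuring distance from the cluster centroid in metres, and the POI records themselves are all assumptions made up for this illustration.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    radius_m = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def poi_features(bounds, centroid, pois, poi_type):
    """Count POIs of `poi_type` inside `bounds` and find the nearest one.

    bounds is (min_lon, min_lat, max_lon, max_lat); centroid is (lat, lon).
    Assumption: distance is measured from the centroid, in metres.
    """
    min_lon, min_lat, max_lon, max_lat = bounds
    of_type = [p for p in pois if p["type"] == poi_type]
    count = sum(
        1
        for p in of_type
        if min_lon <= p["lon"] <= max_lon and min_lat <= p["lat"] <= max_lat
    )
    nearest = min(
        (haversine_m(centroid[0], centroid[1], p["lat"], p["lon"]) for p in of_type),
        default=float("nan"),  # no POI of this type anywhere
    )
    return {f"{poi_type}_count": count, f"{poi_type}_nearest": nearest}


# Made-up POIs and a made-up bounding box around one cluster centroid.
pois = [
    {"type": "atm", "lat": 17.11, "lon": 96.04},
    {"type": "atm", "lat": 17.50, "lon": 96.50},
    {"type": "bank", "lat": 17.112, "lon": 96.045},
]
bounds = (96.03, 17.10, 96.06, 17.13)
centroid = (17.112398, 96.045616)

atm = poi_features(bounds, centroid, pois, "atm")
hospital = poi_features(bounds, centroid, pois, "hospital")
print(atm)
print(hospital)
```

A type with no POIs in the data gets a count of 0 and a missing nearest distance, which is one way such features could end up as NaN before the imputation step.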
Ookla
The network metric features follow this naming convention:
Performing 5-fold CV...
<generator object _RepeatedSplits.split at 0x7eff0a1c7040>
Instantiate model
For now, we train a simple random forest model.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, random_state=train_test_seed, verbose=0)
model
RandomForestRegressor(random_state=42)
Evaluate model training using cross-validation
We evaluate the model's generalizability by training over different train/test splits.
Ideally for R^2:
- We want a high mean: the model performs well across the different train/test splits.
- We want a low standard deviation (std): the model's performance is stable across training repetitions.
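As an illustration of how the mean and standard deviation of R^2 can be obtained, the sketch below runs repeated 5-fold cross-validation with scikit-learn. It uses synthetic data from `make_regression` rather than the notebook's features, and the number of repeats (2) and the smaller forest (50 trees) are assumptions chosen to keep the example fast; the notebook's own evaluation code may differ.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in for the real features and labels.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

# 5 folds, repeated twice with different shuffles -> 10 train/test splits.
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
model = RandomForestRegressor(n_estimators=50, random_state=42)

# One R^2 score per split; the model is refit from scratch on each split.
r2_scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(f"R^2 mean: {np.mean(r2_scores):.3f}, std: {np.std(r2_scores):.3f}")
```

A high mean with a low standard deviation across the ten scores would indicate performance that is both good and stable across splits.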
For training the final model, we train on all the available data.
model.fit(features.values, labels.values.ravel())
RandomForestRegressor(random_state=42)