geoparsepy.geo_preprocess_lib module
Pre-processing module to create and populate SQL tables for focus areas from an installed OpenStreetMap database. Pre-processed SQL tables are required by geoparsepy.geo_parse_lib.
global focus area spec (required before any focus area is preprocessed, as it contains super region information):
{
'focus_area_id' : 'global_cities',
}
focus area spec (OSM ID’s):
{
'focus_area_id' : 'gr_paris',
'osmid': ['relation',71525, 'relation', 87002, 'relation', 86999, 'relation', 86985, 'relation', 85802, 'relation', 91776, 'relation', 72258, 'relation', 72148, 'relation', 31340, 'relation', 72020, 'relation', 85527, 'relation', 59321, 'relation', 37027, 'relation', 37026, 'relation', 104479, 'relation', 105122, 'relation', 105748, 'relation', 104868, 'relation', 108318, 'relation', 129550, 'relation', 130544, 'relation', 67826, 'relation', 67685, 'relation', 87628, 'relation', 87922],
'admin_lookup_table' : 'global_cities_admin',
}
focus area spec (name and super regions):
{
'focus_area_id' : 'southampton',
'admin': ['southampton','south east england', 'united kingdom'],
'admin_lookup_table' : 'global_cities_admin',
}
focus area spec (radius from point):
{
'focus_area_id' : 'oxford',
'radius': ['POINT(-1.3176274 51.7503955)', 0.2],
'admin_lookup_table' : 'global_cities_admin',
}
focus area spec (geom):
{
'focus_area_id' : 'solent_shipping_lane',
'geom': 'POLYGON(( ... ))',
'admin_lookup_table' : 'global_cities_admin',
}
focus area spec (places with only a point within a set of super regions):
{
'focus_area_id' : 'uk_places',
'place_types': ['suburb','quarter','neighbourhood','town','village','island','islet','archipelago'],
'parent_osmid': ['relation',62149],
'admin_lookup_table' : 'global_cities_admin',
}
{
'focus_area_id' : 'europe_places',
'place_types': ['suburb','quarter','neighbourhood','town','village','island','islet','archipelago'],
'parent_osmid': ['relation',62149, 'relation', 1403916, 'relation', 1311341, 'relation', 295480, 'relation', 9407, 'relation', 50046, 'relation', 2323309, 'relation', 51477, 'relation', 365331, 'relation', 51701, 'relation', 2978650, 'relation', 54224, 'relation', 52822, 'relation', 62273, 'relation', 51684, 'relation', 79510, 'relation', 72594, 'relation', 72596, 'relation', 49715, 'relation', 59065, 'relation', 60189, 'relation', 16239, 'relation', 218657, 'relation', 21335, 'relation', 14296, 'relation', 214885, 'relation', 1741311, 'relation', 2528142, 'relation', 90689, 'relation', 58974, 'relation', 60199, 'relation', 53296, 'relation', 2088990, 'relation', 53292, 'relation', 53293, 'relation', 186382, 'relation', 192307, 'relation', 174737, 'relation', 307787, 'relation', 1124039, 'relation', 365307, 'relation', 1278736],
'admin_lookup_table' : 'global_cities_admin',
}
note: uk, france, spain, portugal, andorra, denmark, holland, germany, italy, switzerland, norway, finland, sweden, ireland, czech republic, estonia, latvia, lithuania, poland, belarus, russia, austria, slovenia, hungary, slovakia, croatia, serbia, bosnia and herzegovina, romania, moldova, ukraine, montenegro, kosovo, albania, macedonia, bulgaria, greece, turkey, cyprus, monaco, malta, gibraltar
focus area spec (global places with only a point):
{
'focus_area_id' : 'global_places',
'place_types': ['suburb','quarter','neighbourhood','town','village','island','islet','archipelago'],
'admin_lookup_table' : 'global_cities_admin',
}
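Each spec identifies itself with 'focus_area_id' and (except for the global area) one area-definition key. A minimal sketch of classifying a spec by that key; the helper focus_area_kind() is illustrative and not part of the library:

```python
def focus_area_kind(spec):
    """Classify a focus area spec by the area-definition key it uses.

    Returns one of 'osmid', 'admin', 'radius', 'geom', 'place_types',
    or 'global' when only 'focus_area_id' is given (e.g. 'global_cities').
    """
    if 'focus_area_id' not in spec:
        raise ValueError("spec must contain 'focus_area_id'")
    for key in ('osmid', 'admin', 'radius', 'geom', 'place_types'):
        if key in spec:
            return key
    return 'global'

spec = {
    'focus_area_id': 'oxford',
    'radius': ['POINT(-1.3176274 51.7503955)', 0.2],
    'admin_lookup_table': 'global_cities_admin',
}
print(focus_area_kind(spec))  # radius
```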
- Performance notes:
Preprocessing time depends on the number of points/lines/polygons (N) and the number of admin regions (M): each location is cross-checked against the admin regions for containment, so there are N*M calculations to perform.
global_cities (300,000 polygons) will take about 3 days to compute on a 2 GHz CPU (it only needs to be computed once).
uk_places (20,000 points) takes 20 mins.
france_places (40,000 points) takes 2 hours.
europe_places (420,000 points) is made up of each European country. This allows a sequential country-by-country calculation, which reduces the size of M and is vastly quicker than global_places; it takes 7 hours to compute.
north_america_places (USA and Canada) (52,000 points) takes 1 hour to compute.
au_nz_places (Australia and New Zealand) (8,000 points) takes 3 mins to compute.
- Alternative approach is OSM geocoding using Nominatim:
Nominatim is a Linux GPL script/lib used by OSM to create an indexed planet OSM dataset that can then be queried for geocoding lookups (e.g. name -> OSM ID).
An HTTP service is available via OSM, but this does not scale to large numbers of locations (throughput is too slow).
A local deployment of Nominatim is possible, but indexing is built into the planet OSM deployment scripts (i.e. takes 10+ days to run) and is apparently very complex and difficult to get working.
-
geoparsepy.geo_preprocess_lib.cache_preprocessed_locations(database_handle, location_ids, schema, geospatial_config, timeout_statement=600, timeout_overall=600, spatial_filter=None)

Load a set of previously preprocessed locations from the database. The cached location structure returned is used by geoparsepy.geo_parse_lib functions.
- Parameters
database_handle (PostgresqlHandler.PostgresqlHandler) – handle to database object
location_ids (dict) – for each table, the range of location IDs to load. A -1 for min or max indicates no min or max bound. Use a range of (-1,-1) for all locations. e.g. { ‘focus1_admin’ : [nStartID,nEndID], … }
schema (str) – Postgres schema name under which tables will be created
geospatial_config (dict) – config object returned from a call to geoparsepy.geo_parse_lib.get_geoparse_config()
timeout_statement (int) – number of seconds to allow each SQL statement
timeout_overall (int) – number of seconds total to allow each SQL statement (including retries)
spatial_filter (str) – OpenGIS spatial polygon to use as a spatial filter for returned locations (ST_Intersects used). This is optional and can be None.
- Returns
list structure containing location information to be used by geoparsepy.geo_parse_lib functions e.g. [loc_id,name,(osm_id,…),(admin_id,…),ST_AsText(geom),{tag:value},(variant_phrase, …)]
- Return type
list
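The returned rows follow the documented structure, so downstream code can index them by their primary key. A sketch using a hypothetical cached row (the values below are illustrative only, not real data):

```python
# load all rows from one table: (-1, -1) means no min/max bound
location_ids = {'focus1_admin': [-1, -1]}

def index_locations_by_id(cached_rows):
    """Index cached location rows by loc_id (first element of each row).

    Row layout per the docs:
    [loc_id, name, (osm_id, ...), (admin_id, ...), geom_wkt, {tag: value}, (variant, ...)]
    """
    return {row[0]: row for row in cached_rows}

# hypothetical row for illustration
row = [1001, 'Southampton', (-127864,), (1001,),
       'POINT(-1.4 50.9)', {'place': 'city'}, ('soton',)]
locations = index_locations_by_id([row])
print(locations[1001][1])  # Southampton
```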
-
geoparsepy.geo_preprocess_lib.create_preprocessing_tables(focus_area, database_handle, schema, table_point='point', table_line='line', table_poly='poly', table_admin='admin', timeout_statement=60, timeout_overall=60, delete_contents=False, logger=None)

Create preprocessing tables for a new focus area (if they do not already exist).
- Parameters
focus_area (dict) – focus area description to create tables for.
database_handle (PostgresqlHandler.PostgresqlHandler) – handle to database object
schema (str) – Postgres schema name under which tables will be created
table_point (str) – Table name suffix to append to focus area ID to make point table
table_line (str) – Table name suffix to append to focus area ID to make line table
table_poly (str) – Table name suffix to append to focus area ID to make polygon table
table_admin (str) – Table name suffix to append to focus area ID to make admin region table
timeout_statement (int) – number of seconds to allow each SQL statement
timeout_overall (int) – number of seconds total to allow each SQL statement (including retries)
delete_contents (bool) – if True the contents of any existing tables will be deleted
logger (logging.Logger) – logger object (optional)
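Since each table name is the focus area ID plus a suffix, the names this function will create can be computed up front. A sketch; the helper below is illustrative and not part of the library:

```python
def preprocessing_table_names(focus_area_id,
                              table_point='point', table_line='line',
                              table_poly='poly', table_admin='admin'):
    """Return the four table names formed by appending each suffix
    to the focus area ID, keyed by suffix."""
    return {suffix: '{}_{}'.format(focus_area_id, suffix)
            for suffix in (table_point, table_line, table_poly, table_admin)}

names = preprocessing_table_names('southampton')
print(names['point'])  # southampton_point
```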
-
geoparsepy.geo_preprocess_lib.execute_preprocessing_focus_area(focus_area, database_pool, schema, table_point='point', table_line='line', table_poly='poly', table_admin='admin', timeout_statement=1209600, timeout_overall=1209600, logger=None)

Populates preprocessing tables with locations for a new focus area. If the area has already been precomputed it will immediately return with the location ID range. Small areas (e.g. a town) take a few minutes to compute; large areas (e.g. a city) take up to 1 hour. The database tables will be SQL locked whilst the data import is occurring, to ensure new locations are allocated contiguous primary key IDs; this allows a simple result range (start ID and end ID) to be returned rather than a long list of IDs for each new location.
This function is process safe but not thread safe, as it uses an internal Queue() object to coordinate multiple database cursors and efficiently insert data using parallel SQL inserts.
The global area (referenced using the focus area key ‘admin_lookup_table’) must already be created and available in the same schema as this new focus area.
If the table already exists and already has locations then no import will occur (the area is assumed to have already been preprocessed).
- Parameters
focus_area (dict) – focus area description
database_pool (dict) – pool of PostgresqlHandler.PostgresqlHandler objects for each of the 4 table types used to execute parallel SQL imports e.g. { ‘admin’ : PostgresqlHandler.PostgresqlHandler, … }
schema (str) – Postgres schema name under which tables will be created
table_point (str) – Table name suffix to append to focus area ID to make point table
table_line (str) – Table name suffix to append to focus area ID to make line table
table_poly (str) – Table name suffix to append to focus area ID to make polygon table
table_admin (str) – Table name suffix to append to focus area ID to make admin region table
timeout_statement (int) – number of seconds to allow each SQL statement
timeout_overall (int) – number of seconds total to allow each SQL statement (including retries)
logger (logging.Logger) – logger object (optional)
- Returns
for each table a tuple with the new location primary key ID range e.g. { ‘focus1_admin’ : (nLocIDStart, nLocIDEnd), … }
- Return type
dict
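The returned ranges are inclusive, so the number of newly inserted rows per table follows directly. A sketch using a hypothetical return value (the table names and IDs below are illustrative only):

```python
# hypothetical return value from execute_preprocessing_focus_area()
new_ids = {
    'southampton_point': (5000, 5320),
    'southampton_line': (2000, 2150),
    'southampton_poly': (900, 1400),
    'southampton_admin': (10, 14),
}

# convert each inclusive (start, end) ID range to a row count
counts = {table: end - start + 1 for table, (start, end) in new_ids.items()}
print(counts['southampton_admin'])  # 5
```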
-
geoparsepy.geo_preprocess_lib.execute_preprocessing_global(global_area, database_pool, schema, table_point='point', table_line='line', table_poly='poly', table_admin='admin', timeout_statement=2000000, timeout_overall=2000000, logger=None)

Populates preprocessing tables with locations for a global area (up to admin level 6). This can take a very long time (about 3 days on a 2 GHz CPU), so the default database timeout values are very large. The database tables will be SQL locked whilst the data import is occurring, to ensure new locations are allocated contiguous primary key IDs; this allows a simple result range (start ID and end ID) to be returned rather than a long list of IDs for each new location.
This function is process safe but not thread safe, as it uses an internal Queue() object to coordinate multiple database cursors and efficiently insert data using parallel SQL inserts.
A global area must be created before individual focus areas can be created, since it contains, for each admin region, all of its super regions (calculated based on PostGIS geometry).
If the table already exists and already has locations then no import will occur (the area is assumed to have already been preprocessed).
- Parameters
global_area (dict) – global area description
database_pool (dict) – pool of PostgresqlHandler.PostgresqlHandler objects for each of the 4 table types used to execute parallel SQL imports e.g. { ‘admin’ : PostgresqlHandler.PostgresqlHandler, … }
schema (str) – Postgres schema name under which tables will be created
table_point (str) – Table name suffix to append to focus area ID to make point table
table_line (str) – Table name suffix to append to focus area ID to make line table
table_poly (str) – Table name suffix to append to focus area ID to make polygon table
table_admin (str) – Table name suffix to append to focus area ID to make admin region table
timeout_statement (int) – number of seconds to allow each SQL statement
timeout_overall (int) – number of seconds total to allow each SQL statement (including retries)
logger (logging.Logger) – logger object (optional)
- Returns
for each table a tuple with the new location primary key ID range e.g. { ‘global_admin’ : (nLocIDStart, nLocIDEnd), … }
- Return type
dict
-
geoparsepy.geo_preprocess_lib.get_point_for_osmid(osm_id, osm_type, geom, database_handle, timeout_statement=60, timeout_overall=60)

Calculate a representative point for an OSM ID. If it is a relation, look up its admin centre, capital or label node members. If it is a node, use that point. Otherwise use the shapely lib to calculate a centroid point for its polygon. This function requires a database connection to the OSM planet database.
- Parameters
osm_id (int) – OSM ID of a relation, way or node
osm_type (str) – OSM Type as returned by geoparsepy.geo_parse_lib.calc_OSM_type()
geom (str) – OpenGIS geometry for OSM ID
database_handle (PostgresqlHandler.PostgresqlHandler) – handle to database object connected to OSM database (with tables public.planet_osm_point and public.planet_osm_rels available)
timeout_statement (int) – number of seconds to allow each SQL statement
timeout_overall (int) – number of seconds total to allow each SQL statement (including retries)
- Returns
coordinates (long, lat) for a point that represents this OSM ID well (e.g. admin centre or centroid)
- Return type
tuple
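For polygons the function falls back to a centroid computed via shapely; the same idea can be sketched standalone with the shoelace formula. This is an illustration of the centroid fallback, not the library's implementation:

```python
def polygon_centroid(coords):
    """Centroid of a simple polygon given as a closed ring [(lon, lat), ...]
    (first and last coordinates equal), via the shoelace formula."""
    area2 = 0.0  # twice the signed area
    cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3.0 * area2), cy / (3.0 * area2))

# unit square: centroid is (0.5, 0.5)
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
print(polygon_centroid(square))  # (0.5, 0.5)
```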
-
geoparsepy.geo_preprocess_lib.thread_worker_sql_insert(sql_list, sql_result_queue_dict, logger=None)

Internal thread callback function to preprocess a new area. This function should never be called independently of the execute_preprocessing_global() or execute_preprocessing_focus_area() functions.
All SQL imports in the list must be for the same database, as that database is locked for the entire transaction to ensure inserted location IDs are sequential.
- Parameters
sql_list (list) – list of info required to execute sql query e.g. [ (‘focus1_admin’,sql_query_str,tuple_sql_data,logger,database_handle,timeout_statement,timeout_overall), … ]
sql_result_queue_dict (dict) – dict of Queue() instances for each table type to store SQL results in a thread safe way e.g. { ‘focus1_admin’ : Queue(), … }