geoparsepy.geo_preprocess_lib module

Pre-processing module to create and populate SQL tables for focus areas from an installed OpenStreetMap database. Pre-processed SQL tables are required by geoparsepy.geo_parse_lib.

global focus area spec (required before any focus area is preprocessed, as it contains super region information):

{
        'focus_area_id' : 'global_cities',
}

focus area spec (OSM IDs):

{
        'focus_area_id' : 'gr_paris',
        'osmid': ['relation',71525, 'relation', 87002, 'relation', 86999, 'relation', 86985, 'relation', 85802, 'relation', 91776, 'relation', 72258, 'relation', 72148, 'relation', 31340, 'relation', 72020, 'relation', 85527, 'relation', 59321, 'relation', 37027, 'relation', 37026, 'relation', 104479, 'relation', 105122, 'relation', 105748, 'relation', 104868, 'relation', 108318, 'relation', 129550, 'relation', 130544, 'relation', 67826, 'relation', 67685, 'relation', 87628, 'relation', 87922],
        'admin_lookup_table' : 'global_cities_admin',
}

focus area spec (name and super regions):

{
        'focus_area_id' : 'southampton',
        'admin': ['southampton','south east england', 'united kingdom'],
        'admin_lookup_table' : 'global_cities_admin',
}

focus area spec (radius from point):

{
        'focus_area_id' : 'oxford',
        'radius': ['POINT(-1.3176274 51.7503955)', 0.2],
        'admin_lookup_table' : 'global_cities_admin',
}

focus area spec (geom):

{
        'focus_area_id' : 'solent_shipping_lane',
        'geom': 'POLYGON(( ... ))',
        'admin_lookup_table' : 'global_cities_admin',
}

focus area spec (places with only a point within a set of super regions):

{
        'focus_area_id' : 'uk_places',
        'place_types': ['suburb','quarter','neighbourhood','town','village','island','islet','archipelago'],
        'parent_osmid': ['relation',62149],
        'admin_lookup_table' : 'global_cities_admin',
}
{
        'focus_area_id' : 'europe_places',
        'place_types': ['suburb','quarter','neighbourhood','town','village','island','islet','archipelago'],
        'parent_osmid': ['relation',62149, 'relation', 1403916, 'relation', 1311341, 'relation', 295480, 'relation', 9407, 'relation', 50046, 'relation', 2323309, 'relation', 51477, 'relation', 365331, 'relation', 51701, 'relation', 2978650, 'relation', 54224, 'relation', 52822, 'relation', 62273, 'relation', 51684, 'relation', 79510, 'relation', 72594, 'relation', 72596, 'relation', 49715, 'relation', 59065, 'relation', 60189, 'relation', 16239, 'relation', 218657, 'relation', 21335, 'relation', 14296, 'relation', 214885, 'relation', 1741311, 'relation', 2528142, 'relation', 90689, 'relation', 58974, 'relation', 60199, 'relation', 53296, 'relation', 2088990, 'relation', 53292, 'relation', 53293, 'relation', 186382, 'relation', 192307, 'relation', 174737, 'relation', 307787, 'relation', 1124039, 'relation', 365307, 'relation', 1278736],
        'admin_lookup_table' : 'global_cities_admin',
}
note: uk, france, spain, portugal, andorra, denmark, holland, germany, italy, switzerland, norway, finland, sweden, ireland, czech republic, estonia, latvia, lithuania, poland, belarus, russia, austria, slovenia, hungary, slovakia, croatia, serbia, bosnia and herzegovina, romania, moldova, ukraine, montenegro, kosovo, albania, macedonia, bulgaria, greece, turkey, cyprus, monaco, malta, gibraltar

focus area spec (global places with only a point):

{
        'focus_area_id' : 'global_places',
        'place_types': ['suburb','quarter','neighbourhood','town','village','island','islet','archipelago'],
        'admin_lookup_table' : 'global_cities_admin',
}
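
A minimal workflow sketch showing how the spec dicts above are used (the schema name 'mypostgresschema', database_handle and database_pool are illustrative placeholders, assumed to be PostgresqlHandler.PostgresqlHandler connections created elsewhere; table creation for the global area is not shown):

import geoparsepy.geo_preprocess_lib as geo_preprocess_lib

dict_global_spec = {
        'focus_area_id' : 'global_cities',
}

dict_focus_spec = {
        'focus_area_id' : 'southampton',
        'admin': ['southampton','south east england', 'united kingdom'],
        'admin_lookup_table' : 'global_cities_admin',
}

# the global area must be preprocessed first as it provides the global_cities_admin lookup table
geo_preprocess_lib.execute_preprocessing_global( dict_global_spec, database_pool, 'mypostgresschema' )

# then each focus area: create its tables and populate them
geo_preprocess_lib.create_preprocessing_tables( dict_focus_spec, database_handle, 'mypostgresschema' )
geo_preprocess_lib.execute_preprocessing_focus_area( dict_focus_spec, database_pool, 'mypostgresschema' )
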
Performance notes:
  • Preprocessing time is related to the number of points/lines/polygons (N) and the number of admin regions (M). Admin regions are cross-checked for containment, so there are N*M containment calculations to perform.

  • global_cities (300,000 polygons) will take about 3 days to compute on a 2 GHz CPU (it only needs to be done once, of course).

  • uk_places (20,000 points) takes 20 mins.

  • france_places (40,000 points) takes 2 hours.

  • europe_places (420,000 points) is made up of each European country. This allows a sequential country-by-country calculation, which reduces the size of M and is vastly quicker than global_places. It takes 7 hours to compute.

  • north_america_places (usa and canada) (52,000 points) takes 1 hour to compute.

  • au_nz_places (australia and new zealand) (8,000 points) takes 3 mins to compute.

An alternative approach is OSM geocoding using Nominatim:
  • Nominatim is a Linux GPL script/lib used by OSM to create an indexed planet OSM dataset that can then be queried for geocoding (i.e. name -> OSM ID)

  • An HTTP service is available via OSM, but this does not scale to large numbers of locations (throughput is too slow)

  • Local deployment of Nominatim is possible, but indexing is built into the planet OSM deployment scripts (which take 10+ days to run) and is apparently very complex and difficult to get working

  • see http://wiki.openstreetmap.org/wiki/Nominatim

geoparsepy.geo_preprocess_lib.cache_preprocessed_locations(database_handle, location_ids, schema, geospatial_config, timeout_statement=600, timeout_overall=600, spatial_filter=None)[source]

Load a set of previously preprocessed locations from the database. The cached location structure returned is used by geoparsepy.geo_parse_lib functions.

Parameters
  • database_handle (PostgresqlHandler.PostgresqlHandler) – handle to database object

  • location_ids (dict) – for each table the range of locations ids to load. A -1 for min or max indicates no min or max range. Use a range of (-1,-1) for all locations. e.g. { ‘focus1_admin’ : [nStartID,nEndID], … }

  • schema (str) – Postgres schema name under which tables will be created

  • geospatial_config (dict) – config object returned from a call to geoparsepy.geo_parse_lib.get_geoparse_config()

  • timeout_statement (int) – number of seconds to allow each SQL statement

  • timeout_overall (int) – number of seconds total to allow each SQL statement (including retries)

  • spatial_filter (str) – OpenGIS spatial polygon to use as a spatial filter for returned locations (ST_Intersects used). This is optional and can be None.

Returns

list structure containing location information to be used by geoparsepy.geo_parse_lib functions e.g. [loc_id,name,(osm_id,…),(admin_id,…),ST_AsText(geom),{tag:value},(variant_phrase, …)]

Return type

list
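
A usage sketch (illustrative only: database_handle, dict_geospatial_config, the schema name and the focus area table names are assumed to exist; the handle comes from PostgresqlHandler.PostgresqlHandler and the config from geoparsepy.geo_parse_lib.get_geoparse_config()):

import geoparsepy.geo_preprocess_lib as geo_preprocess_lib

# load all locations from the global admin table and a preprocessed focus area
dict_location_ids = {
        'global_cities_admin' : [-1, -1],
        'southampton_admin' : [-1, -1],
        'southampton_point' : [-1, -1],
        'southampton_line' : [-1, -1],
        'southampton_poly' : [-1, -1],
}

cached_locations = geo_preprocess_lib.cache_preprocessed_locations(
        database_handle,
        dict_location_ids,
        'mypostgresschema',
        dict_geospatial_config,
        spatial_filter = None )

# each entry follows [loc_id, name, (osm_id, ...), (admin_id, ...), ST_AsText(geom), {tag: value}, (variant_phrase, ...)]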

geoparsepy.geo_preprocess_lib.create_preprocessing_tables(focus_area, database_handle, schema, table_point='point', table_line='line', table_poly='poly', table_admin='admin', timeout_statement=60, timeout_overall=60, delete_contents=False, logger=None)[source]

Create preprocessing tables for a new focus area (if they do not already exist).

Parameters
  • focus_area (dict) – focus area description to create tables for.

  • database_handle (PostgresqlHandler.PostgresqlHandler) – handle to database object

  • schema (str) – Postgres schema name under which tables will be created

  • table_point (str) – Table name suffix to append to focus area ID to make point table

  • table_line (str) – Table name suffix to append to focus area ID to make line table

  • table_poly (str) – Table name suffix to append to focus area ID to make polygon table

  • table_admin (str) – Table name suffix to append to focus area ID to make admin region table

  • timeout_statement (int) – number of seconds to allow each SQL statement

  • timeout_overall (int) – number of seconds total to allow each SQL statement (including retries)

  • delete_contents (bool) – if True the contents of any existing tables will be deleted

  • logger (logging.Logger) – logger object (optional)
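
A usage sketch (illustrative only: database_handle is assumed to be an open PostgresqlHandler.PostgresqlHandler connection; the schema name is a placeholder):

import logging
import geoparsepy.geo_preprocess_lib as geo_preprocess_lib

logger = logging.getLogger( __name__ )

dict_focus_spec = {
        'focus_area_id' : 'southampton',
        'admin': ['southampton','south east england', 'united kingdom'],
        'admin_lookup_table' : 'global_cities_admin',
}

# creates southampton_point, southampton_line, southampton_poly and southampton_admin
# under the given schema if they do not already exist
geo_preprocess_lib.create_preprocessing_tables(
        dict_focus_spec,
        database_handle,
        'mypostgresschema',
        delete_contents = False,
        logger = logger )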

geoparsepy.geo_preprocess_lib.execute_preprocessing_focus_area(focus_area, database_pool, schema, table_point='point', table_line='line', table_poly='poly', table_admin='admin', timeout_statement=1209600, timeout_overall=1209600, logger=None)[source]

Populates preprocessing tables with locations for a new focus area. If the area has already been precomputed it will immediately return with the location ID range. Small areas (e.g. a town) take a few minutes to compute. Large areas (e.g. a city) take up to 1 hour to compute. The database tables will be SQL locked whilst the data import is occurring to ensure new locations are allocated contiguous primary key IDs, allowing a simple result range (start ID and end ID) to be returned as opposed to a long list of IDs for each new location.

This function is process safe but not thread safe as it uses an internal Queue() object to coordinate multiple database cursors and efficiently insert data using parallel SQL inserts.

The global area (referenced using the focus area key ‘admin_lookup_table’) must be already created and available in the same schema as this new focus area.

If the table already exists and already has locations then no import will occur (assuming the area has already been preprocessed).

Parameters
  • focus_area (dict) – focus area description

  • database_pool (dict) – pool of PostgresqlHandler.PostgresqlHandler objects for each of the 4 table types used to execute parallel SQL imports e.g. { ‘admin’ : PostgresqlHandler.PostgresqlHandler, … }

  • schema (str) – Postgres schema name under which tables will be created

  • table_point (str) – Table name suffix to append to focus area ID to make point table

  • table_line (str) – Table name suffix to append to focus area ID to make line table

  • table_poly (str) – Table name suffix to append to focus area ID to make polygon table

  • table_admin (str) – Table name suffix to append to focus area ID to make admin region table

  • timeout_statement (int) – number of seconds to allow each SQL statement

  • timeout_overall (int) – number of seconds total to allow each SQL statement (including retries)

  • logger (logging.Logger) – logger object (optional)

Returns

for each table a tuple with the new location primary key ID range e.g. { ‘focus1_admin’ : (nLocIDStart, nLocIDEnd), … }

Return type

dict
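
A usage sketch (illustrative only: the four PostgresqlHandler.PostgresqlHandler connections are assumed to be opened elsewhere, and the pool key names follow the parameter example above):

import geoparsepy.geo_preprocess_lib as geo_preprocess_lib

# one database connection per table type so inserts can run in parallel
database_pool = {
        'admin' : db_handle_admin,
        'point' : db_handle_point,
        'line' : db_handle_line,
        'poly' : db_handle_poly,
}

dict_focus_spec = {
        'focus_area_id' : 'southampton',
        'admin': ['southampton','south east england', 'united kingdom'],
        'admin_lookup_table' : 'global_cities_admin',
}

dict_id_ranges = geo_preprocess_lib.execute_preprocessing_focus_area(
        dict_focus_spec,
        database_pool,
        'mypostgresschema' )

# e.g. { 'southampton_admin' : (nLocIDStart, nLocIDEnd), ... }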

geoparsepy.geo_preprocess_lib.execute_preprocessing_global(global_area, database_pool, schema, table_point='point', table_line='line', table_poly='poly', table_admin='admin', timeout_statement=2000000, timeout_overall=2000000, logger=None)[source]

Populates preprocessing tables with locations for a global area (up to admin level 6). This can take a very long time (about 3 days on a 2 GHz CPU) so the default database timeout values are very large (e.g. 2000000 seconds). The database tables will be SQL locked whilst the data import is occurring to ensure new locations are allocated contiguous primary key IDs, allowing a simple result range (start ID and end ID) to be returned as opposed to a long list of IDs for each new location.

This function is process safe but not thread safe as it uses an internal Queue() object to coordinate multiple database cursors and efficiently insert data using parallel SQL inserts.

A global area must be created before individual focus areas can be created, since it contains, for each admin region, all of its super regions, calculated using PostGIS geometry.

If the table already exists and already has locations then no import will occur (assuming the area has already been preprocessed).

Parameters
  • global_area (dict) – global area description

  • database_pool (dict) – pool of PostgresqlHandler.PostgresqlHandler objects for each of the 4 table types used to execute parallel SQL imports e.g. { ‘admin’ : PostgresqlHandler.PostgresqlHandler, … }

  • schema (str) – Postgres schema name under which tables will be created

  • table_point (str) – Table name suffix to append to focus area ID to make point table

  • table_line (str) – Table name suffix to append to focus area ID to make line table

  • table_poly (str) – Table name suffix to append to focus area ID to make polygon table

  • table_admin (str) – Table name suffix to append to focus area ID to make admin region table

  • timeout_statement (int) – number of seconds to allow each SQL statement

  • timeout_overall (int) – number of seconds total to allow each SQL statement (including retries)

  • logger (logging.Logger) – logger object (optional)

Returns

for each table a tuple with the new location primary key ID range e.g. { ‘global_admin’ : (nLocIDStart, nLocIDEnd), … }

Return type

dict
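
A usage sketch (illustrative only: database_pool is the same pool of PostgresqlHandler.PostgresqlHandler connections sketched for execute_preprocessing_focus_area() above):

import geoparsepy.geo_preprocess_lib as geo_preprocess_lib

dict_global_spec = {
        'focus_area_id' : 'global_cities',
}

dict_id_ranges = geo_preprocess_lib.execute_preprocessing_global(
        dict_global_spec,
        database_pool,
        'mypostgresschema' )

# e.g. { 'global_cities_admin' : (nLocIDStart, nLocIDEnd), ... }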

geoparsepy.geo_preprocess_lib.get_point_for_osmid(osm_id, osm_type, geom, database_handle, timeout_statement=60, timeout_overall=60)[source]

Calculate a representative point for an OSM ID. If it is a relation, look up the admin centre, capital or label node members. If it is a node, use that point. Otherwise use the Shapely library to calculate a centroid point for the polygon. This function requires a database connection to the OSM planet database.

Parameters
  • osm_id (int) – OSM ID of a relation, way or node

  • osm_type (str) – OSM Type as returned by geoparsepy.geo_parse_lib.calc_OSM_type()

  • geom (str) – OpenGIS geometry for OSM ID

  • database_handle (PostgresqlHandler.PostgresqlHandler) – handle to database object connected to OSM database (with tables public.planet_osm_point and public.planet_osm_rels available)

  • timeout_statement (int) – number of seconds to allow each SQL statement

  • timeout_overall (int) – number of seconds total to allow each SQL statement (including retries)

Returns

coordinates (long, lat) for a point that represents this OSM ID well (e.g. admin centre or centroid)

Return type

tuple
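
A usage sketch (illustrative only: database_handle is assumed to be connected to a planet OSM database with public.planet_osm_point and public.planet_osm_rels available; the OSM ID, type string and geometry text are placeholders, and in practice the type string should come from geoparsepy.geo_parse_lib.calc_OSM_type()):

import geoparsepy.geo_preprocess_lib as geo_preprocess_lib

# relation 62149 is the United Kingdom in the focus area spec examples above
lon, lat = geo_preprocess_lib.get_point_for_osmid(
        62149,
        'relation',
        'POLYGON(( ... ))',
        database_handle )

# (lon, lat) is a representative point such as an admin centre node or a centroid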

geoparsepy.geo_preprocess_lib.thread_worker_sql_insert(sql_list, sql_result_queue_dict, logger=None)[source]

Internal thread callback function to preprocess a new area. This function should never be called independently of the execute_preprocessing_global() or execute_preprocessing_focus_area() functions.

All SQL imports in the list must be for the same database, as that database is locked for the entire transaction to ensure inserted location IDs are sequential.

Parameters
  • sql_list (list) – list of info required to execute sql query e.g. [ (‘focus1_admin’,sql_query_str,tuple_sql_data,logger,database_handle,timeout_statement,timeout_overall), … ]

  • sql_result_queue_dict (dict) – dict of Queue() instances for each table type to store SQL results in a thread safe way e.g. { ‘focus1_admin’ : Queue(), … }