Jan - 11 - 2007
**************************************************************************************************************************************************************************************************************************************************
*An Introduction to STEM II Data Generation Utilities for Diva*
Data for STEM II was generated with a set of custom software utilities and data generators written in Java for this purpose. They can be found
under org.eclipse.stem.utility\src\generators. These generators were used to create STEM's properties files and maps for all countries in the Diva
data set (http://www.cipotato.org/DIVA/data/DataServer.htm). Data generation involves multiple steps. Before a data generator can run,
a few intermediate steps must be completed by custom utilities, also found in org.eclipse.stem.utility\src\generators. These intermediate
steps vary according to the particular document (i.e. area property file, nodes property file, or map) we want to generate. We can therefore think of data
generation for STEM II as a set of data generation processes, each consisting of a set of steps. Overall, we can identify the
following five data generation processes :
--Generation of Area Properties Files
--Generation of Population Properties Files
--Generation of Nodes Properties Files
--Generation of Names Properties Files
--Map Generation
Each process is explained in detail below. First, however, we describe our data sources and the nature of the data we are dealing with.
**Data Sources for STEM II**
Our main source of data was DIVA-GIS (http://www.cipotato.org/DIVA/data/DataServer.htm), which is a repository for maps and GIS data. All DIVA-GIS
data is free and in the public domain. The data files are in shapefile format (http://en.wikipedia.org/wiki/Shapefile). According to Wikipedia, "The ESRI
Shapefile is a popular geospatial vector data format for geographic information systems software. It is developed and regulated by ESRI as a (mostly) open
specification for data interoperability among ESRI and other software products. A "shapefile" commonly refers to a collection of files with ".shp", ".shx",
".dbf" ..." All shapefiles downloaded from DIVA-GIS are zip compressed. Before getting to data generation, there are two steps to be done :
***Uncompressing***
The first step is to unzip all the downloaded DIVA-GIS shapefiles using a compression tool such as WinZip or 7-Zip.
***Extraction***
The second step is to extract the data from every shapefile into a plain text document. To do this, we run org.eclipse.stem.utility\src\generators\WorldReader.java.
WorldReader takes as an argument a list with all the files we want to extract data from. Such a list looks as follows :
//Taken from org.eclipse.stem.utility\dataMigration\input\world\World.txt
...
#THIS FILE IS THE INPUT TO WorldReader.java
#IT CONSISTS OF PAIRS OF THE FORM : [LOCATION;FILENAME]
C:\diva\Africa\Algeria;C:\diva\Africa\Algeria\Algeria.txt
C:\diva\Africa\Angola;C:\diva\Africa\Angola\Angola.txt
C:\diva\Africa\Benin;C:\diva\Africa\Benin\Benin.txt
C:\diva\Africa\BurkinaFaso;C:\diva\Africa\BurkinaFaso\BurkinaFaso.txt
C:\diva\Africa\Burundi;C:\diva\Africa\Burundi\Burundi.txt
...
The result is a set of plain-text files containing GIS data. We call each of them a Diva file. Next, we explain what a Diva file looks like.
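The input list format shown above is simple enough to parse in a few lines. The following is a minimal sketch only; the class WorldListParser and its Entry type are hypothetical names, not part of the STEM utilities, and WorldReader itself may read the list differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of STEM): reads a World.txt-style list. */
final class WorldListParser {

    /** One [LOCATION;FILENAME] pair from the list. */
    static final class Entry {
        final String location;
        final String outputFile;

        Entry(String location, String outputFile) {
            this.location = location;
            this.outputFile = outputFile;
        }
    }

    /** Parses the lines of the list, skipping blank lines and '#' comments. */
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int sep = line.indexOf(';');
            if (sep < 0) {
                throw new IllegalArgumentException("Expected LOCATION;FILENAME but got: " + line);
            }
            entries.add(new Entry(line.substring(0, sep).trim(),
                                  line.substring(sep + 1).trim()));
        }
        return entries;
    }
}
```

Each returned Entry carries the directory holding the unzipped shapefile and the plain-text file that the extraction should write.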
***Overview of Diva Files***
A Diva file is a plain-text file that contains GIS data and looks as follows :
// Extracted from Argentina.txt
// Sample level 0 data
ARG,ARGENTINA,-0.5559767,-1.0146881,-0.5559825,-1.0147113, ...
// Sample level 1 data
ARG,ARGENTINA,BUENOS AIRES,-0.70200574,-1.0840372, ...
// Sample level 2 data
ARG,ARGENTINA,JUJUY,Santa Catalina,-0.38024858,-1.1561186, ...
In other words, it is just a set of rows, each with location data (e.g. ARG,ARGENTINA,JUJUY,Santa Catalina) followed by the polygon data for that
geographical location (e.g. -0.38024858,-1.1561186,-0.38023695,-1.1561148,-0.38024002,-1.1560922,-0.38024685,-1.1560414, ...). One restriction on
Diva files is that data has to appear in the file ordered by administration level. In other words, all level 0 data must come before
the level 1 data, and all level 2 data must come after both the level 0 and the level 1 data. The data generation utilities are sensitive to this ordering.
A data generator goes through internal transitions based on the administration level of the data it is currently processing.
These transitions must follow the pattern : Level_0 -> Level_1 -> Level_2. If the data does not conform to this rule,
the data generation utilities may behave unexpectedly.
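The ordering rule can be checked before a generator is run. The sketch below is an illustration under one stated assumption: a row's level is inferred from the number of leading non-numeric fields (two for level 0, three for level 1, four for level 2), which only works if no place name parses as a number. DivaLevelCheck is a hypothetical class, not one of the STEM utilities.

```java
import java.util.List;

/** Hypothetical check (not part of STEM) for the Level_0 -> Level_1 -> Level_2 rule. */
final class DivaLevelCheck {

    /** Assumption: level = (number of leading non-numeric fields) - 2. */
    static int levelOf(String row) {
        String[] fields = row.split(",");
        int labels = 0;
        while (labels < fields.length && !isNumber(fields[labels].trim())) {
            labels++;
        }
        return labels - 2;
    }

    static boolean isNumber(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Returns true when the level never decreases from one row to the next. */
    static boolean isOrdered(List<String> rows) {
        int previous = 0;
        for (String row : rows) {
            int level = levelOf(row);
            if (level < previous) {
                return false;
            }
            previous = level;
        }
        return true;
    }
}
```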
**Grouping**
After we uncompress the shapefiles and extract them into plain text documents, we notice that in some files the data for a single administration is scattered throughout
the data file. As an example, the rows of data for Wisconsin do not appear consecutively, but are scattered as follows :
USA,United States, Wisconsin, 0.7917936,-1.5175203 ...
USA,United States,Wisconsin,0.81974524,-1.5800693 ...
USA,United States, New Hampshire,0.79072654,-1.2406557 ...
USA,United States, Michigan,0.7881631,-1.5015621 ...
...
USA,United States, Wisconsin, ...
...
USA,United States, Wisconsin, ...
...
and so on...
We want to group all data that belongs to Wisconsin consecutively. By doing this, we will produce a polygon with a unique ID for Wisconsin instead of multiple
polygons with the same ID. By definition, polygon IDs should be unique. We achieve the grouping by running org.eclipse.stem.utility\src\generators\GroupGenerator.java.
The resulting "grouped" data file will look as follows :
USA,United States,Kentucky,0.6370049,-1.5618098 ...
USA,United States,Wisconsin,0.8216533,-1.5777862 ...
USA,United States,Wisconsin,0.81974524,-1.5800693 ...
USA,United States,Wisconsin,0.8127316,-1.5780942 ...
USA,United States,Wisconsin,0.8175863,-1.580813 ...
USA,United States,Wisconsin,0.7917936,-1.5175203 ...
USA,United States,Wisconsin,0.7864648,-1.5271649 ...
USA,United States,Minnesota,0.8585817,-1.6607387 ...
Running GroupGenerator produces output files with the "_GROUPED" suffix attached. For example, if the original data file for the USA is "USA.txt",
then the corresponding grouped file will be named "USA_GROUPED.txt".
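The grouping idea can be sketched as follows. This is not the GroupGenerator source: RowGrouper is a hypothetical class, the number of label fields is passed in by the caller, and groups are emitted in the order their key is first seen, which is an assumption about the output order. Labels are trimmed because the raw rows above are inconsistent about the space after a comma.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not the STEM GroupGenerator): makes rows with the same labels consecutive. */
final class RowGrouper {

    /** Builds the grouping key from the first labelCount fields, trimmed. */
    static String keyOf(String row, int labelCount) {
        String[] fields = row.split(",", labelCount + 1);
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < Math.min(labelCount, fields.length); i++) {
            if (i > 0) {
                key.append(',');
            }
            key.append(fields[i].trim());
        }
        return key.toString();
    }

    /** Returns the rows reordered so that rows sharing a key are adjacent. */
    static List<String> group(List<String> rows, int labelCount) {
        // LinkedHashMap keeps groups in first-seen order (an assumption about the desired order).
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String row : rows) {
            groups.computeIfAbsent(keyOf(row, labelCount), k -> new ArrayList<>()).add(row);
        }
        List<String> grouped = new ArrayList<>();
        for (List<String> group : groups.values()) {
            grouped.addAll(group);
        }
        return grouped;
    }
}
```

For level 1 rows such as the USA sample, labelCount would be 3 (country code, country name, state).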
NOTE : Throughout this introduction we assume that all the generators and utilities are located under org.eclipse.stem.utility\src\generators\ except where explicitly
mentioned otherwise.
**Lexicographic Sorting**
After we have run GroupGenerator, we run a pair of utilities that lexicographically sort the data files at administration levels 1 and 2.
These utilities are Admin1LexSorter and Admin2LexSorter. Next, we explain both utilities in detail. The files produced by these utilities are located under
org.eclipse.stem.utility\dataMigration\input\sort\<Three letter identifier for country>
***Admin1LexSorter: Lexicographic Sorting of Level 1 Administrations***
The goal of this program is to go through the source data file (e.g. USA.txt or USA_GROUPED.txt), sort it lexicographically, and produce a file where
each row corresponds to a level 1 administration, in lexicographic order. As an example, for Argentina the result looks like this :
1:AR-B:AR:BUENOS AIRES
2:AR-C:AR:BUENOS AIRES D.F.
3:AR-K:AR:CATAMARCA
4:AR-H:AR:CHACO
5:AR-U:AR:CHUBUT
6:AR-X:AR:CORDOBA
7:AR-W:AR:CORRIENTES
8:AR-E:AR:ENTRE RIOS
9:AR-P:AR:FORMOSA
10:AR-Y:AR:JUJUY
11:AR-L:AR:LA PAMPA
12:AR-F:AR:LA RIOJA
13:AR-M:AR:MENDOZA
14:AR-N:AR:MISIONES
15:AR-Q:AR:NEUQUEN
16:AR-R:AR:RIO NEGRO
17:AR-A:AR:SALTA
18:AR-J:AR:SAN JUAN
19:AR-D:AR:SAN LUIS
20:AR-Z:AR:SANTA CRUZ
21:AR-S:AR:SANTA FE
22:AR-G:AR:SANTIAGO DEL ESTERO
23:AR-V:AR:TIERRA DEL FUEGO
24:AR-T:AR:TUCUMAN
The resulting lexicographically sorted file for Argentina is named ARG_ADMIN1_LEX.txt and is found under org.eclipse.stem.utility\dataMigration\input\sort\ARG.
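The row format of the sorted file can be reproduced with a sorted map. The sketch below is illustrative only: Admin1SortSketch is a hypothetical class, and the mapping from administration name to its identifier (e.g. AR-B) is assumed to be supplied by the caller, for instance from the ISO 3166-2 data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch (not the STEM Admin1LexSorter): emits index:ID:country:NAME rows. */
final class Admin1SortSketch {

    /**
     * @param countryCode two-letter country code, e.g. "AR"
     * @param idByName    level 1 name mapped to its identifier (assumed to be known already)
     */
    static List<String> sortedRows(String countryCode, Map<String, String> idByName) {
        List<String> rows = new ArrayList<>();
        int index = 1;
        // TreeMap iterates its keys in natural (lexicographic) String order.
        for (Map.Entry<String, String> e : new TreeMap<>(idByName).entrySet()) {
            rows.add(index + ":" + e.getValue() + ":" + countryCode + ":" + e.getKey());
            index++;
        }
        return rows;
    }
}
```

Natural String ordering places a name before any longer name that it prefixes, which is why BUENOS AIRES precedes BUENOS AIRES D.F. in the sample.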
***Admin2LexSorter: Lexicographic Sorting of Level 2 Administrations***
Similar to Admin1LexSorter, except that now we sort lexicographically based on the concatenation of the level 1 and level 2 administration names. This concatenation has
the effect of sorting all level 2 administrations within their level 1 container. As an example, sorting all level 2 administrations that belong to the province
(level 1 administration) of La Rioja in Argentina gives :
// Extracted from org.eclipse.stem.utility\dataMigration\input\sort\ARG_ADMIN2_LEX.txt
...
1:AR:ARG:ARGENTINA:LA RIOJA:Arauco
2:AR:ARG:ARGENTINA:LA RIOJA:Capital
3:AR:ARG:ARGENTINA:LA RIOJA:Castro Barros
4:AR:ARG:ARGENTINA:LA RIOJA:Chilecito
5:AR:ARG:ARGENTINA:LA RIOJA:Famatina
6:AR:ARG:ARGENTINA:LA RIOJA:General Angel V. Pesaloza
7:AR:ARG:ARGENTINA:LA RIOJA:General Belgrano
8:AR:ARG:ARGENTINA:LA RIOJA:General Juan F. Quiroga
9:AR:ARG:ARGENTINA:LA RIOJA:General Lamadrid
10:AR:ARG:ARGENTINA:LA RIOJA:General Lavalle
11:AR:ARG:ARGENTINA:LA RIOJA:General Ocampo
12:AR:ARG:ARGENTINA:LA RIOJA:General San Martin
13:AR:ARG:ARGENTINA:LA RIOJA:General Sarmiento
14:AR:ARG:ARGENTINA:LA RIOJA:Gobernador Gordillo
15:AR:ARG:ARGENTINA:LA RIOJA:Independencia
16:AR:ARG:ARGENTINA:LA RIOJA:Rosario Vera Penaloza
17:AR:ARG:ARGENTINA:LA RIOJA:San Blas de los Sauces
18:AR:ARG:ARGENTINA:LA RIOJA:Sanagasta
...
Similarly, for the province of Mendoza in Argentina, the sorting produces :
// Extracted from org.eclipse.stem.utility\dataMigration\input\sort\ARG_ADMIN2_LEX.txt
...
1:AR:ARG:ARGENTINA:MENDOZA:Capital
2:AR:ARG:ARGENTINA:MENDOZA:General Alvear
3:AR:ARG:ARGENTINA:MENDOZA:Godoy Cruz
4:AR:ARG:ARGENTINA:MENDOZA:Guaymallen
5:AR:ARG:ARGENTINA:MENDOZA:Junin
6:AR:ARG:ARGENTINA:MENDOZA:La Paz
7:AR:ARG:ARGENTINA:MENDOZA:Las Heras
8:AR:ARG:ARGENTINA:MENDOZA:Lavalle
9:AR:ARG:ARGENTINA:MENDOZA:Lujan
10:AR:ARG:ARGENTINA:MENDOZA:Maipu
11:AR:ARG:ARGENTINA:MENDOZA:Malarge
12:AR:ARG:ARGENTINA:MENDOZA:Rivadavia
13:AR:ARG:ARGENTINA:MENDOZA:San Carlos
14:AR:ARG:ARGENTINA:MENDOZA:San Martin
15:AR:ARG:ARGENTINA:MENDOZA:San Rafael
16:AR:ARG:ARGENTINA:MENDOZA:Sata Rosa
17:AR:ARG:ARGENTINA:MENDOZA:Tunuyan
18:AR:ARG:ARGENTINA:MENDOZA:Tupungato
...
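The level 2 ordering can be sketched as below. Two points are assumptions rather than facts about Admin2LexSorter: the numbering restarts at 1 inside each level 1 container (as the two samples suggest), and the comparison is done on the level 1 name first and the level 2 name second, which gives the same order as sorting on the concatenation whenever no level 1 name is a prefix of another. Admin2SortSketch is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch (not the STEM Admin2LexSorter): sorts level 2 names within their container. */
final class Admin2SortSketch {

    /**
     * @param prefix leading fields of each row, e.g. "AR:ARG:ARGENTINA"
     * @param pairs  each element is {level1Name, level2Name}
     */
    static List<String> sortedRows(String prefix, List<String[]> pairs) {
        List<String[]> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator.comparing((String[] p) -> p[0]).thenComparing(p -> p[1]));

        List<String> rows = new ArrayList<>();
        String container = null;
        int index = 0;
        for (String[] p : sorted) {
            if (!p[0].equals(container)) {
                // Assumption: numbering restarts for every level 1 container.
                container = p[0];
                index = 0;
            }
            index++;
            rows.add(index + ":" + prefix + ":" + p[0] + ":" + p[1]);
        }
        return rows;
    }
}
```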
*Other Sources of Data Used for Data Generation*
**Area and Population Data : GeoHive and CityPopulation**
Most of the area and population data we use was downloaded from GeoHive (http://www.geohive.com/default1.aspx). GeoHive
is a site with geopolitical data and statistics on the human population. However, we also found data at City Population (http://www.citypopulation.de/).
City Population's site provides global population statistics.
**ISO 3166 Data**
The main source of ISO 3166 data is an official ISO 3166 database purchased from ISO (International Organization
for Standardization). The database contains data for both ISO 3166-1 and ISO 3166-2.
**Lexicographically Sorted Data**
This is a set of data files generated by running Admin1LexSorter and Admin2LexSorter. This data will be useful when creating
identifiers for each location during the data generation process.
**URLs**
This is the set of data files containing the URLs of all the data we found online. We kept track of all the sources of our data and all of them
are accounted for. For each country, there is a corresponding URL file found under org.eclipse.stem/dataMigration/input/urls. A URL file consists
of rows of the form : KEYWORD*URL. For example, the URL file for Iraq looks like :
// Extracted from org.eclipse.stem/dataMigration/input/urls
GEOHIVE*http://www.xist.org/cntry/iraq.aspx
WIKI*http://en.wikipedia.org/wiki/Irak
CITYPOP*http://en.wikipedia.org/wiki/Irak
DIVA*http://www.cipotato.org/DIVA/data/DataServer.asp?AREA=IRQ&THEME=_adm
CIA*https://www.cia.gov/cia/publications/factbook/geos/iz.html
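Rows like the ones in the sample can be read by splitting each line at its first '*'. This is a minimal sketch; UrlFileParser is a hypothetical class, and skipping lines that contain no '*' is an assumption about how malformed or annotation lines should be treated.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not part of STEM): reads KEYWORD*URL rows into a map. */
final class UrlFileParser {

    /** Returns keyword -> URL in file order; lines without a '*' are ignored. */
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> urls = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            int sep = line.indexOf('*');
            if (sep < 0) {
                continue;
            }
            // Split at the first '*' only, so a '*' inside the URL is preserved.
            urls.put(line.substring(0, sep).trim(), line.substring(sep + 1).trim());
        }
        return urls;
    }
}
```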
*Data Generators*
**Introduction**
A data generator is a simple Java program that takes as input a file or a set of files from the Diva set and produces as output a property file for STEM.
As it processes data, a generator goes through transitions according to the administrative level of the data that it is currently processing. It is expected
that the level of data will follow the pattern of transitions : Level_0 -> Level_1 -> Level_2. We now explain the function of each data generator.
**Generation of Names Properties Files**
To generate a name property file, we run org.eclipse.stem.utility\src\generators\NameGenerator.java. After running NameGenerator, the result is a
new name property file for a given country. The corresponding name file for Angola looks as follows :
// Extracted from C:\stemII\org.eclipse.stem.internal.data\resources\data\country\AGO\AGO_names.properties.
...
# Level 1 (admin 1 = e.g., state)
AO-BGO = BENGO
AO-BGU = BENGUELA
AO-BIE = BIE
AO-CAB = CABINDA
AO-CCU = CUANDO CUBANGO
...
The purpose of a names property file is to define the identifiers for every administrative division in a country at each level.
A single names property file will be generated for every country. For more information on STEM's properties files, please read "An Introduction To STEM Properties Files"
found at org.eclipse.stem.utility\docs.
**Generation of Nodes Properties Files**
To generate a node property file, we run org.eclipse.stem.utility\src\generators\NodeGenerator.java. After running NodeGenerator, the result is a
new node property file for a given country. For a single country, a node property file will be generated for each administrative level. For Argentina, the corresponding node
property files produced will be :
ARG_0_node.properties
ARG_1_node.properties
ARG_2_node.properties
A sample node property file for Argentina looks as follows :
// Extracted from org.eclipse.stem.internal.data\resources\data\country\ARG_1_node.properties
...
AR-B = BUENOS AIRES, 01, B
AR-C = BUENOS AIRES D.F., 02, C
AR-K = CATAMARCA, 03, K
AR-H = CHACO, 04, H
AR-U = CHUBUT, 05, U
AR-X = CORDOBA, 06, X
AR-W = CORRIENTES, 07, W
AR-E = ENTRE RIOS, 08, E
AR-P = FORMOSA, 09, P
AR-Y = JUJUY, 10, Y
AR-L = LA PAMPA, 11, L
AR-F = LA RIOJA, 12, F
AR-M = MENDOZA, 13, M
AR-Q = NEUQUEN, 15, Q
AR-R = RIO NEGRO, 16, R
AR-A = SALTA, 17, A
AR-J = SAN JUAN, 18, J
AR-D = SAN LUIS, 19, D
AR-Z = SANTA CRUZ, 20, Z
AR-S = SANTA FE, 21, S
AR-G = SANTIAGO DEL ESTERO, 22, G
AR-V = TIERRA DEL FUEGO, 23, V
AR-T = TUCUMAN, 24, T
AR-N = MISIONES, 14, N
...
The purpose of a node property file is to provide additional information about identifiers of administrative divisions. For more
information on STEM's properties files, please read "An Introduction To STEM Properties Files" found at org.eclipse.stem.utility\docs.
**Generation of Population Properties Files**
There are two intermediate steps to be done before generating a population property file. First, we have to run both Admin1LexSorter and Admin2LexSorter to generate a set of files where
the administrations of a given country are sorted lexicographically, as explained earlier in this document. Next, we run org.eclipse.stem.utility.generators\PopulationProfiler.java.
PopulationProfiler.java is a program that creates a profile of all the level 1 administrations of a country based on their population and on the number of
level 2 administrations contained in each. This is a necessary step since we are missing some exact population data, and by knowing the relationship of a level 2 administration to its
container we can compute an approximate population value. Finally, we run org.eclipse.stem.utility\PopulationGenerator.java to generate a population property file. For a single country,
a population property file will be generated for each administrative level. For Argentina, the corresponding population property files produced will be :
ARG_0_human_2006.properties
ARG_1_human_2006.properties
ARG_2_human_2006.properties
A sample population property file for Argentina looks as follows :
// Extracted from org.eclipse.stem.internal.data\resources\data\country\ARG_1_human_2006.properties
...
# The population identifier
POPULATION = human
# State/Province
AR-B = 13827203
AR-C = 2776138
AR-K = 334568
AR-H = 984446
AR-U = 413237
AR-X = 3066801
AR-W = 930991
AR-E = 1158147
AR-P = 486559
AR-Y = 611888
AR-L = 299294
AR-F = 289983
AR-M = 1579651
AR-Q = 474155
...
The purpose of a population property file is to provide population data of administrative divisions. For more
information on STEM's properties files, please read "An Introduction To STEM Properties Files" found at org.eclipse.stem.utility\docs.
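Since the generated files use the KEY = VALUE layout with '#' comments, they can be read with the standard java.util.Properties class. The sketch below shows one way a consumer might look up a population value; PopulationFileSketch is a hypothetical class and says nothing about how STEM itself loads these files.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical reader (not part of STEM) for a *_human_2006.properties style file. */
final class PopulationFileSketch {

    /** Loads properties text; '#' lines are treated as comments by Properties.load. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the population stored under the given identifier, e.g. "AR-B". */
    static long populationOf(Properties props, String id) {
        String value = props.getProperty(id);
        if (value == null) {
            throw new IllegalArgumentException("No population entry for " + id);
        }
        return Long.parseLong(value.trim());
    }
}
```

To read a file from disk, the same load call on Properties accepts a java.io.Reader or InputStream opened on the file.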
**Generation of Area Properties Files**
As in the population generation process, there is an intermediate step to be done before generating an area property file. We have to run
org.eclipse.stem.utility\PolygonAreaGenerator.java to generate a set of files with areas in polygon units for every administration in a country. This is
a necessary step since we are missing some exact area data. Having polygon area values (in polygon units) helps to compute an approximate area value. Finally,
we run org.eclipse.stem.utility\AreaGenerator.java to generate an area property file. For a single country, an area property file will be generated for each
administrative level. For Argentina, the corresponding area property files produced will be :
ARG_0_area.properties
ARG_1_area.properties
ARG_2_area.properties
A sample area property file for Argentina looks as follows :
// Extracted from org.eclipse.stem.internal.data\resources\data\country\ARG_1_area.properties
...
# This is the source of the data
SOURCE = http://www.xist.org/cntry/argentina.aspx
# SOURCE = http://en.wikipedia.org/wiki/Argentina
# State/Province
AR-B = 307571
AR-C = 200
AR-K = 102602
AR-H = 99633
AR-U = 224686
AR-X = 165321
AR-W = 88199
AR-E = 78781
AR-P = 72066
AR-Y = 53219
AR-L = 143440
AR-F = 89680
AR-M = 148827
AR-Q = 94078
AR-R = 203013
...
The purpose of an area property file is to provide area data of administrative divisions. For more
information on STEM's properties files, please read "An Introduction To STEM Properties Files" found at org.eclipse.stem.utility\docs.
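To illustrate how polygon units can stand in for missing area data, the sketch below computes a planar polygon area with the shoelace formula and scales a known parent area by a child's share of the polygon units. Both choices are assumptions made for illustration: the source does not state which formula PolygonAreaGenerator uses or how AreaGenerator derives its approximation, and PolygonAreaSketch is a hypothetical class.

```java
/** Hypothetical sketch (not the STEM PolygonAreaGenerator or AreaGenerator). */
final class PolygonAreaSketch {

    /**
     * Shoelace formula on raw coordinate pairs (c0,c1, c2,c3, ...).
     * The result is in squared coordinate units ("polygon units"), not square kilometres.
     */
    static double polygonUnits(double[] coords) {
        int n = coords.length / 2;
        double twiceArea = 0.0;
        for (int i = 0; i < n; i++) {
            int j = (i + 1) % n; // wrap around so the ring is closed implicitly
            twiceArea += coords[2 * i] * coords[2 * j + 1] - coords[2 * j] * coords[2 * i + 1];
        }
        return Math.abs(twiceArea) / 2.0;
    }

    /** Assumption: a child's area is the parent's known area times its share of polygon units. */
    static double approximateArea(double parentArea, double childUnits, double totalChildUnits) {
        return parentArea * childUnits / totalChildUnits;
    }
}
```

A planar formula ignores the curvature of the Earth, so it is only meaningful as a relative measure between polygons of the same region, which is all the proportional scaling needs.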
**Map Generation**
To generate maps we run an instance of org.eclipse.stem.utility\src\generators\GMLGenerator. GMLGenerator will create an XML file for each country in our dataset.
The maps are created using GML (Geography Markup Language), which is an XML grammar. According to Wikipedia, "The Geography Markup Language (GML) is the XML grammar defined by the Open Geospatial Consortium (OGC) to express
geographical features. GML serves as a modeling language for geographic systems as well as an open interchange format for geographic transactions on the Internet." All maps are found under
org.eclipse.stem.geography\resources\data\geo\country\. There can be more than one map for each country. In fact, there is a map for every administrative level for which we have polygon data available.
For Argentina, there are three corresponding maps :
ARG_0_MAP.xml
ARG_1_MAP.xml
ARG_2_MAP.xml
As an example, for Armenia the contents of the level 1 map look as follows :
# Taken from ARM_1_MAP.xml
<Map>
<title>ARM Level 1 Map</title>
<subTitle>Administrative Boundaries</subTitle>
<updated>Tue Nov 07 16:57:55 PST 2006 </updated>
<entry>
<georss:where>
<gml:Polygon gml:id="AM-ER">
<gml:outerBoundaryIs>
<gml:LinearRing>
<gml:posList>
41.301422 45.004055 ... 41.301624 45.000002 41.301422 45.004055
</gml:posList>
</gml:LinearRing>
</gml:outerBoundaryIs>
</gml:Polygon>
</georss:where>
</entry>
</Map>
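The polygon element in the sample can be assembled from an identifier and a coordinate list. The following is a sketch only, not the GMLGenerator source: GmlPolygonSketch is a hypothetical class, the identifier is written without XML escaping, and repeating the first point when the ring is not already closed is an assumption based on the sample, where the last pair equals the first.

```java
/** Hypothetical sketch (not the STEM GMLGenerator): builds one gml:Polygon element. */
final class GmlPolygonSketch {

    /** coords holds the ring as consecutive coordinate pairs, in the order they are written out. */
    static String polygon(String id, double[] coords) {
        if (coords.length < 2 || coords.length % 2 != 0) {
            throw new IllegalArgumentException("Expected an even, non-zero number of coordinates");
        }
        StringBuilder posList = new StringBuilder();
        for (int i = 0; i < coords.length; i++) {
            if (i > 0) {
                posList.append(' ');
            }
            posList.append(coords[i]);
        }
        // Close the ring by repeating the first point if the input did not already do so.
        int last = coords.length - 2;
        boolean closed = coords[0] == coords[last] && coords[1] == coords[last + 1];
        if (!closed) {
            posList.append(' ').append(coords[0]).append(' ').append(coords[1]);
        }
        return "<gml:Polygon gml:id=\"" + id + "\">\n"
                + "<gml:outerBoundaryIs>\n"
                + "<gml:LinearRing>\n"
                + "<gml:posList>\n"
                + posList + "\n"
                + "</gml:posList>\n"
                + "</gml:LinearRing>\n"
                + "</gml:outerBoundaryIs>\n"
                + "</gml:Polygon>";
    }
}
```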
**Putting It All Together : Running PropertiesGenerator**
To generate all the properties files and maps, we have developed a program whose purpose is to run all the generators on a given list of
countries. To do this, we run org.eclipse.stem.utility/src/generators/PropertiesGenerator.java. The generator goes through each
country in the list and does all the necessary work. The main logic of PropertiesGenerator is shown next :
...
final int CONFIG_FILE = 0;
final int PARAMS = 1;

if (args.length < PARAMS) {
    System.out.println("--Wrong arguments--"); //$NON-NLS-1$
    System.out.println("\tTo run, please provide the following argument(s) : "); //$NON-NLS-1$
    System.out.println("\t\t Configuration file"); //$NON-NLS-1$
    System.exit(1);
}

// Generate the names.properties files for each country.
NameGenerator nameGen = new NameGenerator(args[CONFIG_FILE]);
nameGen.run();
// Run garbage collection
System.gc();

// Generate the population.properties files for each country.
PopulationGenerator popGen = new PopulationGenerator(args[CONFIG_FILE]);
popGen.run();
// Run garbage collection
System.gc();

// Generate the area.properties files for each country.
AreaGenerator areaGen = new AreaGenerator(args[CONFIG_FILE]);
areaGen.run();
// Run garbage collection
System.gc();

// Generate the node.properties files for each country.
NodeDataGenerator nodeDataGen = new NodeDataGenerator(args[CONFIG_FILE]);
nodeDataGen.run();
// Run garbage collection
System.gc();

// Generate the GML files for each country.
GMLGenerator gmlGen = new GMLGenerator(args[CONFIG_FILE]);
gmlGen.run();
...