| Feb - 05 - 2007 |
| |
| ************************************************************************************************************************************************************************************************************************************************** |
| |
| *An Introduction to STEM II Data Reduction Utilities* |
| |
| |
| The size of the maps produced by the data generation utilities can be quite large. When we run a simulation in STEM II |
| that involves the whole world it might take a few seconds to load. This is noticeable and is expected since for such a |
| global simulation we deal with hundreds of maps, each of while can be quite big. As an example, for Argentina, we have the |
| following file sizes : |
| |
| ARG_0_MAP.xml 318 KB |
| ARG_1_MAP.xml 1,375 KB |
| ARG_2_MAP.xml 4,813 KB |
| |
| |
| **Data Reduction** |
| |
| To overcome this problem, we created a set of utilities that produce a reduced data set. Ideally, we want to reduce the |
| size of a file by 50% but still having the maps to display nicely with as few gaps and overlaps as possible. We created a Down-sampling |
| or reduction algorithm for our map's polygons based on the Law of Cosines from Trigonometry. By derivation from the formula for the law of cosines, |
| we can find each of the internal angles of a triangle: alpha, beta, and gamma. Our Down-sampling will consist of marking for deletion all points in |
| our polygon where the angle is less than a given threshold, say 5 degrees, with the condition that on a single pass we cannot delete two consecutive points. |
| By avoiding to delete two consecutive points during the same pass, we reduce the chance of having undesirable visual artifacts like gaps. |
| |
| **The Down-sampling Algorithm** |
| |
| The algorithm we designed is an iterative one. It is a rather simple one. It takes two arguments, a threshold in degrees and the number of iterations. |
| The algorithm works as follows, every three points we apply the derivation of the law of cosines to find each of the three interior angles for the triangle formed |
| by three consecutive points. Next, we apply a clever algorithm to decide which is the best point to remove (if any) from the current selection. The logic that |
| deals with marking points for deletion and "sliding" to the next selection is the following : |
| |
| # Taken from org.eclipse.stem.utility/generators/SinglePassDownSampler |
| |
| // Check to see if the initial, leftmost point is greater than |
| // alpha or if it is UNSAFE. If so, then slide forward the window. |
| if (a1 > alpha) { |
| // Just slide forward the window one unit. |
| left_border += 1; |
| right_border += 1; |
| continue; |
| } |
| |
| if (s1.equals(UNSAFE)) { |
| |
| if (a2 < alpha && a2 <= a3) { |
| // [ P1 = UNSAFE, P2 = SAFE, P3 = UNSAFE ] |
| // Advance sliding window after P2 |
| s2 = SAFE; |
| s3 = UNSAFE; |
| } else if (a3 < alpha) { |
| // [ P1 = UNSAFE, P2 = UNSAFE, P3 = SAFE ] |
| // Advance sliding window after P2 |
| s3 = SAFE; |
| s2 = UNSAFE; |
| } |
| |
| } else if (a1 < alpha && a1 < a2) { |
| // [ P1 = SAFE, P2 = UNSAFE, P3 = UNKNOWN ] |
| // Advance sliding window after P2 |
| s1 = SAFE; |
| s2 = UNSAFE; |
| // Now check to see if is safe to mark P3 too. |
| if (a3 < alpha) { |
| // [ P1 = SAFE, P2 = UNSAFE, P3 = SAFE ] |
| // Advance sliding window after P2 |
| s3 = SAFE; |
| } |
| } else if (a1 < alpha && a1 > a2) { |
| // [ P1 = UNSAFE, P2 = SAFE, P3 = UNSAFE ] |
| // SPECIAL CASE : Advance sliding window after P3 |
| s1 = UNSAFE; |
| s2 = SAFE; |
| s3 = UNSAFE; |
| // Special case : set flag to advance sliding window after P3 |
| flag = true; |
| |
| } else if (a1 > alpha && a2 > alpha) { |
| // [ P1 = UNSAFE, P2 = UNSAFE, P3 = UNKNOWN ] |
| // Advance sliding window after P2 |
| s1 = UNSAFE; |
| s2 = UNSAFE; |
| } |
| |
| NOTE: A sliding window refers only to a selection of three consecutive points. |
| |
| |
| **Down-sampling Utilities** |
| |
| |
| To reduce a data set, we run org.eclipse.stem.utility/src/generators/DownSampler.java. We need to specify two arguments : a threshold |
| and the number of iterations. Internally, DownSampler will run both SinglePassDownSampler for the first run and the MultiPassDownSampler |
| for the remaining iterations. |
| |
| At each iteration, two files are generated : a reduced map and a reduced data file. The name of the reduced map tells us during which iteration |
| it was generated. For example, for ARG_1_REDUCED_MAP_2.xml would tell us that this is a level 1 map for Argentina created during the second iteration. |
| Similarly, ARG_REDUCED_2.txt is the reduced data file that was produced during the second iteration. ARG_REDUCED_2.txt will be the input data source for the |
| down-sampling algorithm during the third iteration. What's more, ARG_1_REDUCED_MAP_2.xml can be regenerated from this file. For a threshold of five degrees and |
| for five iterations, in the particular case of Austria we have that all the following files are generated : |
| |
| |
| # Reduced maps : |
| AUT_1_REDUCED_MAP_1.xml |
| AUT_1_REDUCED_MAP_2.xml |
| AUT_1_REDUCED_MAP_3.xml |
| AUT_1_REDUCED_MAP_4.xml |
| AUT_1_REDUCED_MAP_5.xml |
| |
| # Reduced data files : |
| AUT_REDUCED_1.txt |
| AUT_REDUCED_2.txt |
| AUT_REDUCED_3.txt |
| AUT_REDUCED_4.txt |
| AUT_REDUCED_5.txt |
| |
| **Down-sampling statistics** |
| |
| The following two metrics were considered when generating a reduced data set out of the Diva set : points-per-polygon and map file size. The results are |
| captured in the following two tables : |
| |
| |
| |
| |
| # REDUCTION METRIC : AVG POINTS-PER-POLYGON |
| |
| |
| Threshold (in degrees) Avg. points-per-polygon (after 5 iterations) Percentage of reduction |
| |
| INITIAL VALUE 250 x |
| 1 179 29% |
| 2 169 32.40% |
| 3 164 34.40% |
| 4 160 36% |
| 5 157 37.25% |
| |
| |
| |
| |
| # REDUCTION METRIC : FILE SIZE |
| |
| Threshold (in degrees) Size (in KB, after 5 iterations) Percentage of reduction |
| |
| ORIGINAL SIZE USA_1_MAP.xml = 2987 x |
| USA_2_MAP.xml = 17266 x |
| |
| USA_1_MAP.xml = 2216 26% |
| USA_2_MAP.xml = 14259 16% |
| |
| USA_1_MAP.xml = 1742 42% |
| USA_2_MAP.xml = 12091 30% |
| |
| USA_1_MAP.xml = 1689 44.50% |
| USA_2_MAP.xml = 11720 32% |
| |
| USA_1_MAP.xml = 1649 45% |
| USA_2_MAP.xml = 11464 33.70% |
| |
| USA_1_MAP.xml = 1607 46.30% |
| USA_2_MAP.xml = 11239 35% |
| |
| |
| |
| We see that by running the down-sampling utility we get a reduction of around 37.25% in the average number of points-per-polygon, a |
| reduction of 46.30% on the level 1 map and a reduction of 35% on the level 2 map. The average reduction of both maps combined is approx 41%. |
| |
| |
| **Reduction in Precision** |
| |
| We come close to achieving a 50% reduction in file size by decreasing the precision on the reduced data files from six decimal digits to four. |
| |
| |
| |
| |
| |