| <?php |
| /******************************************************************************* |
| * Copyright (c) 2015 Eclipse Foundation and others. |
| * All rights reserved. This program and the accompanying materials |
| * are made available under the terms of the Eclipse Public License v1.0 |
| * which accompanies this distribution, and is available at |
| * http://eclipse.org/legal/epl-v10.html |
| * |
| * Contributors: |
| * Eric Poirier (Eclipse Foundation) - Initial implementation |
| *******************************************************************************/ |
| ?> |
| |
| <h1 class="article-title"><?php echo $pageTitle; ?></h1> |
| |
| <p> |
| <a target="_blank" |
| href="https://www.locationtech.org/projects/technology.geotrellis">GeoTrellis</a> |
| is a Scala project developed to support low latency geospatial |
| data processing. GeoTrellis currently supports raster Map Algebra |
| functions and transportation network features. It and used to both |
| build fast, scalable web applications and perform batch processing |
| of large data sets by supporting distributed and parallel |
| processing that takes advantage of both multi-core processors and |
| clusters of computing devices. |
| </p> |
| |
| <p>Over the past year the GeoTrellis project has undergone some |
| significant changes. While many of these changes had been |
| contemplated for a while, an opportunity to apply GeoTrellis to |
| processing climate change data helped to accelerate their |
| integration.</p> |
| |
| <h2>It’s About Time for Climate Data</h2> |
| <p>A succession of hurricanes, typhoons, droughts and flooding in |
| recent years has kept climate change on the front page for many |
| communities and has underscored the need for improving local |
| resilience to climate-related events. However, while climate |
| impact at the planetary and continental scale are common news |
| items, it is very difficult to translate this into specific |
| actions that can be taken at the local or regional levels. |
| |
| |
| <p>The conventional data sources for long-term climate forecasts are |
| from the Intergovernmental Panel on Climate Change (IPCC) which |
| releases climate forecasts every five years. These forecasts are |
| the results of global circulation models (GCM) developed by |
| leading climate research labs around the world. Each forecast |
| includes approximately 100 years of projections for multiple |
| temperature and precipitation variables. Each model is run based |
| on the multiple carbon emission scenarios and more than thirty GCM |
| models, each with daily and monthly output. The data released by |
| the IPCC is not high resolution, but there are several efforts to |
| "down-sample" this data to higher resolutions. Whether or not the |
| climate forecasts are down-sampled, the climate forecast database |
| can easily be up to dozens of terabytes in size.</p> |
| |
| <p>GeoTrellis is designed for working with large geospatial |
| databases, but the climate data is not simply geospatial. The data |
| is usually available in the NetCDF format. Each NetCDF file |
| contains monthly or daily forecasts for the next century (to |
| 2099). One could think of this as a stack of spatial layers in |
| which each day or month forecast is a separate slice. GeoTrellis |
| was designed for working with large geospatial layers but not |
| necessarily for data sets that combine space and time.</p> |
| |
| <p> |
| Apart from the climate data, the GeoTrellis framework had some |
| additional objectives that were causing a reconsideration of its |
| architecture. The GeoTrellis project implemented its distributed |
| processing features using <a target="_blank" |
| href="http://akka.io/">Akka</a>, a Scala framework for building |
| concurrent and distributed applications. After implementing some |
| sophisticated geospatial data processing capabilities using the |
| Akka framework, some new requirements began to emerge. These |
| included support for caching, sharding of data sets across a |
| storage cluster, and some more advanced fault tolerance features. |
| It was feasible to build these features on top of our existing |
| Akka-based work, but it was going to be a big project, so we took |
| a look at other frameworks that might help us implement these |
| features more rapidly, and the <a target="_blank" |
| href="https://spark.apache.org/">Spark</a> project emerged as |
| one of the better options. |
| </p> |
| |
| <h2>Lighting a Spark in GeoTrellis</h2> |
| <p> |
| The Apache Spark project was originally developed by <a |
| target="_blank" |
| href="https://en.wikipedia.org/wiki/Matei_Zaharia">Matei Zaharia</a> |
| at UC Berkeley AMPLab to support fast cluster computing. It |
| includes several components. The Spark Core implements distributed |
| tasking, scheduling and basic I/O. Data is partitioned across |
| machines using an abstraction called Resilient Distributed |
| Datasets (RDDs). RDDs are then exposed via language-specific APIs. |
| The developer manipulates the RDDs without having to worry about |
| how the data is stored. The Spark ecosystem also includes Spark |
| SQL for working with structured data, Spark Streaming for working |
| with real-time streams of data, MLib for machine learning, and |
| GraphX for graph data processing. While the ability to shard large |
| data sets across nodes, support for a distributed file system and |
| caching of intermediate results in memory were the key features |
| that attracted us to Spark, the other work happening in the Spark |
| ecosystem was also complementary and would likely have additional |
| benefits in the future. |
| </p> |
| |
| <p>Nonetheless, integrating Spark in GeoTrellis was going to be a |
| significant effort that would break all of the existing |
| functionality. Further, it was clear that even after a Spark-based |
| version of GeoTrellis was complete, it would require additional |
| time to port all of the existing spatial data transformation |
| functions to the new infrastructure. We decided to continue to |
| support the existing Akka-based functionality while the Spark |
| integration occurred in parallel and only deprecate the old |
| architecture once all of the existing features could be supported.</p> |
| |
| <h2>Using Climate Change to Integrate Spark</h2> |
| <p> |
| In spring 2014, one of the companies supporting GeoTrellis |
| development, Azavea, received a research grant from the U.S. |
| Department of Energy to support fast computation of climate impact |
| metrics for local and regional decision-makers. Over the past |
| eight months, we have used the large climate forecast data sets |
| and the climate impact metric calculation use case as the catalyst |
| for making the shift to Spark. Along the way, we have had some |
| incredible support from The <a target="_blank" |
| href="http://www.nature.org/ourinitiatives/urgentissues/global-warming-climate-change/">Nature |
| Conservancy</a>, which has provided a well-organized set of |
| climate forecasts in NetCDF format, assistance from Amazon Web |
| Services (AWS) in the form of a Climate Research Grant, and a lot |
| of hard work on the part of the GeoTrellis contributors. Spark has |
| enabled us to add support for the Hadoop Distributed File System |
| (HDFS). In addition, the large number of tiles that result from |
| sharding tens of thousands of days of climate forecasts has led us |
| to integrate support for <a target="_blank" |
| href="https://accumulo.apache.org/">Accumulo</a> as a mechanism |
| for efficiently indexing the tiles. This work is now largely |
| complete and we are looking forward to a 0.10 release of |
| GeoTrellis that will wrap up this significant re-architecture into |
| a neat package. |
| </p> |
| <br /> <a |
| href="/community/eclipse_newsletter/2014/december/images/climate-toolkit.png"><img |
| src="/community/eclipse_newsletter/2014/december/images/climate-toolkit.png" |
| alt="climate toolkit" width="600" /></a> <br /> |
| |
| <h2>Summary</h2> |
| <p>Working with climate data raised a number of key challenges that |
| caused the GeoTrellis team to accelerate implementation of an |
| otherwise daunting set of architectural changes. By integrating |
| Spark, we have managed to maintain our focus on both lightning-fast |
| spatial data processing and cluster-scale batch processing while |
| also gaining several new features that lay a foundation for future |
| growth in use of the GeoTrellis platform. In the process, we were |
| also able to implement a prototype for a web application that will |
| enable cities and regions to more effectively use the climate |
| forecast data in ways that are relevant to their local |
| circumstances. |
| </p> |
| <p> |
| <i>Some of the work cited in this article was support by a grant |
| from the U.S. Department of Energy (grant # DE-SC0011303).</i> |
| </p> |
| <br /> |
| <h4>Resources</h4> |
| <ul> |
| <li><a target="_blank" |
| href="https://github.com/geotrellis/geotrellis">GeoTrellis on |
| GitHub</a></li> |
| <li><a target="_blank" href="http://geotrellis.io/">GeoTrellis |
| Documentation</a></li> |
| <li><a target="_blank" href="https://spark.apache.org/">Apache |
| Spark</a></li> |
| <li><a target="_blank" |
| href="https://github.com/geotrellis/geotrellis/issues?milestone=6&q=is%3Aopen">Issue |
| tracker for 0.10</a></li> |
| <li><a target="_blank" href="http://aws.amazon.com/nasa/nex/">Down-sampled |
| Climate Data on AWS</a></li> |
| <li><a target="_blank" href="http://www.ipcc-data.org/">IPCC |
| Climate Data Distribution Centre</a></li> |
| <li>IRC: #geotrellis on freenode</li> |
| </ul> |
| |
| |
| <div class="bottomitem"> |
| <h3>About the Authors</h3> |
| |
| <div class="row"> |
| <div class="col-sm-12"> |
| <div class="row"> |
| <div class="col-sm-8"> |
| <img class="author-picture" |
| src="/community/eclipse_newsletter/2014/march/images/robert75.png" |
| width="75" alt="Robert Cheetham" /> |
| </div> |
| <div class="col-sm-16"> |
| <p class="author-name"> |
| Robert Cheetham<br /> |
| <a target="_blank" href="http://www.azavea.com/">Azavea</a> |
| </p> |
| <ul class="author-link"> |
| <li><a target="_blank" href="http://www.azavea.com/blogs/labs/">Blog</a></li> |
| <li><a target="_blank" href="https://twitter.com/rcheetham">Twitter</a></li> |
| <!--<li><a target="_blank" href="https://plus.google.com/+KaiKreuzer">Google +</a></li> |
| $og --> |
| </ul> |
| </div> |
| </div> |
| </div> |
| </div> |
| </div> |