blob: e1b28130f420200bd71355ef73e9c65a8c48e81c [file] [log] [blame]
<?php
/*******************************************************************************
* Copyright (c) 2015 Eclipse Foundation and others.
* All rights reserved. This program and the accompanying materials
* are made available under the terms of the Eclipse Public License v1.0
* which accompanies this distribution, and is available at
* http://eclipse.org/legal/epl-v10.html
*
* Contributors:
* Eric Poirier (Eclipse Foundation) - Initial implementation
*******************************************************************************/
?>
<h1 class="article-title"><?php echo $pageTitle; ?></h1>
<p>
<a target="_blank"
href="https://www.locationtech.org/projects/technology.geotrellis">GeoTrellis</a>
is a Scala project developed to support low latency geospatial
data processing. GeoTrellis currently supports raster Map Algebra
functions and transportation network features. It and used to both
build fast, scalable web applications and perform batch processing
of large data sets by supporting distributed and parallel
processing that takes advantage of both multi-core processors and
clusters of computing devices.
</p>
<p>Over the past year the GeoTrellis project has undergone some
significant changes. While many of these changes had been
contemplated for a while, an opportunity to apply GeoTrellis to
processing climate change data helped to accelerate their
integration.</p>
<h2>It’s About Time for Climate Data</h2>
<p>A succession of hurricanes, typhoons, droughts and flooding in
recent years has kept climate change on the front page for many
communities and has underscored the need for improving local
resilience to climate-related events. However, while climate
impact at the planetary and continental scale are common news
items, it is very difficult to translate this into specific
actions that can be taken at the local or regional levels.
<p>The conventional data sources for long-term climate forecasts are
from the Intergovernmental Panel on Climate Change (IPCC) which
releases climate forecasts every five years. These forecasts are
the results of global circulation models (GCM) developed by
leading climate research labs around the world. Each forecast
includes approximately 100 years of projections for multiple
temperature and precipitation variables. Each model is run based
on the multiple carbon emission scenarios and more than thirty GCM
models, each with daily and monthly output. The data released by
the IPCC is not high resolution, but there are several efforts to
"down-sample" this data to higher resolutions. Whether or not the
climate forecasts are down-sampled, the climate forecast database
can easily be up to dozens of terabytes in size.</p>
<p>GeoTrellis is designed for working with large geospatial
databases, but the climate data is not simply geospatial. The data
is usually available in the NetCDF format. Each NetCDF file
contains monthly or daily forecasts for the next century (to
2099). One could think of this as a stack of spatial layers in
which each day or month forecast is a separate slice. GeoTrellis
was designed for working with large geospatial layers but not
necessarily for data sets that combine space and time.</p>
<p>
Apart from the climate data, the GeoTrellis framework had some
additional objectives that were causing a reconsideration of its
architecture. The GeoTrellis project implemented its distributed
processing features using <a target="_blank"
href="http://akka.io/">Akka</a>, a Scala framework for building
concurrent and distributed applications. After implementing some
sophisticated geospatial data processing capabilities using the
Akka framework, some new requirements began to emerge. These
included support for caching, sharding of data sets across a
storage cluster, and some more advanced fault tolerance features.
It was feasible to build these features on top of our existing
Akka-based work, but it was going to be a big project, so we took
a look at other frameworks that might help us implement these
features more rapidly, and the <a target="_blank"
href="https://spark.apache.org/">Spark</a> project emerged as
one of the better options.
</p>
<h2>Lighting a Spark in GeoTrellis</h2>
<p>
The Apache Spark project was originally developed by <a
target="_blank"
href="https://en.wikipedia.org/wiki/Matei_Zaharia">Matei Zaharia</a>
at UC Berkeley AMPLab to support fast cluster computing. It
includes several components. The Spark Core implements distributed
tasking, scheduling and basic I/O. Data is partitioned across
machines using an abstraction called Resilient Distributed
Datasets (RDDs). RDDs are then exposed via language-specific APIs.
The developer manipulates the RDDs without having to worry about
how the data is stored. The Spark ecosystem also includes Spark
SQL for working with structured data, Spark Streaming for working
with real-time streams of data, MLib for machine learning, and
GraphX for graph data processing. While the ability to shard large
data sets across nodes, support for a distributed file system and
caching of intermediate results in memory were the key features
that attracted us to Spark, the other work happening in the Spark
ecosystem was also complementary and would likely have additional
benefits in the future.
</p>
<p>Nonetheless, integrating Spark in GeoTrellis was going to be a
significant effort that would break all of the existing
functionality. Further, it was clear that even after a Spark-based
version of GeoTrellis was complete, it would require additional
time to port all of the existing spatial data transformation
functions to the new infrastructure. We decided to continue to
support the existing Akka-based functionality while the Spark
integration occurred in parallel and only deprecate the old
architecture once all of the existing features could be supported.</p>
<h2>Using Climate Change to Integrate Spark</h2>
<p>
In spring 2014, one of the companies supporting GeoTrellis
development, Azavea, received a research grant from the U.S.
Department of Energy to support fast computation of climate impact
metrics for local and regional decision-makers. Over the past
eight months, we have used the large climate forecast data sets
and the climate impact metric calculation use case as the catalyst
for making the shift to Spark. Along the way, we have had some
incredible support from The <a target="_blank"
href="http://www.nature.org/ourinitiatives/urgentissues/global-warming-climate-change/">Nature
Conservancy</a>, which has provided a well-organized set of
climate forecasts in NetCDF format, assistance from Amazon Web
Services (AWS) in the form of a Climate Research Grant, and a lot
of hard work on the part of the GeoTrellis contributors. Spark has
enabled us to add support for the Hadoop Distributed File System
(HDFS). In addition, the large number of tiles that result from
sharding tens of thousands of days of climate forecasts has led us
to integrate support for <a target="_blank"
href="https://accumulo.apache.org/">Accumulo</a> as a mechanism
for efficiently indexing the tiles. This work is now largely
complete and we are looking forward to a 0.10 release of
GeoTrellis that will wrap up this significant re-architecture into
a neat package.
</p>
<br /> <a
href="/community/eclipse_newsletter/2014/december/images/climate-toolkit.png"><img
src="/community/eclipse_newsletter/2014/december/images/climate-toolkit.png"
alt="climate toolkit" width="600" /></a> <br />
<h2>Summary</h2>
<p>Working with climate data raised a number of key challenges that
caused the GeoTrellis team to accelerate implementation of an
otherwise daunting set of architectural changes. By integrating
Spark, we have managed to maintain our focus on both lightning-fast
spatial data processing and cluster-scale batch processing while
also gaining several new features that lay a foundation for future
growth in use of the GeoTrellis platform. In the process, we were
also able to implement a prototype for a web application that will
enable cities and regions to more effectively use the climate
forecast data in ways that are relevant to their local
circumstances.
</p>
<p>
<i>Some of the work cited in this article was support by a grant
from the U.S. Department of Energy (grant # DE-SC0011303).</i>
</p>
<br />
<h4>Resources</h4>
<ul>
<li><a target="_blank"
href="https://github.com/geotrellis/geotrellis">GeoTrellis on
GitHub</a></li>
<li><a target="_blank" href="http://geotrellis.io/">GeoTrellis
Documentation</a></li>
<li><a target="_blank" href="https://spark.apache.org/">Apache
Spark</a></li>
<li><a target="_blank"
href="https://github.com/geotrellis/geotrellis/issues?milestone=6&q=is%3Aopen">Issue
tracker for 0.10</a></li>
<li><a target="_blank" href="http://aws.amazon.com/nasa/nex/">Down-sampled
Climate Data on AWS</a></li>
<li><a target="_blank" href="http://www.ipcc-data.org/">IPCC
Climate Data Distribution Centre</a></li>
<li>IRC: #geotrellis on freenode</li>
</ul>
<div class="bottomitem">
<h3>About the Authors</h3>
<div class="row">
<div class="col-sm-12">
<div class="row">
<div class="col-sm-8">
<img class="author-picture"
src="/community/eclipse_newsletter/2014/march/images/robert75.png"
width="75" alt="Robert Cheetham" />
</div>
<div class="col-sm-16">
<p class="author-name">
Robert Cheetham<br />
<a target="_blank" href="http://www.azavea.com/">Azavea</a>
</p>
<ul class="author-link">
<li><a target="_blank" href="http://www.azavea.com/blogs/labs/">Blog</a></li>
<li><a target="_blank" href="https://twitter.com/rcheetham">Twitter</a></li>
<!--<li><a target="_blank" href="https://plus.google.com/+KaiKreuzer">Google +</a></li>
$og -->
</ul>
</div>
</div>
</div>
</div>
</div>