14 June 2013

Open (Source) for Business?: My First Attempt at Deploying an Open Source Spatial Database

All modern geographers face the critical decision of selecting an appropriate GIS. There are many options from which to choose, and each offers unique benefits while invariably containing some drawbacks. Open source solutions such as GRASS GIS, Viking GIS, and my current favorite, Quantum GIS, have the main advantages of being free to use and allowing for great data portability between applications. In terms of closed source GIS solutions, ESRI's ArcGIS leads the pack. ArcGIS is used prominently by governments and other large organizations that want a solution that offers reliability and comes with support and complete documentation.

Well-structured data is essential to success in any GIS project.
All good databases are thoughtfully designed in advance.

I cut my spatial teeth at the University of California, Santa Barbara, where ESRI is the GIS of choice. Reflecting on my early GIS years, I feel somewhat shortchanged that my instructors did not expose me and my colleagues to open source GIS solutions. Once I graduated, I lost my school-provided ESRI seat license, and I needed to find a cheaper alternative to ESRI's products. And by cheaper, I mean free, because when it comes to GIS software, the options are either extremely expensive (ESRI) or free (most other GIS applications).

After experimenting with a few options, I settled on Lisboa's Quantum GIS. Quantum GIS (qgis for short) has a clean and logical UI and is a comfortable switch for an ArcGIS user such as myself. I quickly discovered how to use the functionality I was looking for, and soon I was contentedly using qgis to import, clip, rasterize, merge, project, and export data.

Before too long, I had some large CSV datasets to work with for a consulting project. I converted the data into a shapefile, and quickly learned that shapefiles are not an effective storage format for 1.5 million data points, which unfortunately is the size of the data set with which I am working.

I wish I had more RAM. (Not actually me in this picture).
As my Dell laptop (with a humble Core i3 chip and 4 gigs of slow RAM) tried desperately to crank away at spatial queries on this enormous shapefile, I became intimately acquainted with all varieties of program and system crashes. After a few unsuccessful days of booting and rebooting, I decided I needed to put my data in a more scalable format.

Enter: The Spatial Database. 

Had I been using ESRI, a Geodatabase would have been just the ticket to manage such a large dataset. I fired up qgis and started looking for the "create GeoDB" function. Well, as it turns out, Geodatabases are a feature proprietary to ArcGIS, and ESRI does not embrace the sharing of functionality and open data standards which are common in the open source community. I was stunned to discover that Geodatabases, as they exist in ArcGIS, are unique to ESRI products, and that a GeoDB cannot be accessed by any other application.

My assumption was that any database used to store spatial data was a Geodatabase. I was less than thrilled to learn that my relatively extensive experience working with Geodatabases would not apply to spatial databases in any other application... and that every other spatial data application utilizes a common db protocol that ESRI alone does not embrace. Humph!

Soon I found myself in the world of painful, general error messages. Help me, please!


How is a spatial database different from a "capital G" Geodatabase? After all, they serve largely the same function, in that they both are a fast way to store large amounts of spatial data in a hierarchical structure. ESRI's database solution is designed from the ground up to work with spatial data. It's very much a black-box solution - you give it data, and it lets you use that data in ArcGIS. A user does not have visibility into the configuration and implementation of the database - this is all handled under the hood by ArcGIS. As a user of open source GIS software, I learned that if I wanted to utilize the power of databases with my spatial data, I would have to roll my own database solution.

I have only a very general idea of how databases work: deploy the database, import your data, and then reference the database in an application to view and modify the data. So far, I have managed to get the database instance installed on my local machine. I selected PostgreSQL (Postgres) as my database system, and installed the PostGIS extension to add spatial functionality to it. I used the included tool to store data (a single shapefile) in the database, and I can access that data from qgis. My understanding is that the main value of a database for spatial data is that data does not have to be contained within the shapefile format; instead, data of different types can be easily cross-referenced, which allows for powerful analysis. Additionally, databases are much faster than shapefiles when manipulating large datasets, and are less prone to corruption during data writes.
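To make the "cross-referencing" idea concrete, here is a minimal sketch of a join between point locations and a separate attribute table. It rests on several assumptions: Python's built-in sqlite3 module stands in for Postgres so the example runs without a server, the table and column names (sites, readings, site_id) are hypothetical, and coordinates are stored as plain numbers rather than as PostGIS geometry.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for Postgres/PostGIS.
# All table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites (site_id INTEGER PRIMARY KEY, name TEXT, lon REAL, lat REAL);
    CREATE TABLE readings (site_id INTEGER REFERENCES sites(site_id), value REAL);
    CREATE INDEX readings_site_idx ON readings (site_id);
""")

# Point locations live in one table...
conn.executemany(
    "INSERT INTO sites VALUES (?, ?, ?, ?)",
    [(1, "Campus", -119.85, 34.41), (2, "Harbor", -119.69, 34.40)],
)
# ...and non-spatial measurements live in another, linked by site_id.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [(1, 10.0), (1, 14.0), (2, 7.0)],
)


def mean_reading_per_site(connection):
    """Join locations to tabular readings and average the readings per site."""
    cursor = connection.execute("""
        SELECT s.name, s.lon, s.lat, AVG(r.value)
        FROM sites AS s
        JOIN readings AS r ON r.site_id = s.site_id
        GROUP BY s.site_id
        ORDER BY s.site_id
    """)
    return cursor.fetchall()


summary = mean_reading_per_site(conn)
for name, lon, lat, mean_value in summary:
    print(name, lon, lat, mean_value)
```

The point of the sketch is the shared key: the locations and the measurements never have to be packed into one file, and the index on site_id is what keeps the lookup fast as the tables grow.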

I am still confused about the following:
  • How does one relate shapefiles to each other in a spatial database?
  • Can shapefiles of similar features (i.e. the same feature class) be combined upon import?
  • What is the best way to import bulk geodata into a spatial database?
  • How can I be sure data attributes are preserved?
  • How do I move data from my database back to a shapefile for transfer to a colleague?
  • Shapefiles are a great way to share data - is there an analog for sharing data from a database, or is export to shapefile the best method?
  • How can projections and coordinate reference issues best be handled?
  • How do I modify data in a database (the db equivalent of editing a shapefile)?
  • Do databases support topologically-aware feature editing?
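
One candidate for the import and export questions is GDAL's ogr2ogr command-line utility, which can read and write both shapefiles and PostGIS tables. The sketch below only assembles the command lines and does not run them. Its assumptions: GDAL is installed with its PostgreSQL driver, the database is reachable by name alone, and the database name, table name, and file names are hypothetical placeholders.

```python
import shlex


def bulk_import_commands(shapefiles, table, dbname, target_srs=None):
    """Build one ogr2ogr call per shapefile, all loading into the same table.

    The first call creates the table; later calls append to it, which is one
    way to combine shapefiles of the same feature class on import.
    """
    commands = []
    for index, shapefile in enumerate(shapefiles):
        cmd = ["ogr2ogr", "-f", "PostgreSQL", f"PG:dbname={dbname}",
               shapefile, "-nln", table]
        if index > 0:
            cmd += ["-update", "-append"]   # add rows to the existing table
        if target_srs:
            cmd += ["-t_srs", target_srs]   # reproject while loading
        commands.append(cmd)
    return commands


def export_command(table, out_shapefile, dbname):
    """Build an ogr2ogr call that writes one table back out as a shapefile."""
    return ["ogr2ogr", "-f", "ESRI Shapefile", out_shapefile,
            f"PG:dbname={dbname}", table]


# Hypothetical names, for illustration only.
imports = bulk_import_commands(["north.shp", "south.shp"], "survey_points",
                               "gisdb", target_srs="EPSG:4326")
export = export_command("survey_points", "survey_points.shp", "gisdb")

for cmd in imports + [export]:
    print(shlex.join(cmd))
```

Each list could be handed to subprocess.run once the names match a real setup. Whether every attribute survives the round trip is worth checking by hand, since shapefiles truncate field names to ten characters.
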
As I continue to tinker and learn, I hope to answer most if not all of the above questions. In the meantime, if you have any suggestions, please do let me know!
