SAT-4 and SAT-6 airborne datasets*


Saikat Basu, Robert DiBiano, Manohar Karki and Supratik Mukhopadhyay, Louisiana State University
Sangram Ganguly, Bay Area Environmental Research Institute/NASA Ames Research Center
Ramakrishna R. Nemani, NASA Advanced Supercomputing Division, NASA Ames Research Center

Images were extracted from the National Agriculture Imagery Program (NAIP) dataset. The NAIP dataset consists of a total of 330,000 scenes spanning the whole of the Continental United States (CONUS). We used the uncompressed digital ortho quarter quad (DOQQ) tiles, which are GeoTIFF images whose coverage corresponds to the United States Geological Survey (USGS) topographic quadrangles. The image tiles average ~6000 pixels in width and ~7000 pixels in height and measure around 200 megabytes each; the entire NAIP dataset for CONUS is ~65 terabytes. The imagery is acquired at a 1-m ground sample distance (GSD) with a horizontal accuracy within six meters of photo-identifiable ground control points. Each image consists of 4 bands - red, green, blue and Near Infrared (NIR).

To preserve the high variance inherent in the entire NAIP dataset, we sampled image patches from a multitude of scenes (a total of 1500 image tiles) covering the whole state of California and spanning different landscapes such as rural areas, urban areas, dense forest, mountainous terrain, small to large water bodies, and agricultural areas. An image labeling tool developed as part of this study was used to manually label uniform image patches belonging to a particular landcover class. Once a patch was labeled, non-overlapping 28x28 blocks were extracted from it with a sliding window and saved to the dataset with the corresponding label. We chose 28x28 as the window size to retain a significantly larger context, while not making the window so big that the relative statistical properties of the target class conditional distributions within the contextual window are lost. Care was taken to avoid interclass overlaps within a selected and labeled image patch.
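As an illustration of the block extraction step, the following minimal Python sketch splits a labeled uniform patch (assumed here to be an H x W x 4 NumPy array) into non-overlapping 28x28 blocks; the function name and patch dimensions are hypothetical, and this is not the labeling tool used in the study.

import numpy as np

def extract_blocks(patch, block=28):
    # Split a labeled uniform image patch (H x W x 4 array) into
    # non-overlapping block x block windows; partial edge blocks are discarded.
    h, w, bands = patch.shape
    blocks = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            blocks.append(patch[r:r + block, c:c + block, :])
    return np.stack(blocks)

# Example: a hypothetical 100x150 pixel labeled patch yields 3 x 5 = 15 blocks.
patch = np.zeros((100, 150, 4), dtype=np.uint8)
print(extract_blocks(patch).shape)   # (15, 28, 28, 4)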

The datasets are available here:
SAT-4 and SAT-6 datasets

Sample images from the SAT-4 and SAT-6 datasets:

Dataset description:

The datasets are encoded as MATLAB .mat files that can be read using the standard load command in MATLAB. Each sample image is 28x28 pixels and consists of 4 bands - red, green, blue and near infrared. The training and test labels are 1x4 and 1x6 one-hot vectors for SAT-4 and SAT-6 respectively, with a single 1 at the index of the sample's class and 0s at all other indices.
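Outside MATLAB, the same .mat files can typically be read with SciPy's loadmat. The sketch below is a minimal example, assuming the file name shown (substitute the actual download) and the variable names listed in the sections that follow; the one-hot label rows are converted to 0-based class indices with argmax.

import numpy as np
from scipy.io import loadmat

data = loadmat("sat-4-full.mat")     # file name is an assumption
train_x = data["train_x"]            # 28x28x4xN uint8 image patches
train_y = data["train_y"]            # Nx4 (SAT-4) or Nx6 (SAT-6) one-hot label rows

# Each label row contains a single 1; argmax recovers the 0-based class index.
class_index = np.argmax(train_y, axis=1)
print(train_x.shape, train_y.shape, class_index[:10])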

SAT-4

SAT-4 consists of a total of 500,000 image patches covering four broad land cover classes: barren land, trees, grassland, and a class consisting of all land cover classes other than the above three. 400,000 patches (four-fifths of the dataset) were chosen for training and the remaining 100,000 (one-fifth) as the test set. We ensured that the training and test sets were drawn from disjoint sets of image tiles. Each image patch is size normalized to 28x28 pixels. Once generated, both the training and test sets were randomized using a pseudo-random number generator.

The MAT file for the SAT-4 dataset contains the following variables:

train_x 28x28x4x400000 uint8 (400000 training samples, each a 28x28 image with 4 channels)
train_y 400000x4 uint8 (one 1x4 one-hot label vector per training sample)
test_x 28x28x4x100000 uint8 (100000 test samples, each a 28x28 image with 4 channels)
test_y 100000x4 uint8 (one 1x4 one-hot label vector per test sample)
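For array-oriented work it is often convenient to index single patches or move the sample axis to the front. The following is an illustrative sketch of working with the SAT-4 arrays listed above (file name assumed):

import numpy as np
from scipy.io import loadmat

sat4 = loadmat("sat-4-full.mat")             # file name is an assumption
train_x = sat4["train_x"]                    # 28 x 28 x 4 x 400000, uint8
train_y = sat4["train_y"]                    # 400000 x 4, uint8 one-hot rows

# The sample index is the last axis of train_x, so one 28x28x4 patch is:
first_patch = train_x[:, :, :, 0]
first_label = int(np.argmax(train_y[0]))

# Move samples to the leading axis, giving a (400000, 28, 28, 4) array.
train_x_nhwc = np.transpose(train_x, (3, 0, 1, 2))
print(first_patch.shape, first_label, train_x_nhwc.shape)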


SAT-6

SAT-6 consists of a total of 405,000 image patches, each of size 28x28, covering 6 landcover classes - barren land, trees, grassland, roads, buildings and water bodies. 324,000 images (four-fifths of the dataset) were chosen as the training set and 81,000 (one-fifth) as the test set. As with SAT-4, the training and test sets were selected from disjoint NAIP tiles. Once generated, the images in the dataset were randomized in the same way as for SAT-4. The specifications for the various landcover classes of SAT-4 and SAT-6 were adopted from those used in the National Land Cover Data (NLCD) algorithm.

The MAT file for the SAT-6 dataset contains the following variables:

train_x 28x28x4x324000 uint8 (324000 training samples, each a 28x28 image with 4 channels)
train_y 324000x6 uint8 (one 1x6 one-hot label vector per training sample)
test_x 28x28x4x81000 uint8 (81000 test samples, each a 28x28 image with 4 channels)
test_y 81000x6 uint8 (one 1x6 one-hot label vector per test sample)
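The SAT-6 arrays follow the same layout, with 6-column label rows. As a small illustrative example using the band ordering described above (red, green, blue, NIR; file name assumed), the RGB composite and NIR band of a single patch can be separated as follows:

import numpy as np
from scipy.io import loadmat

sat6 = loadmat("sat-6-full.mat")             # file name is an assumption
patch = sat6["train_x"][:, :, :, 0]          # one 28x28x4 patch

rgb = patch[:, :, :3]                        # red, green, blue bands
nir = patch[:, :, 3]                         # near-infrared band
label = int(np.argmax(sat6["train_y"][0]))   # 0-based index into the 6 classes
print(rgb.shape, nir.shape, label)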


* To use this dataset, please cite the following paper:

Saikat Basu, Sangram Ganguly, Supratik Mukhopadhyay, Robert DiBiano, Manohar Karki and Ramakrishna Nemani, DeepSat - A Learning Framework for Satellite Imagery, ACM SIGSPATIAL 2015.