Downloading Large Amounts of Data

Occasionally you may need to download large amounts of data. This can be done with the wget utility. Because bulk downloads are rare but put extra load on the server, please make an arrangement with Roger Nelson before starting one.

If you wish to retrieve the entire database for a given day, month, or year, you can do so by retrieving the compressed CSV files directly (about 20 to 25 MB per day when uncompressed). An effective way to do this is with a program called 'wget', which is available for both Linux and Windows (the Windows version can be found at http://gnuwin32.sourceforge.net/packages/wget.htm). You can also download one day at a time from a regular web browser and simply save the resulting file.
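Once a compressed file has been saved, it can be checked and inspected without unpacking it to disk. The following is a minimal sketch using the standard gzip and head tools; the function name peek_gz is made up for this illustration and is not part of the GCP site.

```shell
#!/bin/sh
# peek_gz FILE [N]
# Print the first N lines (default 5) of a gzip-compressed file
# without writing the uncompressed data to disk.
# (peek_gz is a hypothetical helper name used only for this example.)
peek_gz() {
    file=$1
    lines=${2:-5}
    # gzip -t tests the integrity of the compressed file; stop if it is damaged.
    gzip -t "$file" || return 1
    # gzip -dc decompresses to standard output; head keeps only the first lines.
    gzip -dc "$file" | head -n "$lines"
}
```

For example, `peek_gz basketdata-2010-01-01.csv.gz 10` would show the first ten lines of the file retrieved for Jan 1, 2010.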

A wget command looks something like the following example, which retrieves a single day of data. To fetch a different day, change both the 2010 (year directory) and the 2010-01-01 (date) portions of the URL. (The command must be all on one line.)

1) Retrieve a day (20-25 MB) -- example is Jan 1, 2010 (command all on one line):

wget -N -r -nH --cut-dirs=1 --limit-rate=125k http://global-mind.org/data/eggsummary/2010/basketdata-2010-01-01.csv.gz
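Because the year appears twice in the URL, it is easy to change one part and forget the other. The sketch below builds the URL from a single date; it assumes that other days follow the same naming pattern as the Jan 1, 2010 example, and the function names gcp_day_url and gcp_fetch_day are made up for this illustration.

```shell
#!/bin/sh
# gcp_day_url YYYY-MM-DD
# Print the URL of one day's compressed CSV file.
# Assumption: every day follows the pattern of the 2010-01-01 example.
gcp_day_url() {
    day=$1
    case $day in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "expected a date of the form YYYY-MM-DD" >&2; return 1 ;;
    esac
    year=${day%%-*}    # the part before the first hyphen, e.g. 2010
    printf 'http://global-mind.org/data/eggsummary/%s/basketdata-%s.csv.gz\n' "$year" "$day"
}

# gcp_fetch_day YYYY-MM-DD
# Retrieve one day with the same options as the example command.
gcp_fetch_day() {
    url=$(gcp_day_url "$1") || return 1
    wget -N -r -nH --cut-dirs=1 --limit-rate=125k "$url"
}
```

With these definitions, `gcp_fetch_day 2010-01-01` runs the same command as the example above.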

A somewhat similar method can be used for larger amounts of data. If you need to do this, please contact us.

2) Retrieve a month (~750 MB) -- example is Jan, 2010 (command all on one line):

wget -N -r -nH --cut-dirs=1 --limit-rate=125k http://global-mind.org/data/eggsummary/wget/wget201001.html

3) Retrieve a year (~9 GB) -- example is 2010 (command all on one line):

wget -N -r -nH --cut-dirs=1 --limit-rate=125k http://global-mind.org/data/eggsummary/wget/wget2010.html

Note that in each case the command must be on a single line, although the presentation in your browser may have split it.
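To get a feel for how long these transfers take at the rate limit used in the commands, here is a rough estimate. It is only a sketch: it treats 125k as 125 kilobytes per second with 1000 kilobytes to the megabyte, and it uses the sizes quoted above. If those sizes refer to uncompressed data, the compressed downloads will be smaller and the real times shorter.

```shell
#!/bin/sh
# transfer_seconds SIZE_MB RATE_KB
# Rough transfer time in seconds for SIZE_MB megabytes at RATE_KB kilobytes
# per second. Assumes 1 MB = 1000 kB; integer arithmetic rounds down.
# (transfer_seconds is a hypothetical helper name used only for this example.)
transfer_seconds() {
    size_mb=$1
    rate_kb=$2
    echo $(( size_mb * 1000 / rate_kb ))
}

transfer_seconds 25 125      # one day:   prints 200
transfer_seconds 750 125     # one month: prints 6000
transfer_seconds 9000 125    # one year:  prints 72000
```

That is a little over 3 minutes for a day, about 100 minutes for a month, and about 20 hours for a year, which is one reason to arrange large downloads in advance.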

For reference, the options in the commands above do the following:

-N -- turn on timestamping: compare time and date, and don't download a file again if you already have the latest copy

-r -- retrieve recursively: follow links in a retrieved page and go into subdirectories, if needed

-nH -- don't create a subdirectory for global-mind.org (a so-called "host" directory)

--cut-dirs=1 -- don't create a 'data' subdirectory, but do create an 'eggsummary' subdirectory. To avoid the latter, you could use --cut-dirs=2 instead, and so forth...

--limit-rate=125k -- limit the download speed to 125 kilobytes per second; be nice and share the server bandwidth with other users.
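The effect of -nH and --cut-dirs on where a file is saved can be worked through by hand. The sketch below is a simplified emulation written for illustration; it does not call wget, and the function name local_path is made up for this example.

```shell
#!/bin/sh
# local_path URL N
# Print the local path that -nH --cut-dirs=N would give for URL.
# Simplified emulation for illustration only; wget itself is not called.
local_path() {
    url=$1
    n=$2
    path=${url#*://}    # drop the scheme:  global-mind.org/data/eggsummary/...
    path=${path#*/}     # drop the host, as -nH does:  data/eggsummary/...
    while [ "$n" -gt 0 ]; do
        case $path in
            */*) path=${path#*/} ;;   # drop one leading directory
            *) break ;;               # only the file name is left
        esac
        n=$((n - 1))
    done
    printf '%s\n' "$path"
}

url=http://global-mind.org/data/eggsummary/2010/basketdata-2010-01-01.csv.gz
local_path "$url" 1    # prints eggsummary/2010/basketdata-2010-01-01.csv.gz
local_path "$url" 2    # prints 2010/basketdata-2010-01-01.csv.gz
```

With --cut-dirs=1 the file lands under eggsummary/2010/, and with --cut-dirs=2 it lands under 2010/ only.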

