
Fetch Historical Price Data with Python

Have you ever wanted to get your hands on a CSV file with the historical bitcoin OHLC¹ prices over the past year and use it as a training set for your machine learning algorithms?

Yeah, that’s easy!

Okay… but what about having a granularity of 1 hour, or 30 minutes, or even 5 minutes?

In this post you’ll see how easy this is with just a few lines of Python code.

Environment Setup

Basic dependencies

Before you proceed any further, you need to have Python and pip installed on your system. You can check this by running the following commands in your shell (terminal):

$ python --version
Python 2.7.10

$ pip --version
pip 9.0.1 from /Library/Python/2.7/site-packages (python 2.7)

If you encounter any errors, you might have to install Python from the official website or via Homebrew. The same applies to pip.

Setup a virtual environment (optional)

If you want to work in a clean workspace and avoid conflicts between pip packages from your other projects, you can use virtualenv:

$ pip install virtualenv # the package for virtual Python environments
$ mkdir ~/python-virtualenv # create a folder for your virtual environments
$ cd ~/python-virtualenv
$ virtualenv ohlc-data # create an environment for the project
$ mkdir ~/ohlc-data # create a folder for the project
$ cd ~/ohlc-data
$ source ~/python-virtualenv/ohlc-data/bin/activate # activate the virtual environment for the project
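
When you are done working on the project, you can leave the virtual environment at any time by running deactivate in the same shell.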

Required packages

$ pip install pandas # easy-to-use data structures for data analysis, time series and statistics
$ pip install requests # HTTP for humans
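
To double-check that both packages installed correctly, a quick import test in the Python shell should print their versions without errors (the exact versions will vary on your machine):

import pandas
import requests

print(pandas.__version__)
print(requests.__version__)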

“Give me the data already!”

Okay, so in the interest of keeping this post short, we will only cover how to fetch historical rates using the GDAX API; in a future Part 2 we will look at other sources as well.

First, let’s pull in the dependencies and create a GDAX class to handle our requests and data formatting:

import pandas
import requests
from datetime import datetime, timedelta
from time import sleep

class GDAX(object):
  def __init__(self, pair):
    self.pair = pair
    self.uri = 'https://api.gdax.com/products/{pair}/candles'.format(pair=self.pair)
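
For example, instantiating the class with the BTC-USD product gives us the endpoint we will be querying:

api = GDAX('BTC-USD')
print(api.uri)  # https://api.gdax.com/products/BTC-USD/candles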

Then, we need to split our requests to GDAX into windows, because:

The maximum number of data points for a single request is 200 candles. If your selection of start/end time and granularity will result in more than 200 data points, your request will be rejected. If you wish to retrieve fine granularity data over a larger time range, you will need to make multiple requests with new start/end ranges.
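
To get a feel for the numbers, here is a quick back-of-the-envelope sketch (the date range and granularity are purely illustrative):

from datetime import datetime

# Illustration: two months of 5-minute candles, fetched 100 candles at a time.
start = datetime(2017, 6, 1)
end = datetime(2017, 8, 1)
granularity = 5  # minutes

total_candles = int((end - start).total_seconds() / 60 / granularity)
requests_needed = -(-total_candles // 100)  # ceiling division

print(total_candles)    # 17568
print(requests_needed)  # 176

With that in mind, here is the fetch method: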

def fetch(self, start, end, granularity):
  data = []
  # We will fetch the candle data in windows of at most 100 candles per
  # request, comfortably below the API's 200-candle limit.
  delta = timedelta(minutes=granularity * 100)

  slice_start = start
  while slice_start != end:
    slice_end = min(slice_start + delta, end)
    data += self.request_slice(slice_start, slice_end, granularity)
    slice_start = slice_end

  # I prefer working with structured data instead of plain arrays.
  data_frame = pandas.DataFrame(data=data, columns=['time', 'low', 'high', 'open', 'close', 'volume'])
  data_frame.set_index('time', inplace=True)
  return data_frame

What is self.request_slice(slice_start, slice_end, granularity)? That’s the method that performs the actual request to the GDAX API and makes sure it succeeds, retrying in case of rate limiting or any server/client errors:

def request_slice(self, start, end, granularity):
  # Allow 3 retries (we might get rate limited).
  retries = 3
  for retry_count in xrange(0, retries):
    # From https://docs.gdax.com/#get-historic-rates the response is in the format:
    # [[time, low, high, open, close, volume], ...]
    response = requests.get(self.uri, {
      'start': GDAX.__date_to_iso8601(start),
      'end': GDAX.__date_to_iso8601(end),
      'granularity': granularity * 60  # GDAX API granularity is in seconds.
    })

    if response.status_code != 200 or not len(response.json()):
      if retry_count + 1 == retries:
        raise Exception('Failed to get exchange data for ({}, {})!'.format(start, end))
      else:
        # Exponential back-off: sleep 1.5 ** 0 = 1 s after the first
        # failure and 1.5 s after the second, before giving up.
        sleep(1.5 ** retry_count)
    else:
      # Sort the historic rates (in ascending order) based on the timestamp.
      result = sorted(response.json(), key=lambda x: x[0])
      return result

The last bit of code is __date_to_iso8601, a small helper that simply converts a datetime object to an ISO-8601 formatted string, as required by GDAX:

@staticmethod
def __date_to_iso8601(date):
  return '{year}-{month:02d}-{day:02d}T{hour:02d}:{minute:02d}:{second:02d}'.format(
      year=date.year,
      month=date.month,
      day=date.day,
      hour=date.hour,
      minute=date.minute,
      second=date.second)
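
As an aside, for whole-second datetimes Python’s built-in datetime.isoformat() produces exactly the same string, so you could use it instead of the helper:

from datetime import datetime

print(datetime(2017, 6, 1).isoformat())  # 2017-06-01T00:00:00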

Example usage

Get 30-minute BTC/USD price information:

data_frame = GDAX('BTC-USD').fetch(datetime(2017, 6, 1), datetime(2017, 8, 1), 30)

1-day LTC/EUR:

data_frame = GDAX('LTC-EUR').fetch(datetime(2017, 6, 1), datetime(2017, 8, 1), 1440)

Save to CSV:

data_frame.to_csv('data.csv')
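
If you later want to load the saved data back into pandas with the time index restored, something like this should do:

import pandas

# Read the candles back, using the 'time' column as the index again.
data_frame = pandas.read_csv('data.csv', index_col='time')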

Putting it all together

You can find the complete code (with the examples) on our GitHub.

Please support us

If you liked this post and would like to keep us going, please star the project on GitHub and follow @HackerCrypt.



  1. OHLC stands for Open/High/Low/Close prices.
