Skip to main content
Peckham
PROJECT
GPT
TRADES
Machine LearningResearchNLP

GPT-Powered Stock Trading Research

Can a large transformer model act as an all-in-one, news-based trading bot? A senior research project.

Read the full paper PDF.

Background

Back in the dark ages (yes, I mean before Chat-GPT took the internet by storm), transformer-based text generation models were known only to AI researchers and the nerdiest of programmers. It took until the release of OpenAI's GPT-2 in early 2019 before I joined the hype train. Like everyone else at the time, I wanted to co-opt the power of transformers for fun and profit, but alas OpenAI was only allowing other big-time researchers, or deep-pocketed corporations access to their groundbreaking technology. Thankfully in the meantime, Ben Wang and Aran Komatsuzaki were assembling a crack team of open-source gods to compete with OpenAI. Working at breakneck pace, they released GPT-J, a free and open competitor to GPT-2 and GPT-3 by early 2021, just in time for me to make use of their work in my senior research project for my college degree.

Timeline

Timeline

My Research Project

Concept

In mid-2021, there was lots of excitement around the flexibility of large transformer models. I wanted to see if I could use GPT-J as an all-in-one news-based trading bot. The idea being, if a model has read enough of the internet as background, it should understand the context of a news article, and output a trading signal for a given company.

Input: News article about a company ⮕ Output: Buy/Sell/Neutral signal

Data gathering

For the best chance of success, I decided to fine-tune the GPT-J-6B model with data formatted like the prompts I intended to test with. I scraped and labeled 140,000 news articles from the below web sources:

Data Sources

To avoid rate limiting, I created a custom web scraper to use and manage a pool of rotating proxies. I also used a custom rate limiter to ensure that I was not making too many requests to any given domain. Running the scraper from an OracleVM, I was able to scrape around 40 articles per minute and complete all 140,000 articles in around 2.5 days.

class Scraper:
    """
    Scrapes a list of seed urls and calls the parser function to process each result.
    The parser function should accept two arguments: the response object and the url.
    The parser may pass a list of new urls to scrape with the addUrls(urls) method.
    The scraper attempts to never scrape the same url twice.
    """
    def __init__(self, seedURLs = [], parser = None, options = None):
        emptyLogFile()
        self.options = {
            'cookieDirectory': './cookies/',
            'scrapeThreads': 10,
            'rateLimits': {},
            'maxAttempts': 3,
            'proxyUpdateInterval': 10
        }
        if options:
            self.options.update(options)

        self._cookies = loadCookies(self.options['cookieDirectory'])
        self._parser = parser

        self._urlQueue = queue.PriorityQueue()
        self._finishedUrls = {}
        self._finishedUrlsLock = threading.Lock()
        self.addUrls(seedURLs)

        self._threadNumber = self.options['scrapeThreads']

        self._proxySessions = queue.Queue()
        for proxy in self.verifyProxies(self.getProxies()):
            self._proxySessions.put(proxy)

        self._rateLimits = {}
        self._rateLimitsLock = threading.Lock()
        for domain in self.options['rateLimits']:
            self._rateLimits[domain] = {
                'limit': self.options['rateLimits'][domain],
                'lastRequest': {}
            }

    def addUrls(self, urls):
        """ Adds a list of urls to the list of urls to scrape. """
        count = 0
        with self._finishedUrlsLock:
            for url in urls:
                if url not in self._finishedUrls:
                    self._urlQueue.put((0, url))
                    count += 1

    def run(self):
        """ Starts the scraping threads and the proxy maintenance thread. """
        scrapingThreads = []
        for i in range(self._threadNumber):
            t = threading.Thread(target = self._scrape)
            t.start()
            scrapingThreads.append(t)

        proxyMaintThread = threading.Thread(target = self._maintainProxies)
        proxyMaintThread.start()

        for t in scrapingThreads:
            t.join()
        proxyMaintThread.join()
        successfulScrapes = len([v for v in self._finishedUrls.values() if v])
        print(f"Finished scraping. {len(self._finishedUrls)} urls scraped. {successfulScrapes} successful scrapes.")

I then split my dataset into a training set with 90,000 articles and into validation/testing sets with 10,000/40,000 articles respectively.

Training

Because the model weights are over 60GB, specialized computing was required for this training task. The training was done with a Google TPU-v3 generously donated by Google's TPU Research Cloud. I trained my model using the following hyperparameters:

{
  "layers": 28,
  "d_model": 4096,
  "n_heads": 16,
  "n_vocab": 50400,

  "warmup_steps": 160,
  "anneal_steps": 1530,
  "lr": 1.2e-4,
  "end_lr": 1.2e-5,
  "weight_decay": 0.1,
  "total_steps": 1700,

  "tpu_size": 8,

  "bucket": "peckham_tpu_europe",
  "model_dir": "mesh_jax_stock_model_slim_f16",

  "train_set": "stocks.train.index",
  "val_set": {
    "stocks": "stocks.val.index"
  },

  "val_batches": 5777,
  "val_every": 500,
  "ckpt_every": 500,
  "keep_every": 10000
}

Training Loss

At first, I was encouraged by seeing a meaningful decrease in training loss, however, now I believe this was due to overfitting to the training set.

Training Loss

Results

Fine-tuned model: 50.58% (N=40K) correct guesses of stock price movement direction.

Standard GPT-J model: 50.63% (N=40K) correct guesses of stock price movement direction.

Human testing: 56% (N=120) correct guesses of stock price movement direction.

Conclusion

In my testing, GPT-J performs no better than random when predicting the direction of price movement based on a news story. Furthermore, fine-tuning the model does not improve prediction accuracy.

I believe these results are due to the following factors:

  • The model (except for a small amount of fine-tuning) is not trained on financial data.
  • The data was not cleaned well enough to remove noise.
  • Text data is not a good representation of financial data. In other words, the model doesn't care about the actual numbers, only how well they "sound" in the context of the article.
  • The articles may also not have enough information to predict price movement anyway. News is often considered a lagging indicator of price movement, and the articles I scraped were often published after the relevant price movement had already occurred.