

Building a Back Testing Platform for Algorithmic Trading

In this series, I'm going to outline, in general terms, how to build a back-testing platform for the creation, tweaking, and subsequent execution of algorithms used in electronic trading.

Part One – The Data

I recently made some comments on Vertica's blog regarding what I considered to be a fairly bold claim. They said that Vertica was the only real column store. But even if it is, so what? In my comments, I alluded to my belief that we optimize problems to solutions – we try to fix stuff using what we've got in our toolbox without having to run to Home Depot.

The real test comes when the rubber hits the road: can you actually solve a problem in a new way that's motivating? And by motivating I mean the solution addresses the issues, enables new capabilities, and is economically attractive.

So rather than tell you that DarkStar and our approach to processing both real-time and historical data (there’s a difference?) is the Real Enchilada, I thought I would illustrate a real world use case.

Let’s say you want to store a bunch of market data.  And I mean a bunch.  You want to store every piece of market data for the whole US Equities market.

And you'd like to have this data so that you can run analytics on it. Or maybe even back-test strategies for buying and selling stocks. So let's assume that you've got some Java code lying around to do that.

For our example, we are interested in seeing whether or not volume weighted average price (VWAP) strategies actually work. We will pretend that we are buying a lot of stock, and the theory we want to test is whether buying that stock during the day, when it's trading below its volume weighted average price, gives us a better average price than just going with the flow (often referred to as volume participation).
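To make that concrete, here's a minimal sketch of how the VWAP signal itself might be computed from a stream of trades. The class and method names are purely illustrative assumptions, not the post's actual strategy code.

// Minimal sketch: running VWAP from a stream of trade prints.
// Names are illustrative only.
class VwapCalculator {
    private double cumulativePriceVolume = 0.0;
    private long cumulativeVolume = 0;

    // Update with one trade print.
    void onTrade(double price, long size) {
        cumulativePriceVolume += price * size;
        cumulativeVolume += size;
    }

    // Current volume weighted average price, or NaN before the first trade.
    double vwap() {
        return cumulativeVolume == 0 ? Double.NaN : cumulativePriceVolume / cumulativeVolume;
    }

    // The signal under test: is the stock currently trading below today's VWAP?
    boolean belowVwap(double lastPrice) {
        return lastPrice < vwap();
    }
}

In a back-test, the same calculator would run over a day's recorded trades; during market hours it would run over the live tape.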

We are all familiar with how relational databases work, and anyone who's been in capital markets for a while knows how futile it would be to use something like Oracle for this due to cost, hardware requirements, and just how difficult it is to get the data into the database in the first place.

Oh that’s right, I forgot to tell you, we are going to have to load this data first.

I am not going to go into the relative benefits of a column store here either; you can check out many other websites for that.

Instead, let's look at some issues. First, I would rather load the data directly into the database as it happens. Staging the data separately is costly and error-prone. In addition, what happens when you decide to load that data and encounter a problem that can't be fixed in time for the next market day? What if you actually run out of space or compute to get caught up? Well, then you can't back-test the next day to further refine your algorithms. Algos should be tweaked every day. New algos need to be developed to remain competitive. So here, a database error costs real money.

So I need a fast data store.

As I am loading the data, what happens if one of my disks goes boom? Or one of my machines goes boom? Well, now I have a problem. If I fail over to another data center, how do I reconcile? What a nightmare!

So I need a data store that we can take a sledgehammer to and it will keep running.

Hey, if I have this big historical data store, I still need to query it while it is being updated. Ideally, I would like to also be running analysis and back-testing during the day. Scheduling jobs to run at night is so very '90s.

So my data store has to facilitate both interactive query and batch analysis.

But wait, doing all of this means that I am going to have to figure out how to use the same code for back-testing that I use to generate orders during market hours. It's either that, or use some visual, script-based, or otherwise different harness for my Java or C++ code. Yet another nightmare.

So, I would like to run the same code against my historical data store that I also use to generate orders during the day.
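Here's a hedged sketch, in Java, of one way to structure that: the strategy implements a tick-listener interface, and both a historical replayer and a live feed drive it through that same interface. All of the names below are assumptions for illustration, not DarkStar APIs.

// Sketch: decouple the strategy from its data source so the identical class
// can be driven by a historical replay in back-testing or a live feed in production.
interface TickListener {
    void onQuote(String symbol, double bid, long bidSize, double ask, long askSize, long timestamp);
}

interface MarketDataSource {
    void subscribe(TickListener listener); // a file replayer and a live feed both implement this
}

// The strategy only ever sees ticks; it neither knows nor cares where they came from.
class VwapParticipationStrategy implements TickListener {
    @Override
    public void onQuote(String symbol, double bid, long bidSize, double ask, long askSize, long timestamp) {
        // update VWAP, then decide whether to send (or simulate) a child order
    }
}

Because nothing in the strategy references its data source, the code you back-test against yesterday's data is exactly the code that trades today.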

There's a bunch of other stuff too: management, instrumentation, removing old data that I don't need for back-testing – all the stuff we associate with normal day-to-day big data operations. We need to know what's going on during the day so that we can be proactive. There's gold in that data!

And one last thing, it would be really cool if most of this technology wasn’t proprietary.  I mean let’s face it, firms that talk more about their investors on their websites than their clients can’t possibly have my best interests at heart.

This is a tall list.  Let’s knock it down, one by one.

Here is a diagram for your consideration.

The diagram isn’t very technical, and that’s on purpose – I’m outlining an algorithm, or methodology that may or may not solve our problem.

In the diagram, I’ve depicted the database as a cluster of machines.  Instead of using one big machine backed by a SAN, I’m going to use a number of machines.  Each of those machines is going to connect to the Market Data source and get data.

As we receive the data, we're going to take a peek at it and determine where in the cluster that data needs to live, and while we're doing that, we're going to write it to disk. A background process will make sure that the data ends up on the node we want it on. More on why that's so incredibly important in Part Two – Analyzing the Data in this series.
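As an illustration of that "peek and route" step, here's a sketch of how a node might decide which member of the cluster should own a given symbol's data. This is an assumption about one simple way to do it (hashing the symbol over the node count), not necessarily how DarkStar actually does it.

// Illustrative routing sketch: pick the cluster node that should own a symbol's data.
final class SymbolPartitioner {
    private final int nodeCount;

    SymbolPartitioner(int nodeCount) {
        this.nodeCount = nodeCount;
    }

    // Node index (0 .. nodeCount-1) that should end up owning this symbol's data.
    int ownerOf(String symbol) {
        // floorMod keeps the result non-negative even when hashCode() is negative
        return Math.floorMod(symbol.hashCode(), nodeCount);
    }
}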

Also, I’m going to ask the cluster to replicate everything we’re writing to it – we’re going to end up writing the data a total of 2 times in this example.  I might usually suggest 3, but we’ve got two data centers running the same solution, so I’ll actually have 4 copies of the data.

Why write the data to multiple nodes in the cluster? First, if a node goes down, I still want to be able to write data. If the node that goes down is the primary node, I'm going to remember that, and when that node comes back up, I'm going to write all the data to it as part of its "Welcome Back to the Cluster Party!" And second, if I'm reading data from the cluster (remember, we've got algos running and users querying this data), I want my data. If a node goes down, your users don't really care – they just want their data. By replicating the data across multiple nodes, I achieve high availability without having to fail over to another instance or data center.
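To tie that together, here's a rough, illustrative sketch in Java of the write path just described: every tick goes to a primary node and a replica, and writes destined for a node that's down get queued for its welcome-back catch-up. It's a sketch of the idea, not the actual implementation.

import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hedged sketch of replicated writes with catch-up queues for downed nodes.
class ReplicatingWriter {
    private final Map<Integer, Queue<byte[]>> catchUpQueues = new ConcurrentHashMap<>();

    void write(byte[] tick, int primaryNode, int replicaNode) {
        writeTo(primaryNode, tick);
        writeTo(replicaNode, tick);
    }

    private void writeTo(int node, byte[] tick) {
        if (nodeIsUp(node)) {
            // send to the node over the network (elided)
        } else {
            // remember the write for the node's "Welcome Back to the Cluster Party!"
            catchUpQueues.computeIfAbsent(node, n -> new ConcurrentLinkedQueue<>()).add(tick);
        }
    }

    private boolean nodeIsUp(int node) {
        return true; // placeholder for real cluster-membership checks
    }
}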

Ok, we've got the sledgehammer test handled, which is cool, but everything I've described above sounds like it's going to take a lot of time and that the system is going to be very slow.

Not true. Each node in the cluster above is subscribing to market data. So if one machine can ingest X messages per second, then a cluster of 10 machines should be able to ingest 10 * X messages per second, right? Let's see what that means in a real-world example:

On May 20, 2010, there were about 1.1 billion BBO messages published via SuperFeed (NYSE's market data platform); those quotes represent the best bid, bid size, offer, and offer size for each stock at any given time. In terms of messages per second, averaged over the trading day, that's about 50,000. In terms of size, that's about 4,500k per second. Hmmm, chunky!

These are intimidating numbers. But if we divide the problem up a bit and use 10 nodes in a cluster, each node only needs to ingest about 5,000 messages, or roughly 450k, per second. All of a sudden, we're dealing with something quite reasonable.

So, now we've got a cluster that can load the entire market in real time, and it's redundant. What about analyzing the data? That's in Part Two – Analyzing the Data, which I'll post next week.



More Stories By Colin Clark

Colin Clark is the CTO for Cloud Event Processing, Inc. and is widely regarded as a thought leader and pioneer in both Complex Event Processing and its application within Capital Markets.

Follow Colin on Twitter at http://twitter.com/EventCloudPro to learn more about cloud based event processing using map/reduce, complex event processing, and event driven pattern matching agents. You can also send topic suggestions or questions to [email protected]
