By Uday Keith, Byte Academy
In the previous post in this series, Uday presented an overview of the K-Means algorithm.
Choosing the “right” K
Elbow Method
The elbow method allows us to decide on a value of K with a visual aid. We break the data into different numbers of clusters K and plot each value of K against the corresponding W(Ck). An example is below.
We choose the value of K at the point where the decrease in W(Ck) begins to flatten out. So, for the example below, the optimal K appears to be 2, since the drop in W(Ck) between K = 1 and K = 2 is much larger than the drop between K = 2 and K = 3. In other words, we visually look for the “elbow” of the curve.
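As a hedged sketch of this procedure, the following computes W(Ck) for a range of K using a minimal K-means (Lloyd's algorithm); the dataset, seed, and cluster counts are illustrative assumptions, not data from this series:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated blobs, so the "elbow" should appear at K = 2.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(5.0, 0.3, (50, 2))])

def kmeans_inertia(X, k, n_iter=50):
    # Minimal Lloyd's algorithm; returns the within-cluster sum of squares W(Ck).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return dist.min(axis=1).sum()

# W(Ck) for K = 1..5; plotting these against K gives the elbow chart.
inertias = [kmeans_inertia(X, k) for k in range(1, 6)]
```

Here the drop from K = 1 to K = 2 dwarfs all later drops, which is exactly the elbow described above.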
Silhouette Method/Analysis
Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette coefficient measures how close each point in one cluster is to points in the neighboring clusters, and thus provides a way to assess parameters like the number of clusters. This measure has a range of [-1, 1].
Silhouette coefficients (as these values are called) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that those samples might have been assigned to the wrong cluster.
The Silhouette Coefficient is calculated using the mean within-cluster distance/variation (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b – a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of.
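The coefficient can be computed directly from that definition. This sketch uses made-up, well-separated clusters, so the scores should land near +1:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two tight, well-separated blobs with known labels.
X = np.vstack([rng.normal(0.0, 0.2, (20, 2)),
               rng.normal(4.0, 0.2, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)

def silhouette_samples(X, labels):
    # Pairwise Euclidean distances between all samples.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()                 # mean within-cluster distance
        b = min(D[i, labels == c].mean()     # mean nearest-cluster distance
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

scores = silhouette_samples(X, labels)
```

With clusters this well separated, the mean coefficient comes out close to +1; overlapping clusters would pull it toward 0.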
There is a visual component to plotting the silhouettes, which you can follow at: http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
Conclusion
Clustering is challenging because, at the outset, it is not clear whether the output will be useful. If we create 3 clusters or 8 clusters from a dataset, how do we know which choice is correct? Say that, using the Online Retail Dataset, we concluded there are 6 customer types, or clusters. Based on this, the marketing department of the company sent email advertisements to customers according to their cluster assignment. The clustering would be useful if customers interested in deals on electronic products actually received an email with those products; they would hopefully click on the advertisement and purchase an item.
Over time, we could evaluate the clustering based on the overall response of customers to the email advertisements. Many clicks or purchases would indicate an appropriate clustering; if not, the clustering clearly needs to be adjusted.
This is exactly the challenge with clustering using K-means or any other method. While guides like the Elbow or Silhouette method exist, we can never be entirely sure of the validity of our clustering. Even so, K-means is a powerful and quick algorithm that, used wisely in conjunction with domain knowledge, can produce great results.
Any trading symbols displayed are for illustrative purposes only and are not intended to portray recommendations.
Byte Academy is based in New York, USA. It offers coding education, classes in FinTech, Blockchain, DataSci, Python + Quant.
This article is from Byte Academy and is being posted with Byte Academy’s permission. The views expressed in this article are solely those of the author and/or Byte Academy and IB is not endorsing or recommending any investment or trading discussed in the article. This material is for information only and is not and should not be construed as an offer to sell or the solicitation of an offer to buy any security. To the extent that this material discusses general market activity, industry or sector trends or other broadbased economic or political conditions, it should not be construed as research or investment advice. To the extent that it includes references to specific securities, commodities, currencies, or other instruments, those references do not constitute a recommendation by IB to buy, sell or hold such security. This material does not and is not intended to take into account the particular financial conditions, investment objectives or requirements of individual customers. Before acting on this material, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice.
The articles in this series are available as follows: Part I, Part II, Part III, Part IV, Part V, Part VI, and Part VII
In this post, Kris will discuss the importance of backtesting in algo trading.
Nearly all research related to algorithmic trading is empirical in nature. That is, it is based on observations and experience. Contrast this with theoretical research, which is based on assumptions, logic and a mathematical framework. Often, we start with a theoretical approach (for example, a time-series model that we assume describes the process generating the market data we are interested in) and then use empirical techniques to test the validity of our assumptions and framework. But we would never commit money to a mathematical model that we merely assumed described the market without testing it against real observations, and every model is based on assumptions; to my knowledge, no one has ever come up with a comprehensive model of the markets based on first-principles logic and reasoning. So empirical research will nearly always play a role in the type of work we do in developing trading systems.
So why is that important?
Empirical research is based on observations that we obtain through experimentation. Sometimes we need thousands of observations in order to carry out an experiment on market data, and since market data arrives in real time, we might have to wait a very long time to run such an experiment. If we mess up our experimental setup or think of a new idea, we would have to start the process all over again. Clearly this is a very inefficient way to conduct research.
A much more efficient way is to simulate our experiment on historical market data using computers. In the context of algorithmic trading research, such a simulation of reality is called a backtest. Backtesting allows us to test numerous variations of our ideas or models quickly and efficiently and provides immediate feedback on how they might have performed in the past. This sounds great, but in reality, backtesting is fraught with difficulties and complications, so I decided to write an article that I hope illustrates some of these issues and provides some guidance on how to deal with them.
Why Backtest?
Before I get too deeply into backtesting theory and its practical application, let’s back up and talk about why we might want to backtest at all. I’ve already said that backtesting helps us to carry out empirical research quickly and efficiently.
In the world of determinism (that is, well-defined cause and effect), natural phenomena can be represented by tractable, mathematical equations. Engineers and scientists reading this will be well-versed, for example, in Newton’s laws of motion. These laws quantify a physical consequence given a set of initial conditions and are solvable by anyone with a working knowledge of high-school-level mathematics. The markets, however, are not deterministic (at least not in the sense that the information we can readily digest describes the future state of the market).
Backtesting on past data could help provide a framework in which to conduct experiments and gather information that supports or detracts from a conclusion.
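As an illustration of what such a simulation involves, here is a minimal, hedged sketch of a vectorised backtest on synthetic prices. The moving-average crossover rule, window lengths, and return parameters are arbitrary assumptions for demonstration, not anything from this series:

```python
import numpy as np
import pandas as pd

# Synthetic daily returns stand in for real market data.
rng = np.random.default_rng(42)
returns = pd.Series(rng.normal(0.0002, 0.01, 1000))
prices = 100 * (1 + returns).cumprod()

# Toy rule: long when the fast moving average is above the slow one.
fast = prices.rolling(10).mean()
slow = prices.rolling(50).mean()

# Shift by one bar so today's signal trades tomorrow, avoiding
# look-ahead bias, one of the classic backtesting pitfalls.
position = (fast > slow).astype(int).shift(1).fillna(0)

strategy_returns = position * returns
equity = (1 + strategy_returns).cumprod()  # simulated equity curve
```

Even a sketch this small already embeds experimental choices (signal lag, window lengths, no costs or slippage), which is precisely where backtests become fraught.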
Backtesting accuracy can be affected by:
In the next post, Kris will discuss Development Methodology.
Learn more about Robot Wealth here: https://robotwealth.com/
This article is from Robot Wealth and is being posted with Robot Wealth’s permission. The views expressed in this article are solely those of the author and/or Robot Wealth and IB is not endorsing or recommending any investment or trading discussed in the article.
By Lamarcus Coleman
In this post, we will learn about linear discriminant analysis and how it can be used within quantitative portfolio management. We will briefly review what linear discriminant analysis is and apply it to managing the risk of a quantitative portfolio.
What is Linear Discriminant Analysis? Or What is LDA?
Linear Discriminant Analysis, also known as LDA, is a supervised machine learning algorithm that can be used as a classifier and is also commonly used to achieve dimensionality reduction. To get an idea of what LDA is seeking to achieve, let’s briefly review linear regression. Linear regression is a parametric, supervised learning model. The goal of linear regression is to predict a quantitative response, or label. We seek to achieve this by building a linear model that can take in one or more parameters, or predictors. These predictors can be quantitative (numerical) or qualitative (categorical).
Unlike in linear regression, where we seek to construct a model to predict a specific quantitative response, in LDA our goal is to predict a qualitative response or a class label. An example of this is constructing a model to predict the direction of an asset rather than one, in linear regression terms, that would predict the actual price of the asset. More specifically, LDA would seek to predict the probability of the direction.
You may be thinking that if Logistic Regression and LDA are both attempting to find the conditional probability of an observation falling within a specific class, or k, given a value for x, or our predictor, why wouldn’t we just use Logistic Regression?
In short, Logistic Regression works well when our class, or qualitative response, is binary, for example 1/0 or True/False. But when we work with data in which our class may have more than two possible states, LDA can provide a better alternative. Also, LDA may give us a better prediction when there is a high degree of separability between our classes and, under certain conditions, can be more stable than Logistic Regression.
Though LDA seeks to solve the same equation as that used in Logistic Regression, it does so in a completely different way. In Logistic Regression, we directly calculate the conditional probability provided above. We calculate our coefficients by using maximum likelihood. We then plug our coefficients into the logistic function to derive our conditional probability.
Instead of using maximum likelihood and the logistic function, LDA seeks to solve the conditional probability equation using Bayes’ Theorem.
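In standard notation (assumed here, rather than taken from the original figures), the posterior LDA targets via Bayes’ Theorem can be written as:

```latex
P(Y = k \mid X = x) \;=\; \frac{\pi_k \, f_k(x)}{\sum_{l=1}^{K} \pi_l \, f_l(x)}
```

where \(\pi_k\) is the prior probability that an observation belongs to class \(k\), and \(f_k(x)\) is the class-conditional density of \(x\) within class \(k\).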
You may be wondering what I mean by collective variance. In order to calculate the likelihood, LDA must assume some probability density function; the model assumes a Gaussian distribution. This means the model will create a separate distribution of x for each of the k classes, each with its own μ_k but sharing a common σ².
So, in short, when we fit our model, it creates a separate probability distribution for each of the k classes, estimates the values of μ_k, σ², and π_k, and plugs these into the discriminant equation.
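Under that shared-variance Gaussian assumption, the discriminant takes the standard one-dimensional form (stated here as background, not reproduced from the original figures):

```latex
\delta_k(x) \;=\; x \cdot \frac{\mu_k}{\sigma^2} \;-\; \frac{\mu_k^2}{2\sigma^2} \;+\; \log(\pi_k)
```

and an observation \(x\) is assigned to the class \(k\) with the largest \(\delta_k(x)\).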
Let’s Define Our Problem Statement
Now that we’ve gotten a brief overview of LDA, let’s define our problem. We will assume that we are managing a quantitative portfolio of Statistical Arbitrage strategies. A major concern is the maximum loss that we could expect at any given time.
We calculate our VaR, or Value at Risk, and understand with 95% confidence the level below which our returns should not drop. But what about the other 5%, the tail risk?
The tail risk is a major concern with VaR analysis. Intuitively speaking, though we know that our returns may fall beneath some threshold 5% of the time, what concerns us is that we don’t know exactly where within that 5% interval we can expect our returns to fall. What is the maximum loss we can expect within this 5% tail?
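As a quick sketch (with synthetic returns standing in for the portfolio's), the historical 95% VaR and the mean loss inside that 5% tail, often called expected shortfall or CVaR, can be estimated as:

```python
import numpy as np

# Synthetic daily strategy returns stand in for the real portfolio.
rng = np.random.default_rng(7)
returns = rng.normal(0.0005, 0.01, 2500)

var_95 = np.percentile(returns, 5)    # 95% VaR: the 5th-percentile return
tail = returns[returns <= var_95]     # the worst 5% of outcomes
cvar_95 = tail.mean()                 # average loss within that tail
```

The gap between `var_95` and `cvar_95` is exactly the "where within the 5%" question posed above: VaR marks the threshold, while the tail mean summarises what happens beyond it.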
To tackle this problem, we will build a quantitative portfolio of Statistical Arbitrage strategies. We will use pairs within the S&P 500 created using K-Means in an earlier article, found here: https://www.quantinsti.com/blog/k-means-clustering-pair-selection-python/.
Once we construct our portfolio, we will compute our VaR, run a Monte Carlo simulation, and analyze the results. We will then build an LDA model that can help us better understand the probability of returns falling within a certain range beneath our VaR. This will help us better understand our risk, and thus manage our quantitative portfolio.
Building Our Quantitative Portfolio
We will port over our statarb class from our series on K-Means. We will use the stocks found in the first cluster of that analysis to create our Statistical Arbitrage portfolio.
In the next article of this series, Lamarcus will show us how to import our Python libraries.
If you want to learn more about Linear Discriminant Analysis in Python, or to download the code, visit QuantInsti website and the educational offerings at their Executive Programme in Algorithmic Trading (EPAT™).
This article is from QuantInsti and is being posted with QuantInsti’s permission. The views expressed in this article are solely those of the author and/or QuantInsti and IB is not endorsing or recommending any investment or trading discussed in the article.
By J. González, QuantDare
This article was first posted on QuantDare Blog.
When coding in any computer language, performance is always an important feature to take into consideration. But when it comes to Python, this factor becomes crucial. In this post, we will see how the way we develop a function and whether we’re using a library can make significant changes with respect to performance.
Let’s look at two possible implementations of a simple function, which apply different transformations according to the input values:
This second function would obtain the same result as the previous one:
Let’s define 3 equivalent variables in different types: a list of lists, a Numpy array and a DataFrame from Pandas:
Lists
If you apply the two functions defined above to a list of lists, the first function is 5 times faster than the second, just as a consequence of the way they are coded. In function 1, only the code inside the fulfilled condition is executed, while in function 2 all the calculations are done for every figure.
But, what if we worked with Numpy arrays instead of lists? Can we expect the same behavior?
First of all, in order to “map” the first function in Numpy we would need to vectorize it (we will use a decorator), otherwise it would not work. Vectorizing a function allows us to apply the function to the whole array, instead of using a loop.
At first sight, we see that performance has improved a little with the first function, and tremendously with the second. What is most surprising is that the second function is now much faster than the first! But weren’t we saying that the first implementation was faster? Let’s explain what is going on here:
Numpy has what it calls universal functions (ufuncs), which are functions that can receive array-like inputs and return array outputs while operating over each element. This is much the same as what we do when vectorizing, but with faster results, since these functions loop over the elements at a lower level (C implementations). Besides, these functions broadcast (adjust) the input arrays when they have different dimensions.
So the first function is merely “generalized” to operate like a ufunc (the loop still runs in Python), whereas the second function actually uses the ufuncs.
Alright, but I don’t see any ufunc at all! Well, the different operators you see in the formulas, like *, +, &, > are overloaded with the ufuncs multiply(), add(), logical_and() or greater(). Then, for example, x+4.0 would be the same as applying np.add(x, 4.0).
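A quick check that the operator syntax and the underlying ufuncs coincide:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 5)

# The overloaded operators call the ufuncs under the hood:
same_add = np.array_equal(x + 4.0, np.add(x, 4.0))
same_and = np.array_equal((x > 0.25) & (x < 0.9),
                          np.logical_and(np.greater(x, 0.25), np.less(x, 0.9)))
```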
To conclude, we might wonder whether the Pandas library would obtain similar performance to Numpy, given that Pandas uses Numpy arrays underneath.
If we apply a map operation with the functions defined, performance is slower than mapping over a list and, obviously, much slower than Numpy:
If we apply the functions directly to Pandas DataFrames, the first vectorized function performs slower than with a Numpy array; the second function loses all its potential when applied to a DataFrame, with performance similar to that obtained with a list.
Finally, to complicate matters even further, we could use the “apply” DataFrame method, which applies a specified function to entire rows or columns; as we see below, the choice of the axis to operate on is a factor that makes a big difference in terms of performance.
In general, it is advisable not to use this kind of mapping in Pandas if you want an acceptable performance.
So, be warned: the way you implement your code and the choice of the right libraries and functions can make your programs fly or be as slow as molasses in January.
Daring to quantify the markets
This article is from QuantDare and is being posted with QuantDare’s permission. The views expressed in this article are solely those of the author and/or QuantDare and IB is not endorsing or recommending any investment or trading discussed in the article.
Trade Ideas  It's a HOLLY & Human Market: Market Impact of Securing Alpha with Artificial Intelligence
In case you missed it! The webinar recording is available on the IBKR YouTube Channel.
https://www.youtube.com/watch?v=b5mVTaeZDWo
Markets are roiling with seismic change: Black Swan events, the return of volatility, lackluster performance, and a continued string of fund redemptions. Join a conversation on the market impact A.I.-powered technology is making in both active and passive investment management. We’ll discuss the methodologies behind one fintech innovator’s (Trade Ideas) process of applying data science to market analysis.
Information posted on IBKR Quant that is provided by third parties and not by Interactive Brokers does NOT constitute a recommendation by Interactive Brokers that you should contract for the services of that third party. Third-party participants who contribute to IBKR Quant are independent of Interactive Brokers and Interactive Brokers does not make any representations or warranties concerning the services offered, their past or future performance, or the accuracy of the information provided by the third party. Past performance is no guarantee of future results.
We appreciate your feedback. If you have any questions or comments about IBKR Quant Blog please contact ibkrquant@ibkr.com.
The material (including articles and commentary) provided on IBKR Quant Blog is offered for informational purposes only. The posted material is NOT a recommendation by Interactive Brokers (IB) that you or your clients should contract for the services of or invest with any of the independent advisors or hedge funds or others who may post on IBKR Quant Blog or invest with any advisors or hedge funds. The advisors, hedge funds and other analysts who may post on IBKR Quant Blog are independent of IB and IB does not make any representations or warranties concerning the past or future performance of these advisors, hedge funds and others or the accuracy of the information they provide. Interactive Brokers does not conduct a "suitability review" to make sure the trading of any advisor or hedge fund or other party is suitable for you.
Securities or other financial instruments mentioned in the material posted are not suitable for all investors. The material posted does not take into account your particular investment objectives, financial situations or needs and is not intended as a recommendation to you of any particular securities, financial instruments or strategies. Before making any investment or trade, you should consider whether it is suitable for your particular circumstances and, as necessary, seek professional advice. Past performance is no guarantee of future results.
Any information provided by third parties has been obtained from sources believed to be reliable and accurate; however, IB does not warrant its accuracy and assumes no responsibility for any errors or omissions.
Any information posted by employees of IB or an affiliated company is based upon information that is believed to be reliable. However, neither IB nor its affiliates warrant its completeness, accuracy or adequacy. IB does not make any representations or warranties concerning the past or future performance of any financial instrument. By posting material on IB Quant Blog, IB is not representing that any particular financial instrument or trading strategy is appropriate for you.