Topological Data Analysis

I was privileged to receive training in Ayasdi and to apply it to some business problems. Its USP is the use and visualisation of Topological Data Analysis (TDA): an approach that puts similar data points “near” each other and connects them as nodes and edges. There are a great many ways to define the notion of distance, and TDA takes advantage of the “kernel trick” to make them practical (the same trick employed in SVMs). Some of these calculations would ordinarily be far too involved or even impossible, but the kernel trick allows a calculation in a high-dimensional space to be reduced to an inner product (à la Mercer’s Theorem). This is great because it opens up many more areas of exploration. Consider: instead of just the dimensions X and Y, we can transform them into higher-dimensional spaces with ease; we can also try weird and wonderful new features like x^7 y^3 + x^2 y^9, as long as we can express the final function as a dot product.
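As a small illustration of the kernel trick mentioned above, here is a minimal sketch in Python. It is not Ayasdi’s implementation; the feature map `phi`, the kernel `poly_kernel` and the sample points are my own illustrative choices. It shows that a degree-2 polynomial kernel on two-dimensional points gives the same number as a dot product in an explicitly constructed three-dimensional feature space, without ever building that space.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def phi(p):
    """Explicit feature map for the kernel (a . b)^2 on 2-D points (illustrative choice)."""
    x, y = p
    return (x * x, math.sqrt(2) * x * y, y * y)


def poly_kernel(a, b):
    """Same quantity, computed directly from the original 2-D points."""
    return dot(a, b) ** 2


a = (1.0, 2.0)
b = (3.0, 0.5)

explicit = dot(phi(a), phi(b))   # work done in the 3-D feature space
implicit = poly_kernel(a, b)     # work done in the original 2-D space

print(explicit, implicit)
```

The two values agree up to floating-point rounding. For higher-degree features the explicit map grows quickly, while the kernel evaluation stays a single dot product followed by a power.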

The final visuals bear some similarity to SOMs (Self Organising Maps), but TDA has much more flexibility. A SOM is limited by the way it works: it has a pre-defined grid that must remain fully connected at the end, just as it began. As a result there can be sparse nodes or forced nodes. I found this article which gives some more thought behind the issue:

2018026 som training

The results get interesting when there are clear anomalies such as a hot spot, a flare, or an island. It’s a very visual and exploratory approach to analysis, in contrast to machine learning. I was certainly convinced by the technique and would add it as another tool in the box. Have a look here for some examples of the visuals on Ayasdi’s site:

I did experience a limitation. Although there are ways to automate evaluation, such as measuring the KS statistic, counting edges and nodes, or using a custom F-stat, I observed that the best results really came from eyeballing the visual. This restricted the usefulness of the tool in a production environment. Thinking of it as a segmentation tool, it can deploy segments but cannot reliably conduct online segmentation, e.g. updating segmentation rules as a batch of strange new data points comes in.
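For readers unfamiliar with the KS statistic, here is a minimal sketch of the two-sample version in plain Python. I am assuming the comparison is between the values of one feature inside a segment and the values outside it; that framing, the function name `ks_statistic` and the sample numbers are illustrative only and not taken from the tool.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    max_gap = 0.0
    # Walk through the pooled values in order, tracking each sample's CDF.
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = abs(i / len(a) - j / len(b))
        max_gap = max(max_gap, gap)
    return max_gap


# Hypothetical feature values inside a segment and in the rest of the data.
segment = [1, 2, 3, 4]
rest = [3, 4, 5, 6]

print(ks_statistic(segment, rest))
```

A value near 0 means the segment looks like the rest of the data on that feature; a value near 1 means the two barely overlap. A score like this can rank candidate segments, though as noted above it did not replace looking at the picture.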

