Jeffrey Davis, Senior Director, Government Relations and Public Policy at BlackBerry
With BlackBerry Cylance Data Science Team
The world of mobility continues to grow and change. Automobiles, trains and traffic lights are evolving from disparate parts of a loosely federated physical network into key nodes of operation on a far-reaching, virtually connected network. We are transitioning from industries driven by a centralized, top-down view to ones focused on, and shaped by, the consumer.
In many ways, artificial intelligence (AI) and machine learning (ML) made this new focus a reality. From advancements in autonomous vehicles and signaling optimization to ride hailing and advanced mapping, these capabilities are made possible by forms of ML, in some cases operating within larger AI systems. Every year, machines do more and more to aid the world of transportation.
It is imperative that professionals across the mobility sector have a basic understanding of what AI and ML really are beyond buzzwords: what their capabilities and limitations are, when to look for an AI/ML solution, and what an appropriate solution looks like.
Introduction to AI and ML applications for security
According to an October 2016 report issued by the U.S. federal government’s National Science and Technology Council Committee on Technology (NSTCC), “AI has important applications in cybersecurity, and is expected to play an increasing role for both defensive and offensive cyber measures.”[i]
But many people still don’t understand the basics of this important advancement or how it could be applied to the cybersecurity industry.
Machine learning and the security domain
The field of AI encompasses several distinct areas of research. In recent years, though, most of the fruitful research and advancements have come from ML, a sub-discipline of AI. ML focuses on teaching machines to learn by applying algorithms to data.
The security domain generates huge quantities of data from logs, network sensors and other sources indicating various user activities.
Collectively, this mass of data can provide the contextual clues we need to identify and ameliorate threats, but only if we have tools capable of teasing them out.
By acquiring a broad understanding of the activity surrounding the assets under their control, ML systems make it possible for analysts to discern the relationship between events widely dispersed in time and across disparate hosts, users and networks. Properly applied, ML can provide the context we need to reduce the risks of a breach while significantly increasing the “cost of attack.”
Cluster analysis
The purpose of cluster analysis is to segregate data into discrete groups based on key features or attributes. Within a given cluster, data items will resemble each other more closely than they resemble items in other clusters.
In the network security domain, cluster analysis typically proceeds through a well-defined series of data preparation and analysis operations.
Statistical sampling techniques allow us to create a more manageable subset of the data for our analysis. The sample should reflect the characteristics of the total dataset as closely as possible, or the accuracy of results may be compromised.
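One common way to preserve the characteristics of the full dataset is stratified sampling: draw the sample per category so each group keeps its original proportion. The sketch below is illustrative only; the record fields and the `stratified_sample` helper are assumptions, not part of any specific product.

```python
import random

def stratified_sample(records, key, fraction, seed=0):
    """Draw a random sample that preserves the proportion of each
    category (e.g. protocol or event type) found in the full dataset."""
    rng = random.Random(seed)
    by_category = {}
    for rec in records:
        by_category.setdefault(rec[key], []).append(rec)
    sample = []
    for group in by_category.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical flow records keyed by protocol: 80% TCP, 20% UDP
records = [{"proto": "tcp", "bytes": i} for i in range(80)] + \
          [{"proto": "udp", "bytes": i} for i in range(20)]
sample = stratified_sample(records, "proto", 0.1)
```

A plain random sample of this size could easily over- or under-represent the rarer UDP traffic; sampling within each stratum avoids that distortion.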
Then we extract and analyze certain elements of the sample—features—to produce useful insights. Relevant features might include the percentage of ports that are open, closed or filtered, the application running on each of these ports and the application version numbers. If we’re investigating the possibility of data exfiltration, we might want to include features for bandwidth utilization and login times.
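Feature extraction amounts to turning each raw record into a fixed-length numeric vector. A minimal sketch, assuming a hypothetical per-host scan record with the fields named below:

```python
def extract_features(host):
    """Turn one hypothetical host scan record into a numeric feature
    vector: fraction of scanned ports open, bandwidth used, login hour."""
    total_ports = host["open"] + host["closed"] + host["filtered"]
    return [
        host["open"] / total_ports,  # percentage of ports that are open
        host["bandwidth_mb"],        # bandwidth utilization
        host["login_hour"],          # hour of last login (0-23)
    ]

host = {"open": 3, "closed": 95, "filtered": 2,
        "bandwidth_mb": 120.5, "login_hour": 2}
vector = extract_features(host)
```

Each host then becomes one point in feature space, ready for the clustering step that follows.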
We would expect to see the vast majority of the resulting data grouped into a set of well-defined clusters that reflect normal operational patterns, and a smaller number of very sparse clusters, or “noise points,” that indicate anomalous user and network activity.
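Density-based clustering makes the idea of noise points concrete: points in dense regions form clusters, while isolated points get the label -1. The following is a minimal, self-contained sketch of that approach on one-dimensional toy data, not a production algorithm:

```python
def density_cluster(points, eps, min_pts):
    """Minimal density-based clustering (DBSCAN-style): points with at
    least min_pts neighbors within eps seed clusters; isolated points
    are labelled -1 ("noise points")."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points))
                if abs(points[i] - points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1           # noise point: anomalous activity
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs)
        while queue:                 # grow the cluster outward
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            jn = neighbors(j)
            if len(jn) >= min_pts:
                queue.extend(jn)
    return labels

# Toy data: two dense groups of activity plus one outlier
data = [1.0, 1.1, 1.2, 5.0, 5.1, 5.2, 20.0]
labels = density_cluster(data, eps=0.5, min_pts=2)  # outlier labelled -1
```

In a real deployment the inputs would be the multi-dimensional feature vectors described above rather than single numbers, but the noise-point behavior is the same.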
For security applications, we could then probe these anomalies further by grepping through our log data to match this suspect activity to possible bad actors.
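The follow-up search can be as simple as a regular-expression match over log lines for the suspect host. The log format and IP address below are hypothetical:

```python
import re

# Hypothetical log lines; we search for entries from a host
# flagged as anomalous by the clustering step.
logs = [
    "2023-04-01 02:13:07 10.0.0.17 LOGIN ok user=svc_backup",
    "2023-04-01 02:14:55 10.0.0.17 XFER out bytes=91822113",
    "2023-04-01 09:01:02 10.0.0.42 LOGIN ok user=alice",
]

suspect = re.compile(r"\b10\.0\.0\.17\b")
matches = [line for line in logs if suspect.search(line)]
```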
Classification
In ML, classification refers to a set of computational methods for predicting the likelihood that a given sample belongs to a predefined class, such as whether an email belongs to the class “spam,” or whether a network connection is benign or associated with a botnet.
The algorithms used to perform classification are referred to as “classifiers.” There are numerous classifiers available to solve classification problems, each with its own strengths and weaknesses.
To produce an accurate model, analysts need to secure a sufficient quantity of data that has been correctly sampled and categorized. This data is then typically divided into two or three distinct sets for training, validation and testing. As a rule of thumb, the larger the training set, the more likely the classifier is to produce an accurate model.
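The two- or three-way partition described above can be sketched in a few lines; the 70/15/15 proportions here are a common convention, not a requirement:

```python
import random

def split_dataset(samples, train=0.7, val=0.15, seed=0):
    """Shuffle labelled samples, then partition them into training,
    validation, and test sets (the remainder goes to test)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

samples = list(range(100))        # stand-ins for labelled records
train_set, val_set, test_set = split_dataset(samples)
```

Shuffling before splitting matters: log-derived data often arrives sorted by time or source, and an unshuffled split would train and test on systematically different activity.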
Classification via decision trees
Decision tree (DT) algorithms determine whether a data point belongs to one class or another by defining a sequence of “if-then-else” decision rules that terminate in a class prediction. Decision trees are aptly named since they use roots, branches and leaves to produce class predictions.
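A learned decision tree behaves exactly like nested “if-then-else” rules. The hand-written stand-in below shows the shape of such a model; the feature names and thresholds are purely illustrative, not drawn from a real trained tree:

```python
def classify_connection(conn):
    """A hand-written stand-in for a learned decision tree: each
    if/then/else rule is a branch, each return is a leaf."""
    if conn["bytes_out"] > 50_000_000:       # root split
        if conn["login_hour"] < 5:           # branch: large transfer at night
            return "suspicious"              # leaf
        return "benign"                      # leaf
    if conn["failed_logins"] > 10:           # branch: brute-force pattern
        return "suspicious"                  # leaf
    return "benign"                          # leaf

verdict = classify_connection({"bytes_out": 90_000_000,
                               "login_hour": 2,
                               "failed_logins": 0})
```

The difference in practice is that the DT algorithm chooses the split features and thresholds automatically from labelled training data rather than having an analyst write them by hand.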
The DT algorithm intrinsically generates a probability score for every class prediction in every leaf based on the proportion of positive and negative samples it contains. This is computed by dividing the number of samples of either class by the total number of samples in that leaf.
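That per-leaf computation is simple enough to state directly. Assuming a leaf that collected 30 malicious and 10 benign training samples:

```python
def leaf_probability(positives, negatives):
    """Probability scores a decision tree assigns at one leaf: the
    share of training samples of each class that landed there."""
    total = positives + negatives
    return positives / total, negatives / total

# A leaf holding 30 malicious and 10 benign training samples
p_malicious, p_benign = leaf_probability(30, 10)
```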
Once the DT model has been built, it’s subjected to the same testing and validation procedures we described earlier for logistic regression. Once the model has been sufficiently validated, it can be deployed to classify new, unlabeled data.
Deep learning and neural networks
Deep learning is based on a fundamentally different approach that incorporates layers of processing with each layer performing a different kind of calculation. Samples are processed layer-by-layer in stepwise fashion with the output of each layer providing the input for the next. At least one of these processing layers will be “hidden.” It is this multi-layered approach, employing hidden layers, that distinguishes deep learning from all other machine learning methods.
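The layer-by-layer flow can be illustrated with a toy two-layer network. The weights below are arbitrary values chosen for the example, not a trained model; each layer computes a weighted sum plus bias, then a sigmoid activation, and feeds its output to the next layer:

```python
import math

def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum of inputs plus bias
    for each unit, followed by a sigmoid activation."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(1 / (1 + math.exp(-z)))   # sigmoid squashes to (0, 1)
    return outputs

x = [0.5, -1.2]                                             # input features
hidden = dense_layer(x, [[0.4, 0.1], [-0.3, 0.8]], [0.0, 0.1])  # hidden layer
output = dense_layer(hidden, [[1.0, -1.0]], [0.0])              # output layer
```

Here `hidden` is the hidden layer the text refers to: its two values are never shown to the user, only consumed by the next layer.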
The term deep learning encompasses a wide range of unsupervised, semi-supervised, supervised and reinforcement learning methods primarily based on the use of neural networks, a class of algorithms so named because they simulate the ways densely interconnected networks of neurons interact in the brain.
Neural networks are extremely flexible, general-purpose algorithms that can solve a myriad of problems in a myriad of ways. Unlike other algorithms, for example, neural networks can have millions or even billions of parameters defining a model.
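Where do those parameter counts come from? In a fully connected network, each layer of m units fed by n inputs contributes m×n weights plus m biases, so even a modest architecture accumulates parameters quickly. A small counting sketch (the layer sizes are an arbitrary example):

```python
def count_parameters(layer_sizes):
    """Parameters in a fully connected network: each layer of n_out
    units fed by n_in inputs has n_in * n_out weights + n_out biases."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

n_params = count_parameters([1000, 512, 512, 2])  # roughly 0.78M parameters
```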
Like every important new technology, AI has occasioned both excitement and apprehension among industry experts and the popular media.
We stand behind the idea that AI will play a positive role in our lives, as long as AI research and development is guided by sound ethical principles that ensure the systems we build are fully transparent and accountable to humans.
In the near term, however, we think it’s important for security professionals to gain a practical understanding of what AI is, what it can do, and why it’s becoming increasingly important to our careers and the ways we approach real-world security problems.
[i] Preparing for the Future of Artificial Intelligence. Executive Office of the President, National Science and Technology Council Committee on Technology. October 2016.