A previous article explained at a high level how machine learning can help Web3 security. In this post, you’ll learn more specifically how ML can help developers build more effective Forta detection bots.
Machine learning is a versatile tool that can extract meaningful insights from large amounts of data. With open and public blockchain data, anyone can build and train machine learning models to accomplish tasks such as detecting suspicious and unusual activity on the blockchain, categorizing whale addresses, and extracting new trends.
Detection bots and machine learning can be a potent combination, since machine learning can surface Web3 security threats effectively in real time. Here are three ways machine learning can do this, along with some examples to help you get started.
Machine learning can detect both common and emerging threats by studying past data. Specifically, it can extract common characteristics to identify similar behaviors in new data.
For example, ML can detect future phishing activity by analyzing past phishing attacks and transaction flow. The authors of the research paper “Who Are the Phishers?” gathered phishing-related labels from Etherscan and CryptoScamDB and created an ML solution that detects phishing attacks with 92.7% precision and 89.3% recall. An ML model like this can be deployed in a Forta detection bot. Block by block, or for every incoming transaction, the bot can invoke the model to predict whether any addresses are involved in a phishing attack.
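To make the precision and recall figures concrete, here is a minimal sketch of how the two metrics are computed from labels and predictions. The toy labels below are invented for illustration and are not taken from the paper; `1` stands for "phishing" and `0` for "not phishing".

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 = phishing."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing, missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented example: 4 phishing addresses and 6 benign ones.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Precision answers "of the addresses the bot flagged, how many were really phishing?", while recall answers "of the real phishing addresses, how many did the bot flag?". A bot that alerts on every prediction benefits from high precision in particular, since each false positive becomes a false alarm.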
ML can also help programs handle big data efficiently and build more performant ML models. For example, autoencoders are neural networks that can trim superfluous information by reducing high-dimensional data to a more concise representation. They can automate feature extraction, a process that reduces raw data into numerical features. Like a good summary that captures all the main points of a dense book, an autoencoder can preserve key information about the original data. These features can then be used to train machine learning models efficiently.
Feature extraction can be done manually, but doing so tends to be slower and more error-prone. Additionally, autoencoders can surface useful features that would be overlooked during manual feature extraction.
In the context of Web3, autoencoders can be used to automatically transform raw transaction data into features that summarize the characteristics of an address or smart contract. With these features, an ML model can be trained to detect malicious smart contracts. A model like this can significantly improve a bot that detects new smart contracts created by Tornado Cash-funded accounts, and help prevent an attack.
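The idea can be sketched with a very small autoencoder. This is only an illustration under stated assumptions: the input is random synthetic data standing in for per-address statistics, and scikit-learn's `MLPRegressor` is trained to reproduce its own input through a two-unit bottleneck, which is a simple stand-in for a purpose-built autoencoder in a deep learning framework.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for raw per-address statistics: 200 addresses x 6 columns.
# The last three columns are noisy copies of the first three, so the data
# contains redundancy that a bottleneck can compress away.
raw = rng.normal(size=(200, 6))
raw[:, 3:] = raw[:, :3] * 2.0 + rng.normal(scale=0.1, size=(200, 3))

X = StandardScaler().fit_transform(raw)

# Training the network to predict X from X forces the 2-unit hidden layer
# to learn a compact representation of the 6 input columns.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(2,),
    activation="tanh",
    max_iter=2000,
    random_state=0,
)
autoencoder.fit(X, X)


def encode(model, data):
    """Apply only the encoder half: input layer -> tanh bottleneck."""
    return np.tanh(data @ model.coefs_[0] + model.intercepts_[0])


features = encode(autoencoder, X)
print(features.shape)
```

The resulting `features` array has one compact row per address and could be passed to a downstream classifier in place of the raw columns.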
Both phishing and malicious smart contract detection can help prevent wallet users from interacting with a suspicious address or fake smart contract.
Machine learning can surface potential threats with high confidence and cut down on most of the false alarms often seen in heuristic-based systems. It can extract patterns from data more efficiently than traditional methods, separating normal from unusual activity. This is critical because if you receive too many alarms, you won’t have the bandwidth to address them all. As you fall behind, you’ll begin to cherry-pick, find that most alerts are false alarms, and eventually ignore nearly everything. This cycle makes it more likely that you will miss the true positive alerts that can protect you.
Time series analysis is a statistical technique that can extract patterns found in time series data and help predict the future. This technique is commonly used to predict future sales and GDP growth. It can also be used to flag outlier behavior when the prediction diverges from what is actually observed. In Web3, this can be useful for detecting abnormal transaction gas usage and price changes in trading pools. In fact, two Forta community members recently created bots using time series analysis. To learn more about the implementation, please check out their blog post, Time Series Analysis with Forta.
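The "predict, then compare with reality" idea can be sketched in a few lines. This is a deliberately simple stand-in, not the approach used by the community bots mentioned above: the forecast is just a rolling mean of recent values, there is no seasonality model, and the gas figures are invented.

```python
from statistics import mean, stdev


def flag_anomalies(series, window=5, k=3.0):
    """Return indexes whose value is more than k standard deviations
    away from the rolling-mean forecast built from the previous `window`
    observations."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        forecast = mean(history)   # naive one-step-ahead prediction
        spread = stdev(history)    # how noisy the recent past was
        if spread > 0 and abs(series[i] - forecast) > k * spread:
            flagged.append(i)
    return flagged


# Invented gas usage per block (arbitrary units) with one obvious spike.
gas_used = [100, 102, 98, 101, 99, 100, 103, 97, 400, 101, 99]

print(flag_anomalies(gas_used))  # [8]
```

Note that the spike also inflates the rolling statistics for the next few windows, which makes the detector less sensitive right after an anomaly. A production model would typically handle this, along with trend and seasonality, more carefully.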
If the data you are working with has clear seasonal variations and adequate historical data covering several seasons, time series analysis can be very effective. However, not all data fits this pattern. An alternative in this case is an Isolation Forest, another powerful machine learning algorithm that can analyze high-dimensional data and isolate outliers from normal observations. This unsupervised learning technique can help detect fraudulent activity and is often used in e-commerce to spot unusual customer behavior. Isolation Forest can also be used in Web3 to detect unusual activity on the blockchain. For example, the Anomalous Token Transfers Detection Bot uses an Isolation Forest to detect abnormal transactions with ERC20 token transfers. To learn more about its implementation, you can find the code in this GitHub repo.
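A minimal sketch of the technique with scikit-learn's `IsolationForest` follows. The two feature columns and all of the data are assumptions made up for illustration; they are not the features used by the Anomalous Token Transfers Detection Bot.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Hypothetical features per transaction:
#   column 0: log10 of the total token value transferred
#   column 1: number of ERC20 transfers in the transaction
normal_txs = np.column_stack([
    rng.normal(2.0, 0.5, 500),
    rng.normal(3.0, 1.0, 500),
])

# Unsupervised: the model only sees (mostly) normal activity, no labels.
model = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
model.fit(normal_txs)

new_txs = np.array([
    [2.0, 3.0],    # looks like typical activity
    [9.0, 40.0],   # huge value and an unusual number of transfers
])

labels = model.predict(new_txs)            # 1 = normal, -1 = anomaly
scores = model.decision_function(new_txs)  # lower = more anomalous
print(labels, scores)
```

In a bot, a transaction labelled `-1` could trigger an alert, and the score could be attached to the alert so that consumers can choose their own threshold.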
Machine learning can take large amounts of data and condense it, allowing algorithms and other machine learning programs to produce actionable insights more efficiently.
For example, machine learning can cluster and label similar addresses based on transaction activity. It can also identify groups of addresses managed by a malicious user. This means bot developers can create more targeted logic that can ignore or filter for addresses with certain labels. For example, a developer can improve their suspicious high-value transaction detection bot by ignoring whale transactions and focusing on malicious user transactions.
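As a rough sketch of this labeling-then-filtering idea, the example below clusters synthetic addresses with k-means and uses the resulting labels to skip whale transactions. The features, addresses, cluster count, and the `should_alert` helper are all hypothetical, chosen only to illustrate the pattern.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical per-address features:
#   column 0: log10 of average transaction value in USD
#   column 1: log10 of transaction count
regular = np.column_stack([rng.normal(2.0, 0.3, 50), rng.normal(1.0, 0.3, 50)])
whales = np.column_stack([rng.normal(6.0, 0.3, 10), rng.normal(2.0, 0.3, 10)])
X = np.vstack([regular, whales])
addresses = [f"0xaddr{i:03d}" for i in range(len(X))]  # placeholder addresses

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Call the cluster with the larger average transaction value "whale".
whale_cluster = int(np.argmax(kmeans.cluster_centers_[:, 0]))
labels = {
    addr: ("whale" if cluster == whale_cluster else "regular")
    for addr, cluster in zip(addresses, kmeans.labels_)
}


def should_alert(address, value_usd, threshold=1_000_000):
    """Alert on high-value transfers unless the sender is a known whale."""
    return value_usd >= threshold and labels.get(address) != "whale"


print(should_alert("0xaddr000", 2_000_000))  # regular address
print(should_alert("0xaddr059", 2_000_000))  # whale address
```

Real address data rarely separates this cleanly, so the number of clusters and the meaning assigned to each cluster would need to be validated against known labels.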
Several ML models can also be combined in a StackingClassifier to produce actionable insights on malicious blockchain activity. StackingClassifier is scikit-learn’s implementation of stacking, an ensemble learning technique that learns how best to combine predictions from multiple ML models. In Web3, stacking several models that each detect malicious smart contracts, EOAs, or transactions can illuminate various characteristics of the transactions and entities involved.
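Here is a minimal sketch of stacking with scikit-learn. It simplifies the scenario above: the data is synthetic, and both base models are trained on the same feature matrix, whereas separate contract, EOA, and transaction detectors would each have their own inputs. The choice of base models is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary data standing in for "malicious" (1) vs "benign" (0).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    # The final estimator learns how to weigh the base models' predictions.
    final_estimator=LogisticRegression(),
    cv=5,  # base-model predictions for the final estimator come from 5-fold CV
)
stack.fit(X_train, y_train)

accuracy = stack.score(X_test, y_test)
proba = stack.predict_proba(X_test)  # one [benign, malicious] row per sample
print(accuracy, proba.shape)
```

The cross-validated training step matters: the final estimator is fit on out-of-fold predictions, which keeps it from simply trusting whichever base model overfits the training set most.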
These are just a few ways machine learning can help detect Web3 security threats. If you have a model in mind, check out this comprehensive guide on how to use ML in detection bots. The guide shares tips and tricks on how to process model inputs and outputs, produce prediction explainability, and much more. You can also find more ML-related resources, such as training data sources, on Forta’s ML and data science page.