Discovering Scammer Networks with Machine Learning

Article by Forta Network, May 16, 2023

Community Spotlight: This article was guest authored by Forta community member and Senior Data Scientist, Carlos Salort.

Forta is a real-time detection network for security monitoring of blockchain activity. The decentralized Forta Network scans all transactions and block-by-block state changes, leveraging machine learning to detect threats and anomalies on DeFi, NFTs, bridges, governance and other Web3 systems. When an issue is detected, subscribers are alerted to the potential risk, which enables them to take action.


Machine learning, and deep learning in particular, is steadily gaining relevance as a tool for threat detection. Over the past year, the Forta community has been rolling out machine learning bots and resources that help developers build their own machine-learning-based models.

In this blog post we present the latest bot developed by a member of the Forta community. The main idea behind this bot is to use the transfers made between addresses to find addresses that belong to scammers but have not yet been flagged as such. The addresses already labeled by the Forta community serve as the starting point for the propagation, so new addresses can be flagged before they are involved in malicious activity. We will explain this approach in general terms, without going too deep into the technical details.

The scammer label propagation bot is based on two techniques: graph neural networks, a type of deep learning model that uses the additional information found in graphs to improve its predictive capabilities; and semi-supervised learning, a type of machine learning used when not every training observation has a label.

Graph Neural Networks

To use graph neural networks, we first need to generate a graph. This bot generates a graph around a known scammer. That is, once a scammer is labeled by the Forta Network with a high level of confidence, the bot tries to find all the other addresses that potentially belong to the same scammer.

Once the scammer has been labeled, the first step is to obtain all the addresses that have transacted with the central scammer. For the first version of this bot, a transaction only covers direct ETH transfers and ERC20 token transfers. For each of the addresses found, we collect a summary of its complete transaction history. This includes the number of transactions in and out, and the average, maximum and total value, for both ETH and ERC20 tokens. Each address becomes a node in our graph, and its summary is what we use as node features.
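The node-building step can be illustrated with a small, self-contained sketch. Everything here is an assumption made for illustration: the `(sender, receiver, asset, value)` record format, the helper names `neighbours`, `summarise` and `node_features`, and the exact ordering of the 16 values are hypothetical and are not taken from the bot's code.

```python
from collections import defaultdict

# Hypothetical transfer records: (sender, receiver, asset, value).
transfers = [
    ("0xscam", "0xa", "ETH", 1.0),
    ("0xa", "0xscam", "ETH", 3.0),
    ("0xb", "0xscam", "ERC20", 50.0),
    ("0xa", "0xb", "ETH", 2.0),
    ("0xc", "0xd", "ETH", 9.0),  # never touches the scammer directly
]

def neighbours(transfers, center):
    """Addresses that transacted directly with `center`, plus `center` itself."""
    nodes = {center}
    for src, dst, _asset, _value in transfers:
        if src == center:
            nodes.add(dst)
        elif dst == center:
            nodes.add(src)
    return nodes

def summarise(values):
    """Count, total, average and max of a list of transfer values."""
    if not values:
        return [0, 0.0, 0.0, 0.0]
    return [len(values), sum(values), sum(values) / len(values), max(values)]

def node_features(transfers, nodes):
    """One 16-value vector per node: (in, out) x (ETH, ERC20) x 4 statistics."""
    buckets = defaultdict(list)
    for src, dst, asset, value in transfers:
        buckets[(src, "out", asset)].append(value)
        buckets[(dst, "in", asset)].append(value)
    features = {}
    for node in sorted(nodes):
        vec = []
        for direction in ("in", "out"):
            for asset in ("ETH", "ERC20"):
                vec += summarise(buckets[(node, direction, asset)])
        features[node] = vec
    return features

nodes = neighbours(transfers, "0xscam")
features = node_features(transfers, nodes)
```

Note that the summary for each node is computed over all of its transfers in the input, not only those with the scammer, which mirrors the "complete transaction history" described above.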

After compiling the complete list of nodes, we need to obtain the edges between them. We use a directed graph, that is, the origin and destination of each edge are relevant. This matters because it tells us who is sending a transaction and who is receiving it. We collect all the ETH and ERC20 transactions between any two nodes in our graph. Every node is connected to the central scammer, but there will be additional transactions between the other nodes. As with the nodes, we also collect some information about each edge. Instead of summarizing the whole history of an address, we summarize only the transactions that belong to that particular edge. For example, if there is an edge going from node A to node D, we collect the number of transactions and the total, average and maximum value sent from A to D. We call this data edge features.
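A minimal sketch of the edge step follows, again under assumptions: the record format, the function name `edge_features`, and the layout of the 8 values per edge (four statistics for ETH followed by four for ERC20) are illustrative choices, not the bot's actual implementation. The output uses the two-row `edge_index` layout that PyTorch Geometric expects, with sources in the first row and destinations in the second.

```python
from collections import defaultdict

# Hypothetical transfers as (sender, receiver, asset, value) tuples.
# Only the assets "ETH" and "ERC20" are handled in this sketch.
transfers = [
    ("A", "D", "ETH", 1.0),
    ("A", "D", "ETH", 3.0),
    ("A", "D", "ERC20", 10.0),
    ("D", "A", "ETH", 0.5),
    ("A", "Z", "ETH", 7.0),  # Z is not a node in the graph, so this is skipped
]
nodes = ["A", "D"]

def edge_features(transfers, nodes):
    """Aggregate transfers into directed edges with per-edge statistics."""
    index = {address: i for i, address in enumerate(nodes)}
    grouped = defaultdict(lambda: {"ETH": [], "ERC20": []})
    for src, dst, asset, value in transfers:
        if src in index and dst in index:
            grouped[(src, dst)][asset].append(value)

    edge_index = [[], []]  # row 0: source ids, row 1: destination ids
    edge_attr = []
    for (src, dst), by_asset in sorted(grouped.items()):
        edge_index[0].append(index[src])
        edge_index[1].append(index[dst])
        row = []
        for asset in ("ETH", "ERC20"):
            values = by_asset[asset]
            n = len(values)
            total = sum(values)
            # count, total, average, max for this asset on this edge
            row += [n, total, total / n if n else 0.0, max(values, default=0.0)]
        edge_attr.append(row)
    return edge_index, edge_attr

edge_index, edge_attr = edge_features(transfers, nodes)
```

Because the graph is directed, A to D and D to A are two separate edges with their own feature rows.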

We use the node features and the edge features as the input for our model. For this application, we use a custom graph neural network. The model consists of two layers of TransformerConvolution and two dense layers. It is implemented in Python, using PyTorch, with PyTorch Geometric for the graph layers. The architecture of the model looks as follows:

We have now prepared the input data and the model that will make the predictions. But we are still missing a critical component: the labels of the addresses. Without them, the model can’t learn to differentiate between victims and attackers.

Semi-supervised learning

Once we have the list of nodes and edges, we will use the labels from the Forta Network to train the model. These labels are generated by other bots in the network, and they can be obtained using an API. After querying the labels, the graph will look like this:

We can see that five of the nodes have labels. In a real graph, the share of labeled nodes would be much lower. This presents a problem: our model can’t be trained using standard supervised learning, which relies on training the model on a fully labeled dataset and then letting it predict on previously unseen data.

To get around this problem, we use semi-supervised learning. In this case, we train the model using the small subset of labels that we have. With the information gained from that data, the model then makes predictions for the remaining nodes of the graph, effectively transferring what it learned about the relations between known nodes to the rest.
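In practice, this kind of training is often implemented with a mask: the forward pass covers every node, but the loss is computed only on the labeled ones. The sketch below shows that pattern with placeholder data. To keep it dependent on PyTorch alone, a small dense network stands in for the graph neural network; the label encoding (`-1` for unlabeled, `0` for victim, `1` for attacker), the variable names and the hyperparameters are all assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

num_nodes, num_features = 10, 16
x = torch.randn(num_nodes, num_features)  # placeholder node features

labels = torch.full((num_nodes,), -1)     # -1 marks "no label"
labels[[0, 3, 9]] = 1                     # known attackers
labels[[1, 5]] = 0                        # known victims
train_mask = labels >= 0                  # True only for labeled nodes

# Stand-in classifier; a GNN would also receive edge_index and edge_attr.
model = nn.Sequential(nn.Linear(num_features, 8), nn.ReLU(), nn.Linear(8, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for _ in range(100):
    optimizer.zero_grad()
    logits = model(x)  # forward pass over every node
    # The loss only sees the labeled subset of nodes.
    loss = loss_fn(logits[train_mask], labels[train_mask])
    loss.backward()
    optimizer.step()

with torch.no_grad():
    predictions = model(x).argmax(dim=1)

# Predictions for the nodes that had no label during training.
new_predictions = predictions[~train_mask]
```

With a graph neural network in place of the dense stand-in, the unlabeled nodes would still take part in message passing during training, which is how information about labeled neighbors reaches them.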

After the training process, the model is able to generate predictions. The final graph looks like the following image, where the nodes with dashed outlines are the predictions. This model would mark two new attackers: node E (because of its similarity to node J, a labeled attacker) and node H (similar to D).

Conclusion

In this post we saw how more advanced deep learning algorithms can help defend against threats, even before they happen, just by analyzing the relations between addresses. The code for the model and everything else discussed in this post is available on GitHub, and a live beta version of the bot is also available.