DarkBERT: AI Tool to Fight Cybersecurity Risks from the Dark Web
Over the last few months, large language models (LLMs) have become increasingly popular, particularly with the rise of AI chatbots like ChatGPT. Generative AI models of this kind create new content, including text, images, and audio, by learning patterns from existing data. Although such tools have mostly been used for content generation, researchers have now built a language model intended to detect and counter cybersecurity risks. Notably, this model was trained exclusively on data from the dark web.
What is DarkBERT?
DarkBERT is a transformer-based encoder model built on the RoBERTa architecture. Instead of training it on the ordinary (surface) web, the researchers trained it on a large dataset of dark web pages, drawn from places like hacker forums, scam sites, and other criminal online sources. In the yet-to-be-peer-reviewed paper “DarkBERT: A Language Model for the Dark Side of the Internet,” published on arXiv, its authors say DarkBERT can revolutionize the fight against cybercrime by finding and analyzing hard-to-detect parts of the Internet that remain hidden from search engines.
Although the dark web is not indexed by search engines and is generally out of reach for the general public, the researchers used the Tor network to access and collect data from its pages. The data then went through several steps, including deduplication, category balancing, and pre-processing, to produce a refined dark web dataset. That dataset was finally fed to RoBERTa for training, resulting in DarkBERT over a period of 15 days.
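The article does not say how the deduplication and category-balancing steps were carried out, so the following is only a minimal sketch of what such steps could look like. It assumes each crawled page is a dictionary with hypothetical "text" and "category" fields, treats pages as duplicates when their whitespace- and case-normalized text is identical, and balances categories by capping each one at a fixed size. None of these choices are taken from the paper.

```python
import hashlib
import random
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def deduplicate(pages):
    """Keep the first page for each distinct normalized text (exact-match assumption)."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha256(normalize(page["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


def balance_categories(pages, max_per_category, seed=0):
    """Cap every category at max_per_category pages by random downsampling (assumed strategy)."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for page in pages:
        by_category[page["category"]].append(page)

    balanced = []
    for category in sorted(by_category):
        group = by_category[category]
        if len(group) > max_per_category:
            group = rng.sample(group, max_per_category)
        balanced.extend(group)
    return balanced
```

Running deduplicate first and balance_categories second mirrors the order in which the article lists the steps. A real pipeline would likely need near-duplicate detection as well, since mirrored dark web pages often differ in small details.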
Cybersecurity Applications
Since DarkBERT is trained on a dataset of dark web pages, it has potential for a wide range of cybersecurity applications. It can help monitor illegal activity and strengthen cybersecurity measures. It can also “combat the extreme lexical and structural diversity of the dark web, which can be detrimental to building an accurate representation of the domain,” according to the research paper.
It can automate the monitoring of dark web forums where illicit information is usually shared. DarkBERT can also detect websites involved in leaking sensitive or confidential information and in selling ransomware.
Finally, it uses the fill-mask function of BERT-family language models to identify keywords related to criminal activity, which can help identify and counter emerging cyber threats.
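To make the masked-word idea concrete, here is a small sketch using the Hugging Face transformers fill-mask pipeline. It is an illustration, not the authors' code: the helper names load_fill_mask and related_keywords are hypothetical, and the publicly available roberta-base checkpoint stands in for DarkBERT as an assumption, so its suggestions reflect ordinary web text rather than dark web language.

```python
def load_fill_mask(model_name="roberta-base"):
    """Build a fill-mask pipeline; roberta-base is a stand-in, not DarkBERT itself."""
    # Imported lazily so related_keywords() has no hard dependency on transformers.
    from transformers import pipeline
    return pipeline("fill-mask", model=model_name)


def related_keywords(fill_mask, template, top_k=5):
    """Return the words the model proposes for the masked slot in `template`.

    `fill_mask` is any callable that behaves like a fill-mask pipeline: it takes
    the text plus a `top_k` keyword and returns a list of dictionaries that each
    contain a "token_str" entry.
    """
    predictions = fill_mask(template, top_k=top_k)
    # RoBERTa-style tokens often carry a leading space, so strip it.
    return [prediction["token_str"].strip() for prediction in predictions]
```

With transformers and a backend such as PyTorch installed, calling related_keywords(load_fill_mask(), "The forum post offers <mask> for sale.") would list the model's top guesses for the masked word. RoBERTa-style models use <mask> as the placeholder, while other BERT-family models may use a different mask token.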