The following is the content of the paper on "DarkBERT," an AI language model specialized in the dark web domain, presented by S2W at the world's top-tier AI conference, ACL (Association for Computational Linguistics).
*ACL is a premier academic conference in computational linguistics and natural language processing (NLP), where researchers, practitioners, and academics present their latest findings, innovations, and advancements in NLP and related areas.
DarkBERT: A Language Model for the Dark Side of the Internet
We’re excited to talk about our paper DarkBERT: A Language Model for the Dark Side of the Internet, which was accepted to ACL 2023. Developed by the S2W AI Team in collaboration with KAIST, DarkBERT is the latest addition to S2W’s dark web research.
1. What is DarkBERT?
DarkBERT is a language model trained by S2W on its vast collection of Dark Web data. While other similarly constructed encoder language models struggle with the extreme lexical and structural diversity of Dark Web language, DarkBERT has been specifically trained to comprehend the illicit content of the Dark Web. DarkBERT is built by further training the RoBERTa model with masked language modeling (MLM) on texts collected from the Dark Web.
Encoder models map natural language text to semantic representation vectors, which can be used for a multitude of tasks. DarkBERT is unique in that it was trained on dark web data, which allows it to outperform counterpart models on tasks that involve monitoring or interpreting dark web content.
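To make this concrete, here is a minimal sketch of how an encoder model turns a page of text into a representation vector. It uses the Hugging Face transformers API with the public roberta-base checkpoint as a stand-in (DarkBERT itself is available on request, as noted below), and the pooling choice is an illustrative assumption rather than the exact setup from our experiments.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# roberta-base is a stand-in here; DarkBERT itself is available on request.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

text = "Example dark web page text to encode."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# Use the hidden state at the <s> position (RoBERTa's [CLS] equivalent) as a page-level vector.
page_vector = outputs.last_hidden_state[:, 0, :]  # shape: (1, 768)
```

Vectors like this one can then be fed to the downstream classifiers described in Section 4.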
Corpus collection is a fundamental challenge in training DarkBERT. S2W, renowned for its capability to collect and analyze Dark Web data, amassed a large dark web text corpus fit for training. The quality of the corpus was refined by removing pages that were redundant, duplicated, or low in information density. Even after filtering, we had a sizable corpus of 5.83 GB.
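As an illustrative sketch of that filtering step, duplicates and low-information pages can be dropped as below; the hashing scheme and length threshold are assumptions for demonstration, not the exact rules used to build the DarkBERT corpus.

```python
import hashlib

def filter_corpus(pages, min_tokens=64):
    """Drop exact duplicate pages and pages with very little text.

    The min_tokens threshold is an assumed value, not the one used for DarkBERT.
    """
    seen_hashes = set()
    kept = []
    for page in pages:
        text = page.strip()
        if len(text.split()) < min_tokens:  # low information density
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # exact duplicate of a page we already kept
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept
```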
2. Background
The dark web is a part of the internet that can only be reached through specific network protocols (such as Tor). These protocols make the dark web anonymous, difficult to access, and difficult to control. Cybercriminals often exploit these characteristics to host underground markets and share illegal content; the same characteristics also pose challenges for efforts to monitor the dark web.
S2W has a long history of monitoring and researching the dark web, publishing insights on dark web phishing (Doppelgängers on the Dark Web: A Large-scale Assessment on Phishing Hidden Web Services) and dark web language (Shedding New Light on the Language of the Dark Web). You can also find updates on notable dark web incidents on this blog.
3. Training Process of the DarkBERT Model
Pretrained language models (PLMs) have proven very powerful, but their effectiveness on the dark web is questionable. After all, the language of the dark web differs considerably from that of the surface web. Can BERT, trained on the surface web, really be expected to understand dark web language? What if we instead trained a BERT-like transformer model on the dark web domain?
A critical challenge in creating a PLM is obtaining the training corpus. The dark web is notoriously difficult to capture, but S2W’s collection capabilities allowed us to gather a sizable collection of dark web text. From our previous research on dark web language, we knew that parts of the data could be unsuitable for training. Therefore, we filtered the corpus by removing low-information pages, balancing pages across categories, and deduplicating pages. We also applied preprocessing to anonymize common identifiers and potentially sensitive information. In the end, we had a 5.83 GB unprocessed corpus and a 5.20 GB processed corpus.
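The anonymization step can be pictured with a small sketch like the one below; the identifier patterns and mask tokens are illustrative assumptions, not the exact preprocessing rules applied to the DarkBERT corpus.

```python
import re

# Assumed identifier patterns and placeholder tokens, for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "URL": re.compile(r"\bhttps?://\S+|\b[\w-]+\.onion\S*"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "BTC": re.compile(r"\b(?:[13][a-km-zA-HJ-NP-Z1-9]{25,34}|bc1[a-z0-9]{20,})\b"),
}

def anonymize(text: str) -> str:
    """Replace potentially sensitive identifiers with placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(anonymize("Contact admin@example.onion for access."))
# -> "Contact <EMAIL> for access."
```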
DarkBERT training starts from the RoBERTa base model, which was itself trained on more data and for longer than BERT. We follow RoBERTa’s hyperparameters and, like RoBERTa, train on the MLM task. We monitored the loss and stopped training at around 20K steps.
In total, training DarkBERT took approximately 15 days on 8 NVIDIA A100 GPUs. The model is available on request, so you don’t have to train it yourself.
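A minimal sketch of this further-pretraining step is shown below, assuming the Hugging Face Trainer API; the corpus file name, batch size, and learning rate are placeholders rather than the paper's exact settings, and only the roughly 20K-step budget comes from the description above.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Start from the RoBERTa base checkpoint and continue MLM pretraining on dark web text.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# "darkweb_corpus.txt" is a placeholder for the preprocessed corpus.
dataset = load_dataset("text", data_files={"train": "darkweb_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="darkbert-mlm",
    max_steps=20_000,                # training was stopped at around 20K steps
    per_device_train_batch_size=16,  # placeholder value
    learning_rate=1e-4,              # placeholder value
    logging_steps=500,
)

Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```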
4. How Do We Use DarkBERT?
1) Dark Web Page Classification: The Dark Web is home to numerous pages of explicit content dedicated to different types of cybercrime. Automatically classifying pages based on their content is invaluable for timely Dark Web intelligence. DarkBERT achieves state-of-the-art performance on the dark web page classification task, which aims to classify webpage content into topics such as Pornography, Hacking, and Violence (a minimal fine-tuning sketch follows this list). Our page classification schema is described in Shedding New Light on the Language of the Dark Web.
2) Ransomware Leak Site Detection: Ransomware-operating cybercriminals often run “leak sites” to publish confidential data of uncooperative victim companies. Finding these websites quickly is crucial for gathering intelligence on high-profile ransomware groups. DarkBERT achieved state-of-the-art performance in automatically detecting leak sites.
3) Noteworthy Thread Detection: Underground forums serve as platforms for sharing and selling information related to various illegal activities. Monitoring forums is challenging because countless users can post on any topic. Filtering posts to find noteworthy threads (such as those selling or sharing confidential information or malicious hacking tools) is essential for effective monitoring. DarkBERT achieved state-of-the-art performance in automatically detecting noteworthy forum threads.
4) Threat Keyword Inference: Some familiar words take on a completely different meaning on the dark web. DarkBERT is trained to understand the slang and explicit language used by cybercriminals, allowing us to interpret word usage in dark web contexts.
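The first three use cases are classification tasks, so each can be approached by putting a classification head on the encoder and fine-tuning it on labeled dark web data. The sketch below illustrates this for page classification, using the public roberta-base checkpoint as a stand-in for DarkBERT; the abbreviated label set, tiny training data, and hyperparameters are purely illustrative assumptions.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    RobertaForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Abbreviated, illustrative label set; the full schema is described in
# "Shedding New Light on the Language of the Dark Web".
labels = ["Pornography", "Hacking", "Violence", "Others"]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # stand-in for DarkBERT
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(labels)
)

# Tiny illustrative dataset; real training uses labeled dark web pages.
train = Dataset.from_dict({
    "text": [
        "selling zero-day exploit kits and ransomware builders",
        "graphic depictions of violence ...",
    ],
    "label": [1, 2],  # Hacking, Violence
}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding=True, max_length=512),
    batched=True,
)

args = TrainingArguments(output_dir="darkbert-page-cls", num_train_epochs=3)
Trainer(model=model, args=args, train_dataset=train).train()
```

Threat keyword inference, by contrast, works directly with the MLM head: masking a word in a dark web sentence and inspecting the model's top fill-in candidates reveals how that word is actually used in dark web contexts.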
✅ Read the tech blog in detail: https://bit.ly/3WI56Vk
If you have any questions, please don't hesitate to contact us!