Classifying E-petitions using BERT


Introduction:

The project focuses on training a BERT-based model to classify Japanese e-petitions into six categories predefined by the data source. Automating this classification can help organizations understand the landscape of online petitions, which can be crucial for various societal causes in Japan.



Data Acquisition

To begin the project, training data for the BERT model was needed, and since no suitable Japanese dataset was available at the time, I decided to gather it myself. The data was collected by web scraping change.org/ja and the Japanese regional pages of change.org.

The Python scraper leverages Selenium, a web automation tool, to interact with the webpage, extract the relevant data, and save it for further processing. Since the default browser on the device was Brave, the scraper was designed to drive Brave as its web driver. To load more content dynamically, the scraper implements a looped scrolling mechanism that repeatedly clicks a "See More" button; the process continues until the button can no longer be found or an error occurs, as in the sketch below.
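
For illustration, here is a minimal sketch of that scrolling loop. The Brave binary path and the CSS selectors (button.load-more, div.petition-card) are placeholders of my own; the real change.org markup differs.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.common.exceptions import (NoSuchElementException,
                                        ElementClickInterceptedException)
import time

options = Options()
# Brave is Chromium-based, so chromedriver can drive it by pointing the
# binary location at the Brave executable (path is an example).
options.binary_location = "/usr/bin/brave-browser"
driver = webdriver.Chrome(options=options)
driver.get("https://www.change.org/ja")

while True:
    try:
        # Placeholder selector for the "See More" button.
        see_more = driver.find_element(By.CSS_SELECTOR, "button.load-more")
        see_more.click()
        time.sleep(2)  # let the newly loaded petitions render
    except (NoSuchElementException, ElementClickInterceptedException):
        break  # stop once the button is gone or clicking fails

# Placeholder selector for the petition cards loaded so far.
cards = driver.find_elements(By.CSS_SELECTOR, "div.petition-card")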

Each scraped petition consisted of three parts: the title, the description, and the category the petition belonged to. In total, about 2,000 petitions were gathered, spread across the six classes predefined by the website source.
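
The records were saved in a simple tabular form; a sketch of how the rows might be written out (the field names are my own, not from the original scraper):

import csv

rows = [
    {"title": "...", "description": "...", "category": "..."},
    # ... one dict per scraped petition
]
with open("petitions.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "description", "category"])
    writer.writeheader()
    writer.writerows(rows)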



Approach and Methods Used:

The first step was to preprocess the data so it was ready for the model to train on. This meant cleaning the dataset: eliminating duplicates and dropping rows that came out malformed during scraping, as in the sketch below.
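
A minimal cleaning sketch, assuming the scraped data lives in a CSV with title, description, and category columns (the file and column names are mine):

import pandas as pd

df = pd.read_csv("petitions.csv")
df = df.drop_duplicates()  # remove petitions scraped more than once
df = df.dropna(subset=["title", "description", "category"])  # drop malformed rows
df.to_csv("petitions_clean.csv", index=False)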

The next step was to account for the imbalance between the classes by oversampling the underrepresented ones. The original data showed a significant class imbalance, which could have skewed the model's ability to generalize across all categories. To address this, the underrepresented classes were oversampled: their samples were duplicated until each class reached around 1,000 examples.

Oversampling was chosen because little additional data could be scraped: the website lists far fewer Japanese online petitions than it does for other regions. With these steps, every class was brought to around 1,000 points, for a total of roughly 6,000 data points; a minimal balancing sketch follows.
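
One way to implement this balancing, assuming the cleaned CSV from the previous step; sampling with replacement duplicates rows in any class below the 1,000 target:

import pandas as pd

df = pd.read_csv("petitions_clean.csv")
TARGET = 1000
balanced = pd.concat(
    [grp.sample(n=TARGET, replace=len(grp) < TARGET, random_state=42)
     for _, grp in df.groupby("category")],
    ignore_index=True,
)
balanced.to_csv("petitions_balanced.csv", index=False)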


Model Training:

For model training, I used the BERT Base Japanese model from Hugging Face, equipped with a Japanese-specific tokenizer that handles the unique structure of the language (e.g., kanji, kana). The training configuration included the following (a minimal training sketch appears after the list):


Epochs: 10, to ensure enough iterations for the model to converge.

Batch Size: 16 per device, chosen to balance computational efficiency and performance.
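
A minimal training sketch under these settings, using the cl-tohoku checkpoints on Hugging Face; the CSV path, column names, and text preprocessing are my own assumptions, not the post's exact code:

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# "cl-tohoku/bert-base-japanese" is the original checkpoint; swap in
# "cl-tohoku/bert-base-japanese-v3" for the newer variant. The Japanese
# tokenizer also requires the fugashi and unidic-lite packages.
model_name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

# Assumes a CSV with a "text" column (e.g., title + description) and an
# integer "label" column encoding the six categories.
ds = load_dataset("csv", data_files="petitions_balanced.csv")["train"]
ds = ds.train_test_split(test_size=0.1, seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

ds = ds.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="petition-classifier",
    num_train_epochs=10,             # epochs used in this project
    per_device_train_batch_size=16,  # batch size used in this project
    evaluation_strategy="epoch",
)
trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["test"],
                  tokenizer=tokenizer)
trainer.train()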


The same training process was also run on BERT Base Japanese v3, a more modern variant, to assess the difference in performance and resource utilization. Both models were trained on the same balanced dataset, following identical preprocessing steps to ensure a fair comparison.


Results:


The results of the two models are shown in the graphs above. Comparing them, both models performed very well, reaching around 90-92% accuracy and converging by roughly the 4th epoch. While accuracy was similar, the two models differed sharply in computational cost: the newer v3 model was estimated to require about three times the training time and resources of its predecessor. For this e-petition classification task, the newer model therefore offered no practical benefit that justified the extra cost.
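
For reference, a sketch of how per-epoch accuracy could be computed with the Trainer from the training sketch above; the original evaluation code is not shown in the post, so this is only one plausible setup:

import numpy as np
from sklearn.metrics import accuracy_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds)}

# Passed to the Trainer via Trainer(..., compute_metrics=compute_metrics),
# this reports accuracy at the end of every epoch.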
