Classifying E-petitions using BERT
Introduction:
The project focuses on training a BERT-based model for text classification of Japanese e-petitions. The goal is to classify petitions into 6 classes predefined by the data source. This kind of classification can help organizations better understand online petitions, which can be crucial for various societal causes in Japan.
Data Acquisition:
To begin the project, data for training the BERT model was required, and since no suitable Japanese dataset was available at the time, I decided to gather one myself. The data was collected through web scraping of the Japanese region of change.org (change.org/ja).
The Python code leverages Selenium, a web automation tool, to interact with the webpage, extract the relevant data, and save it for further processing. Since the default browser on the device was Brave, the scraper was configured to use Brave as the browser behind the web driver. While scraping, the scraper implements a looped scrolling mechanism that dynamically loads more content by clicking the "See More" button; this continues until the button is no longer found or an error occurs.
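A minimal sketch of such a scraper is shown below. The Brave binary path, the button label, and the CSS/XPath selectors are assumptions for illustration only, not the site's actual markup.

import csv
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.binary_location = "/usr/bin/brave-browser"  # assumed install path for Brave
driver = webdriver.Chrome(options=options)
driver.get("https://www.change.org/ja")

# Keep clicking the "See More" button to load more petitions;
# stop once the button can no longer be found or a click fails.
while True:
    try:
        see_more = driver.find_element(By.XPATH, "//button[contains(., 'もっと見る')]")  # button label assumed
        see_more.click()
        time.sleep(2)  # give the page time to load the next batch
    except Exception:
        break

# Collect title, description, and category for each loaded petition card
# (the selectors below are placeholders, not the site's real class names).
rows = [("title", "description", "category")]
for card in driver.find_elements(By.CSS_SELECTOR, ".petition-card"):
    title = card.find_element(By.CSS_SELECTOR, ".petition-title").text
    description = card.find_element(By.CSS_SELECTOR, ".petition-description").text
    category = card.find_element(By.CSS_SELECTOR, ".petition-category").text
    rows.append((title, description, category))

with open("petitions.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)

driver.quit()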
The scraped data consisted of three fields: the title, the description, and the category the petition belonged to. In total, about 2000 petitions were gathered, spread across 6 classes predefined by the website.
Approach and Methods Used:
The first step was to preprocess the data so it was ready for the model to train on. This involved cleaning the dataset by removing duplicates and rows that contained errors from the web scraping.
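As an illustration, a minimal cleaning pass with pandas could look like the following, assuming the scraped data was saved as a CSV with title, description, and category columns:

import pandas as pd

# Load the scraped petitions (column names assumed from the scraping step)
df = pd.read_csv("petitions.csv")

# Drop exact duplicates and rows where scraping produced missing or empty fields
df = df.drop_duplicates()
df = df.dropna(subset=["title", "description", "category"])
df = df[df["description"].str.strip() != ""]

df.to_csv("petitions_clean.csv", index=False)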
The next step was to address the imbalance between the classes by oversampling the underrepresented ones. The original data showed a significant class imbalance, which could have skewed the model's ability to generalize across all categories. To address this, oversampling was applied: samples from the underrepresented classes were duplicated until each class reached around 1000 samples.
This option was chosen because of the limited amount of data that could be scraped from the website, as there were far fewer Japanese online petitions listed than in other regions. After these steps, every class contained around 1000 samples, giving a total of roughly 6000 data points.
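A simple way to implement this kind of oversampling with pandas, assuming the cleaned CSV from the previous step, is sketched below. Each class is sampled with replacement until it reaches the target size, which duplicates rows from the underrepresented classes:

import pandas as pd

df = pd.read_csv("petitions_clean.csv")
TARGET = 1000  # roughly 1000 samples per class after oversampling

balanced = []
for label, group in df.groupby("category"):
    # Sampling with replacement duplicates rows of small classes
    # until each class reaches the target size.
    balanced.append(group.sample(n=TARGET, replace=True, random_state=42))

df_balanced = pd.concat(balanced).sample(frac=1, random_state=42).reset_index(drop=True)
df_balanced.to_csv("petitions_balanced.csv", index=False)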
Model Training:
For model training, I used the BERT Base Japanese model from Hugging Face, equipped with a Japanese-specific tokenizer to handle the unique structure of the language (e.g., kanji and kana). The training configuration included:
• Epochs: 10, to ensure enough iterations for the model to converge.
• Batch Size: 16 per device, chosen to balance computational efficiency and performance.
The same training process was also run on BERT Base Japanese v3, a more modern variant, to compare performance and resource utilization. Both models were trained on the same balanced dataset, following identical preprocessing steps, to ensure a fair comparison.
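A rough sketch of this training setup with the Hugging Face Trainer is shown below. The model identifier (cl-tohoku/bert-base-japanese, with the v3 checkpoint swapped in for the second run), the column names, and the choice to feed the title and description as a text pair are assumptions for illustration; the Japanese tokenizer also needs MeCab-based dependencies such as fugashi installed.

import pandas as pd
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

MODEL_NAME = "cl-tohoku/bert-base-japanese"  # use "cl-tohoku/bert-base-japanese-v3" for the v3 run

df = pd.read_csv("petitions_balanced.csv")
labels = sorted(df["category"].unique())
df["label"] = df["category"].map({name: i for i, name in enumerate(labels)})

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=len(labels))

def tokenize(batch):
    # Title and description are encoded together as a text pair (an assumption here)
    return tokenizer(batch["title"], batch["description"],
                     truncation=True, padding="max_length", max_length=512)

dataset = Dataset.from_pandas(df).train_test_split(test_size=0.1, seed=42)
dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-petitions",
    num_train_epochs=10,              # 10 epochs, as described above
    per_device_train_batch_size=16,   # batch size of 16 per device
    per_device_eval_batch_size=16,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()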
Results: