Developing a Free and Reliable Google Scholar Data Collection Framework with CAPTCHA Bypass
This study develops a framework for scraping Google Scholar from user-supplied search queries and advanced search parameters. It addresses two long-standing challenges of web scraping. First, most sites implement anti-bot measures such as the Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA), which blocks automated systems from using the website. Second, existing publicly available solutions to CAPTCHAs in web scraping are either unreliable or expensive. The proposed framework is built on the Python package Selenium and lets the user solve CAPTCHAs manually when they appear, yielding a free and reliable way to collect data from Google Scholar. It also exports the data in an easily usable format. The search results are extensive and include information on journals, articles, and authors, as well as publication statistics and citation data. This work contributes to the rapidly growing practice of data extraction.
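The workflow summarized in the abstract (build a search URL, load it in a Selenium-driven browser, pause for a person when a CAPTCHA appears, export the results) might look roughly like the following minimal sketch. It is an illustration under stated assumptions, not the framework's actual code: the function names, the CAPTCHA text markers, the `h3.gs_rt` selector, and the CSV column names are all hypothetical, and Google Scholar's page markup may change at any time.

```python
import csv
import io
from urllib.parse import urlencode

SCHOLAR_URL = "https://scholar.google.com/scholar"

# Assumed text markers of a Google challenge page; a heuristic, not an official API.
CAPTCHA_MARKERS = ("not a robot", "unusual traffic")


def build_search_url(query, year_from=None, year_to=None, start=0):
    """Build a results URL from a query and an optional publication-year range."""
    params = {"q": query, "hl": "en"}
    if year_from is not None:
        params["as_ylo"] = year_from  # earliest publication year
    if year_to is not None:
        params["as_yhi"] = year_to  # latest publication year
    if start:
        params["start"] = start  # offset of the first result (pagination)
    return f"{SCHOLAR_URL}?{urlencode(params)}"


def looks_like_captcha(page_source):
    """Return True if the page text contains one of the assumed CAPTCHA markers."""
    text = page_source.lower()
    return any(marker in text for marker in CAPTCHA_MARKERS)


def fetch_page(driver, url, wait_for_human=input):
    """Load the URL; while a CAPTCHA is shown, block until a person has solved it."""
    driver.get(url)
    while looks_like_captcha(driver.page_source):
        wait_for_human("CAPTCHA detected. Solve it in the browser, then press Enter: ")
    return driver.page_source


def rows_to_csv(rows, fieldnames=("title", "authors", "year", "cited_by")):
    """Serialize result dictionaries to CSV text; missing fields are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def collect_titles(query, **search_options):
    """End-to-end run with a real browser (requires `pip install selenium`)."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # visible window, so a person can solve CAPTCHAs
    try:
        fetch_page(driver, build_search_url(query, **search_options))
        # "h3.gs_rt" is an assumed selector for result titles; it may change.
        headings = driver.find_elements(By.CSS_SELECTOR, "h3.gs_rt")
        return [{"title": heading.text} for heading in headings]
    finally:
        driver.quit()
```

Under these assumptions, a call such as `rows_to_csv(collect_titles("web scraping", year_from=2015))` would open a browser window, wait at the terminal prompt whenever a challenge page is detected, and return the collected titles as CSV text. Keeping the browser visible rather than headless is what makes the manual CAPTCHA step possible without a paid solving service.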
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.