Developing a Free and Reliable Google Scholar Data Collection Framework with CAPTCHA Bypass

SHREYAN DEY; Alireza Ermagun

doi:10.13021/jssr2023.3879

Authors

SHREYAN DEY, Department of Geographic and Geoinformation Science, George Mason University, Fairfax, VA
Alireza Ermagun, Department of Geographic and Geoinformation Science, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2023.3879

Abstract

This study develops a framework for scraping Google Scholar from user-supplied search queries and advanced search parameters. It addresses two long-standing challenges of web scraping. First, most sites implement anti-bot measures that present a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA), which blocks automated systems from using the site. Second, all publicly available solutions to CAPTCHAs in web scraping are either unreliable or expensive. The proposed framework is built on the Python package Selenium, which drives a real browser so that a user can solve CAPTCHAs manually when they appear. The result is a free and reliable way to collect data from Google Scholar. The framework also exports the data in an easily usable format. The search output is extensive and includes information on journals, articles, and authors, as well as publication statistics and citation data. This work contributes to the rapidly growing practice of data extraction.
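
The workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The URL parameters (`as_ylo`, `as_yhi`, `start`), the CSS selectors (`div.gs_ri`, `h3.gs_rt`, `div.gs_a`), the CAPTCHA markers, and every function name are assumptions made for illustration, and Google Scholar's markup may differ or change. The sketch builds a search URL from a query and a year range, pauses for a human to solve a CAPTCHA in the Selenium-driven browser window, and writes the results as CSV.

```python
import csv
import io
from urllib.parse import urlencode

# Assumed search endpoint and CAPTCHA markers; neither comes from the paper.
SCHOLAR_URL = "https://scholar.google.com/scholar"
CAPTCHA_MARKERS = ("gs_captcha", "unusual traffic")


def build_search_url(query, year_low=None, year_high=None, start=0):
    """Build a search URL from a query and an optional publication-year range."""
    params = {"q": query, "hl": "en", "start": start}
    if year_low is not None:
        params["as_ylo"] = year_low   # assumed "published since" parameter
    if year_high is not None:
        params["as_yhi"] = year_high  # assumed "published until" parameter
    return SCHOLAR_URL + "?" + urlencode(params)


def looks_like_captcha(page_source):
    """Heuristic check: does the page contain any assumed CAPTCHA marker?"""
    text = page_source.lower()
    return any(marker in text for marker in CAPTCHA_MARKERS)


def rows_to_csv(rows, fieldnames=("title", "byline")):
    """Serialize a list of result dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def scrape_page(url):
    """Open one results page in Chrome and return its title and byline rows."""
    # Imported lazily so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        if looks_like_captcha(driver.page_source):
            # Manual bypass: a person solves the challenge in the open window.
            input("Solve the CAPTCHA in the browser window, then press Enter...")
        rows = []
        for result in driver.find_elements(By.CSS_SELECTOR, "div.gs_ri"):
            rows.append({
                "title": result.find_element(By.CSS_SELECTOR, "h3.gs_rt").text,
                "byline": result.find_element(By.CSS_SELECTOR, "div.gs_a").text,
            })
        return rows
    finally:
        driver.quit()
```

Under these assumptions, `rows_to_csv(scrape_page(build_search_url("urban mobility", 2015, 2020)))` would return one page of results as CSV text. Pausing for a human is what keeps the approach free: no paid solving service is involved, at the cost of requiring someone to be present while the scraper runs.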

