Developing a Free and Reliable Google Scholar Data Collection Framework with CAPTCHA Bypass

Authors

  • Shreyan Dey, Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA
  • Alireza Ermagun, Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA

DOI:

https://doi.org/10.13021/jssr2023.3879

Abstract

This study develops a framework for scraping Google Scholar from user-supplied search queries and advanced search parameters. It addresses two long-standing challenges of web scraping. First, most sites deploy anti-bot measures that trigger Completely Automated Public Turing tests to tell Computers and Humans Apart (CAPTCHAs), which block automated access to the website. Second, all publicly available solutions for handling CAPTCHAs during web scraping are either unreliable or expensive. The proposed framework uses the Python package Selenium to drive a browser, so a user can solve a CAPTCHA manually whenever one appears; the result is a free and reliable way to collect data from Google Scholar. It also exports the data in an easily usable format. The search output is extensive and includes information on journals, articles, and authors, as well as publication statistics and citation data. This work contributes to the rapidly growing practice of data extraction.
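The workflow described above (query in, manual CAPTCHA pause, export out) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the CAPTCHA marker strings, the CSS selectors `div.gs_ri`, `h3.gs_rt`, and `div.gs_a`, and the output file name are all assumptions made for illustration, and the page markup may differ from what is shown here. The year-range parameters `as_ylo` and `as_yhi` are used as commonly seen in Google Scholar search URLs.

```python
import csv
from urllib.parse import urlencode

SCHOLAR_URL = "https://scholar.google.com/scholar"
# Assumed markers of a CAPTCHA page; adjust to whatever the live page shows.
CAPTCHA_MARKERS = ("gs_captcha", "recaptcha", "not a robot")


def build_search_url(query, year_from=None, year_to=None, start=0):
    """Build a search URL from a query and an optional publication-year range."""
    params = {"q": query, "hl": "en", "start": start}
    if year_from is not None:
        params["as_ylo"] = year_from
    if year_to is not None:
        params["as_yhi"] = year_to
    return f"{SCHOLAR_URL}?{urlencode(params)}"


def looks_like_captcha(page_source):
    """Heuristic check: does the page text contain any assumed CAPTCHA marker?"""
    text = page_source.lower()
    return any(marker in text for marker in CAPTCHA_MARKERS)


def load_page(driver, url, wait_for_human=input):
    """Open the URL; while a CAPTCHA is shown, pause until a person solves it."""
    driver.get(url)
    while looks_like_captcha(driver.page_source):
        wait_for_human("CAPTCHA detected. Solve it in the browser, then press Enter: ")
    return driver.page_source


def extract_results(driver):
    """Read title and byline from each result block (selectors are assumptions)."""
    from selenium.webdriver.common.by import By  # needs `pip install selenium`

    rows = []
    for item in driver.find_elements(By.CSS_SELECTOR, "div.gs_ri"):
        title = item.find_element(By.CSS_SELECTOR, "h3.gs_rt").text
        byline = item.find_element(By.CSS_SELECTOR, "div.gs_a").text
        rows.append({"title": title, "byline": byline})
    return rows


def export_csv(rows, path):
    """Write the collected rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["title", "byline"])
        writer.writeheader()
        writer.writerows(rows)


def main():
    """Run one search in a visible browser window and save the first page."""
    from selenium import webdriver

    driver = webdriver.Chrome()  # a visible window, so a person can solve CAPTCHAs
    try:
        load_page(driver, build_search_url("web scraping", year_from=2020))
        export_csv(extract_results(driver), "scholar_results.csv")
    finally:
        driver.quit()
```

Calling `main()` opens a browser, runs a single query, and writes `scholar_results.csv`. Passing the prompt function as `wait_for_human` keeps the pause step replaceable, for example by a GUI dialog, without touching the page-loading logic.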

Published

2023-10-27

Section

College of Science: Department of Geography and Geoinformation Science
