The creation of expert knowledge bases depends largely on knowing which information resources are available. However, given the breadth of documentary content on the web, the documentalist may overlook key information sources. The tools and techniques available for discovering new content (such as web crawlers and data mining) do not always provide a comprehensive overview, which is why the scientific community is paying increasing attention to the major search engines. The case considered here concerns Google and Google Scholar, given their relevance to webometric and scientometric research and to the generation of datasets and document collections that lead to the creation of specialized big data.
If it were possible to trace and index Google’s search results, researchers could build knowledge bases automatically, downloading only those strategic resources and contents that meet their specific information needs and leveraging the querying power of the search engine. It would also be feasible to compile specialized documentation such as patents, databases, office documents, texts, and multimedia resources. To a large extent, the classification of the retrieved information would be determined by the queries submitted to the search engine, providing an excellent starting point for organizing knowledge. Moreover, researchers could incorporate into their studies sectors of the web entirely unfamiliar to them. In the productive sphere, this would have significant implications for the development of new specialized search engines, whose development costs would be substantially lower, since they would rely not on proprietary server infrastructure but on the leading search engine. Not to mention that companies offering product and service comparisons (e.g., for insurance, flights, or hotels) could expand their coverage to compare the search engine’s own content, rather than a selected set of websites. And it is likely that many applications of search-engine scraping have yet to be invented.
For all these reasons, it is evident that the technique of “web scraping” holds great relevance for the future of Documentation, both because it enables information professionals to manage web content directly and because of its socio-economic dimension, which contributes to the development and creation of new enterprises.
In order to demonstrate that it is possible to track the content and result pages of Google and leverage their information, a web scraping experiment was developed with the objective of retrieving the content of one or more result pages. Additionally, the web scraping program was connected to a custom web crawler based on Mbot, which re-crawls and indexes the content selected by the user. The web scraping program applied to Google thus becomes a search engine in its own right, expanding the information it receives and further enriching the original content of each webpage and website. The approach can be compared to selective web crawling driven by the user’s relevant results.
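The core of such a SERP-scraping step can be sketched as follows. This is a minimal illustration, not the experiment’s actual code: the markup below is a simplified, hypothetical imitation of a result page (Google’s real markup changes frequently, and live scraping may conflict with its terms of service), so the sketch parses a saved page rather than issuing requests.

```python
# Hypothetical sketch: extract (title, URL) pairs from a saved results page.
# The markup below is invented for illustration; real SERP markup differs.
from html.parser import HTMLParser

SAMPLE_SERP = """
<div class="g">
  <a href="http://example.org/paper1"><h3>First result</h3></a>
</div>
<div class="g">
  <a href="http://example.org/paper2"><h3>Second result</h3></a>
</div>
"""

class SerpParser(HTMLParser):
    """Collects (title, url) pairs from result blocks."""
    def __init__(self):
        super().__init__()
        self.results = []    # list of (title, url) tuples
        self._href = None    # href of the <a> element we are inside
        self._in_h3 = False
        self._title = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
        elif tag == "h3" and self._href:
            self._in_h3 = True
            self._title = []

    def handle_data(self, data):
        if self._in_h3:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._in_h3:
            self.results.append(("".join(self._title).strip(), self._href))
            self._in_h3 = False
        elif tag == "a":
            self._href = None

parser = SerpParser()
parser.feed(SAMPLE_SERP)
for title, url in parser.results:
    print(title, url)
```

The extracted URL list is exactly what a selective crawler such as the Mbot-based one needs as its seed set: each pair becomes a crawl frontier entry, so the downstream indexing stage only visits pages the user’s query already deemed relevant.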
The experiment was presented at the XIII Hispano-Mexican Seminar on Library and Information Science held at the Institute of Bibliological Research of the UNAM in Mexico City, and has also been featured on the specialized blog BIBLIORed 3.0.
▶ Google scraping experiment
http://www.mblazquez.es/google2down/
Fig.1. The SERP (Search Engine Results Page) contents are retrieved by a web scraping program specifically designed for Google and Google Scholar
Fig.2. The results can be analyzed using a web crawler derived from Mbot that recognizes headings, paragraphs, links, text, and other elements on each webpage selected by the user
Fig.3. The program has been designed to work with Google Scholar, given its potential for conducting scientometric studies
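The element-recognition step attributed to the Mbot-derived crawler in Fig. 2 can be illustrated with a small sketch. This is a hypothetical simplification (the sample page and class names are invented here): it tags each piece of page text as a heading, paragraph, or link, which is the kind of structural labeling the crawler performs on each user-selected page.

```python
# Hypothetical sketch of element recognition: classify page text as
# heading, paragraph, or link. Sample markup is invented for illustration.
from html.parser import HTMLParser

PAGE = """
<h1>Site title</h1>
<p>Opening paragraph with a <a href="/more">link</a>.</p>
<h2>Section</h2>
<p>Second paragraph.</p>
"""

class ElementClassifier(HTMLParser):
    """Labels text nodes with the kind of element that encloses them."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.elements = []   # list of (kind, text) pairs
        self._stack = []     # kinds of currently open elements

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._stack.append("heading")
        elif tag == "p":
            self._stack.append("paragraph")
        elif tag == "a":
            self._stack.append("link")

    def handle_endtag(self, tag):
        if (tag in self.HEADINGS or tag in ("p", "a")) and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            # Attribute the text to the innermost open element.
            self.elements.append((self._stack[-1], text))

classifier = ElementClassifier()
classifier.feed(PAGE)
for kind, text in classifier.elements:
    print(kind, text)
```

Keeping a stack of open elements lets nested content (such as a link inside a paragraph) be attributed to its innermost enclosing element, which is what makes per-element indexing of headings, paragraphs, and links possible.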