A hands-on guide to web scraping and text mining for both beginners and experienced users of R. Introduces fundamental concepts of the main architecture of the web and databases, covering HTTP, HTML, XML, JSON, and SQL. Provides basic techniques to query web documents and data sets (XPath and regular expressions). An extensive set of exercises is presented to guide the reader through each technique. Explores both supervised and unsupervised techniques as well as advanced techniques such as data scraping and text management. Case studies are featured throughout, along with examples for each technique presented. R code and solutions to exercises featured in the book are provided on a supporting website.
Preface
1 Introduction
1.1 Case study: World Heritage Sites in Danger
1.2 Some remarks on web data quality
1.3 Technologies for disseminating, extracting, and storing web data
1.3.1 Technologies for disseminating content on the Web
1.3.2 Technologies for information extraction from web documents
1.3.3 Technologies for data storage
1.4 Structure of the book
Part One: A Primer on Web and Data Technologies
2 HTML
2.1 Browser presentation and source code
2.2 Syntax rules
2.2.1 Tags, elements, and attributes
2.2.2 Tree structure
2.2.3 Comments
2.2.4 Reserved and special characters
2.2.5 Document type definition
2.2.6 Spaces and line breaks
2.3 Tags and attributes
2.3.1 The anchor tag <a>
2.3.2 The metadata tag <meta>
2.3.3 The external reference tag <link>
2.3.4 Emphasizing tags <b>, <i>, <strong>
2.3.5 The paragraphs tag <p>
2.3.6 Heading tags <h1>, <h2>, <h3>, ...
2.3.7 Listing content with <ul>, <ol>, and <dl>
2.3.8 The organizational tags <div> and <span>
2.3.9 The <form> tag and its companions
2.3.10 The foreign script tag <script>