Research Seminar "Data Scraping"

Academic year: 2022/2023
Language of instruction: English
Credits: 6

Course Syllabus

Abstract

Data scraping is the import of information from websites, spreadsheets, PDFs, and other data sources. Machine learning methods do not give good results without a well-prepared dataset, and well-prepared datasets suitable for machine learning algorithms are rare. Automating the preparation of such datasets is the task of data scraping. The course covers text file encodings, network interaction with web servers, the basics of HTML, the XML and JSON data storage and exchange formats, interaction with servers through APIs, and work with dynamic (non-static) sites. The course uses Python and its libraries to access data. At the end of the course, students implement a data scraping project.

Learning Objectives

  • Learn to process Excel, XML, JSON, and PDF files using Python
  • Learn the basics of IP, DNS, and HTTP, including GET and POST requests
  • Learn HTML basics
  • Learn to use the BeautifulSoup library and to automate browsing with Selenium
  • Learn to use APIs
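
As an illustration of the first objective, here is a minimal sketch of navigating JSON and XML with Python's standard library. The sample records and the function names topics_from_json and topics_from_xml are invented for this example and are not taken from the course materials.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for illustration only.
JSON_TEXT = '{"course": {"title": "Data Scraping", "topics": ["HTML", "JSON", "XML"]}}'
XML_TEXT = """<course title="Data Scraping">
  <topic>HTML</topic>
  <topic>JSON</topic>
  <topic>XML</topic>
</course>"""


def topics_from_json(text):
    """Parse a JSON string and return the nested list of topics."""
    data = json.loads(text)
    return data["course"]["topics"]


def topics_from_xml(text):
    """Parse an XML string and return the text of every <topic> child."""
    root = ET.fromstring(text)
    return [node.text for node in root.findall("topic")]


if __name__ == "__main__":
    print(topics_from_json(JSON_TEXT))
    print(topics_from_xml(XML_TEXT))
```

Both functions return the same list of topic names, which shows that the two formats can carry the same structure even though they are navigated differently (dictionary keys for JSON, element lookups for XML).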
Expected Learning Outcomes

  • Learn the most popular encodings
  • Convert text from one encoding to another
  • Navigate through JSON & XML
  • Extract text and images from PDF files
  • Apply regular expressions
  • Understand HTML
  • Create a simple HTML-page
  • Understand CSS
  • Analyze the connection between HTML and CSS
  • Create a more complex HTML page
  • Apply CSS to style an HTML page
  • Analyze HTTP protocol message format
  • Learn about Python web tools
  • Apply the Python requests module
  • Apply the Python requests module to work with headers, user sessions, POST requests, and files
  • Apply the Python BeautifulSoup module to scrape static pages
  • Analyze the difference between static and dynamic pages
  • Understand the capabilities of the Selenium library, its functions, and its methods
  • Apply the Selenium library to scrape data from a dynamic page
  • Recognize the concept of a web API
  • Contrast scraping via a web API with scraping via page source
  • Examine the process of web development
  • Create your own simple web service and web API
  • Implement a scraping script from scratch
  • Understand the legal and ethical nuances of data scraping
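
Two of the outcomes above, converting text between encodings and applying regular expressions, can be sketched with the standard library alone. The helper names reencode and find_emails, the sample strings, and the simplified e-mail pattern are assumptions made for this example; the pattern is not a complete address validator.

```python
import re


def reencode(raw, source, target):
    """Decode bytes from the source encoding and re-encode them in the target one."""
    return raw.decode(source).encode(target)


def find_emails(text):
    """Return substrings that look like e-mail addresses (simplified pattern)."""
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)


if __name__ == "__main__":
    # Cyrillic text stored as Windows-1251 bytes, converted to UTF-8.
    legacy = "Привет".encode("cp1251")
    modern = reencode(legacy, "cp1251", "utf-8")
    print(len(legacy), len(modern))

    # Hypothetical text, invented for illustration only.
    print(find_emails("Write to a.b@example.com or admin@test.org."))
```

The byte lengths differ because Windows-1251 stores each Cyrillic letter in one byte, while UTF-8 uses two bytes for the same letters.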
Course Contents

  • 1. Character Encodings
  • 2. Popular File Formats
  • 3. Regular Expressions and HTML
  • 4. HTML and CSS
  • 5. Internet
  • 6. Scraping HTML
  • 7. Selenium
  • 8. Web API
  • 9. Web development 101
  • 10. Practice
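
To give a feel for topic 6, the sketch below collects links from a static HTML page. It uses the standard-library html.parser module as a stand-in for BeautifulSoup so that it runs without third-party packages. The page source, the LinkCollector class, and the extract_links function are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser

# Hypothetical page source, invented for illustration only.
PAGE = """<html><body>
<h1>Seminar</h1>
<a href="/syllabus">Syllabus</a>
<a href="https://example.com/data.json">Data</a>
<a>No link here</a>
</body></html>"""


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Feed the HTML text to the parser and return the collected links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    print(extract_links(PAGE))
```

In a real scraper the page text would come from an HTTP response rather than a string constant, and anchors without an href attribute are skipped, as in this sketch.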
Assessment Elements

  • non-blocking Test
  • non-blocking Test
  • non-blocking Exam
Interim Assessment

  • 2022/2023 4th module
    0.5 * Exam + 0.25 * Test + 0.25 * Test
Bibliography

Recommended Core Bibliography

  • Matt West - HTML5 Foundations - John Wiley & Sons, Incorporated, 2012-386 - Electronic text - https://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=1120310

Recommended Additional Bibliography

  • Ian Pouncey and Richard York - Beginning CSS: Cascading Style Sheets for Web Design - John Wiley & Sons, Incorporated, 2011-466 - Electronic text - https://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=693510