Delivered by: Department of Applied Mathematics and Informatics (Nizhny Novgorod) (Faculty of Informatics, Mathematics and Computer Science (Nizhny Novgorod))

Status: Compulsory course

When: 2nd year, 4th module

Instructor

Савченко Андрей Владимирович

Abstract

Data scraping is the process of importing information from websites, spreadsheets, PDFs, and other data sources. Applying machine learning methods without a well-prepared dataset does not lead to good results, and well-prepared datasets suitable for machine learning algorithms are rare. Automating the preparation of such datasets is the task of data scraping. The course covers text file encodings, network interaction with web servers, the basics of HTML, the XML and JSON data storage and exchange formats, interaction with servers through APIs, and work with dynamic (non-static) sites. The course uses Python and its libraries to access data. At the end of the course, students implement a data scraping project.

Learning Objectives

Learn to process Excel, XML, JSON, and PDF files using Python
Learn the basics of IP, DNS, and HTTP, including GET and POST requests
Learn HTML basics
Learn to use the BeautifulSoup library and to automate browsing with Selenium
Learn to use APIs
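
As a rough illustration of the static-page scraping objective, the sketch below collects links from an HTML string. It is a minimal sketch under stated assumptions: it uses the standard-library html.parser module as a stand-in for BeautifulSoup so that it runs without extra packages, and the sample page, the LinkCollector class, and the extract_links function are hypothetical names introduced only for this example.

```python
from html.parser import HTMLParser

# Hypothetical static page, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <h1>Course list</h1>
  <a href="/courses/scraping">Data Scraping</a>
  <a href="/courses/ml">Machine Learning</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects (href, link text) pairs from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) pairs found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


print(extract_links(SAMPLE_HTML))
```

In practice the HTML would come from an HTTP GET request rather than a string literal, and BeautifulSoup would replace the hand-written parser class.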

Expected Learning Outcomes

Learn the most popular encodings
Convert text from one encoding to another
Navigate through JSON & XML
Extract text and images from PDF
Apply regular expressions
Understand HTML
Create a simple HTML-page
Understand CSS
Analyze the connection between HTML and CSS
Create a more complicated HTML page
Apply CSS to add style to HTML page
Analyze HTTP protocol message format
Learn about Python web tools
Apply the Python requests module
Apply the Python requests module to work with headers, user sessions, POST requests, and files
Apply the Python BeautifulSoup module to scrape static pages
Analyze the difference between static and dynamic pages
Understand the capabilities of the Selenium library, its functions and methods
Apply the Selenium library to scrape data from a dynamic page
Recognize the concept of a Web API
Contrast scraping via a Web API with scraping via page source
Examine the process of web development
Create your own simple web service and Web API
Implement a scraping script from scratch
Understand the legal and ethical nuances of data scraping
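
Two of the outcomes above, converting text between encodings and navigating JSON, can be sketched with the Python standard library alone. This is a minimal sketch, not course material: the reencode helper, the sample string, and the JSON document are all hypothetical and chosen only for illustration.

```python
import json


def reencode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target one."""
    return data.decode(source).encode(target)


# A Cyrillic string stored in the legacy single-byte cp1251 encoding.
text = "Нижний Новгород"
cp1251_bytes = text.encode("cp1251")

# Convert the same text to UTF-8, where each Cyrillic letter takes two bytes.
utf8_bytes = reencode(cp1251_bytes, "cp1251", "utf-8")

# Hypothetical JSON document, used only to show nested navigation.
raw = '{"course": {"title": "Data Scraping", "topics": ["Encodings", "HTML", "Selenium"]}}'
doc = json.loads(raw)

# Navigate: object -> object -> list -> first element.
first_topic = doc["course"]["topics"][0]

print(len(cp1251_bytes), len(utf8_bytes), first_topic)
```

Decoding with the wrong source encoding either raises UnicodeDecodeError or silently produces garbled text, which is why knowing the original encoding matters before converting.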

Course Contents

1. Character Encodings
2. Popular File Formats
3. Regular Expressions and HTML
4. HTML and CSS
5. Internet
6. Scraping HTML
7. Selenium
8. Web API
9. Web development 101
10. Practice

Assessment Elements

Test
Test
Exam

Interim Assessment

2022/2023 4th module
0.5 * Exam + 0.25 * Test + 0.25 * Test
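
The weighting above can be checked with a short calculation. This is only a sketch: the interim_grade function name and the sample scores are hypothetical, and a 10-point grading scale is assumed rather than stated in the syllabus.

```python
def interim_grade(exam: float, test1: float, test2: float) -> float:
    """Weighted interim grade: the exam counts for half, each test for a quarter."""
    return 0.5 * exam + 0.25 * test1 + 0.25 * test2


# Hypothetical scores on an assumed 10-point scale.
print(interim_grade(exam=8, test1=6, test2=10))  # 4.0 + 1.5 + 2.5 = 8.0
```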

Bibliography

Recommended Core Bibliography

Matt West. HTML5 Foundations. John Wiley & Sons, Incorporated, 2012. 386 p. Electronic text. https://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=1120310

Recommended Additional Bibliography

Ian Pouncey and Richard York. Beginning CSS: Cascading Style Sheets for Web Design. John Wiley & Sons, Incorporated, 2011. 466 p. Electronic text. https://ebookcentral.proquest.com/lib/hselibrary-ebooks/detail.action?docID=693510

Authors

Литвишкина Ален Витальевна

Master's programme "Artificial Intelligence and Computer Vision"

Research Seminar "Data Scraping"