Python for Data Science, AI & Development - APIs & Data Collection (2)
APIs & Data Collection
REST APIs, Webscraping, and Working with Files
REST APIs & HTTP Requests - Part 1
HTTP Protocol
HTTP Request
Overview of HTTP
- Scheme: the protocol; for this lab it will always be http://
- Internet address or Base URL: used to find the location, for example: www.ibm.com and www.gitlab.com
- Route: location on the web server, for example: /images/IDSNlogo.png (the sketch after this list splits a full URL into these parts)
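To make the three parts concrete, a small sketch using Python's standard urllib.parse (my choice of tool, not from the lecture):
from urllib.parse import urlparse

parts = urlparse('http://www.ibm.com/images/IDSNlogo.png')
parts.scheme   # 'http' -- the scheme
parts.netloc   # 'www.ibm.com' -- the internet address / base URL
parts.path     # '/images/IDSNlogo.png' -- the route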
Request Message
Response Message
Status Code
HTTP Methods
REST APIs & HTTP Requests - Part 2
Request Module in Python - Requests
import requests

url = 'https://www.ibm.com/'
r = requests.get(url)
r.status_code            # 200
r.request.headers        # view the request headers
r.request.body           # None -- a GET request has no body
header = r.headers       # response headers as a dict-like object
header['date']           # 'Thu, 19 Nov 2020 15:21:47 GMT'
header['Content-Type']   # 'text/html; charset=UTF-8'
r.encoding               # 'UTF-8'
r.text[0:100]            # display the first 100 characters of the response
GET Request with URL Parameters - GET Request
GET Request with URL Parameters - Query string
GET Request with URL Parameters - Create Query string
url_get = 'http://httpbin.org/get'
payload = {"name": "Joseph", "ID": "123"}
r = requests.get(url_get, params=payload)
r.url                        # 'http://httpbin.org/get?name=Joseph&ID=123'
r.request.body               # None
r.status_code                # 200
r.text                       # shows the full response text
r.headers['Content-Type']    # 'application/json'
r.json()                     # decode the response body as JSON
r.json()['args']             # {'ID': '123', 'name': 'Joseph'}
POST Requests - POST
url_post = "http://httpbin.org/post"
payload = {"name":"Joseph", "ID":"123"}
r_post = requests.post(url_post, data=payload)
POST Requests - Compare POST and GET
print("POST request URL:", r_post.url) # POST request URL : http://httpbin.org/post
print("GET request URL:", r.url) # GET request URL : http://httpbin.org/get?name=Joseph&ID=123
print("POST request body:", r_post.request.body) # POST request body : name=Joseph&ID=123
print("GET request body:", r.request.body) # GET request body : None
r_post.json()['form'] # {'ID':'123', 'name':'Joseph'}
HTML for Webscraping
HTML Tags
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary : $ 92,000,000 </p>
<h3> Stephen Curry </h3>
<p> Salary : $ 85,000,000 </p>
<h3> Kevin Durant </h3>
<p> Salary : $73,200,000 </p>
</body>
</html>
Composition of an HTML Tag - HTML Anchor Tag
Composition of an HTML Tag - Hyperlink Tag
Composition of an HTML Tag - Opening and End Tags
Composition of an HTML Tag - Hyperlink Content
Composition of an HTML Tag - Attributes
HTML Trees - Document Tree
HTML Tables
<table>
<tr>
<td>Pizza Place</td>
<td>Orders</td>
<td>Slices</td>
</tr>
<tr>
<td>Domino Pizza</td>
<td>10</td>
<td>100</td>
</tr>
<tr>
<td>Little Caesars</td>
<td>12</td>
<td>144</td>
</tr>
</table>
Webscraping
What is Webscraping?
- process that can be used to automatically extract information from a website
- can easily be accomplished within a matter of minutes
Beautiful Soup
from bs4 import BeautifulSoup
html="<!DOCTYPE html><html><head><title>Page Title</title></head>
<body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $92,000,000
</p><h3>Stephen Curry</h3><p> Salary: $85,000,000 </p><h3>Kevin Durant</h3>
<p> Salary : $73,200,000 </p></body></html>"
soup = BeautifulSoup(html, 'html5lib')
Beautiful Soup Objects - Tag Object
Beautiful Soup Objects - HTML Tree
Beautiful Soup Objects - Parent attribute
Beautiful Soup Objects - Next-sibling attribute
Beautiful Soup Objects - Navigable string
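These objects can be seen in a minimal sketch using the soup built above (attribute access, .parent, .next_sibling, and .string are standard Beautiful Soup API):
tag_object = soup.h3                   # first <h3> in the document -- a Tag object
tag_child = tag_object.b               # the <b> tag nested inside the <h3>
tag_parent = tag_child.parent          # navigate up the tree: back to the <h3>
tag_sibling = tag_object.next_sibling  # the <p> tag that follows the <h3>
tag_string = tag_child.string          # a NavigableString: 'Lebron James'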
find_all - Python iterable
find_all - Tag Object
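The snippets below assume the pizza table from the HTML Tables section above has already been parsed; a minimal setup sketch (the variable name table_html is mine):
# table_html is assumed to hold the <table> markup shown earlier
table_bs = BeautifulSoup(table_html, 'html5lib')
table_rows = table_bs.find_all('tr')   # find_all returns a Python iterable of <tr> Tag objects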
first_row = table_rows[0]
first_row
# <tr><td>Pizza Place</td><td>Orders</td><td>Slices</td></tr>
first_row.td
# <td>Pizza Place</td>
find_all - Variable Row
for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all("td")
    for j, cell in enumerate(cells):
        print("column", j, "cell", cell)
find_all - Elements
A Webpage Example - Overview Python Program
import requests
from bs4 import BeautifulSoup

page = requests.get("http://EnterWebsiteURL...").text
# Create a BeautifulSoup object
soup = BeautifulSoup(page, "html.parser")
# Pull all instances of the <a> tag
artists = soup.find_all('a')
# Strip the tags from the data
for artist in artists:
    names = artist.contents[0]
    fullLink = artist.get('href')
    print(names)
    print(fullLink)
Reading: Webscraping
How web scraping works (a minimal end-to-end sketch follows this list):
- HTTP Request: A web scraper sends an HTTP request to a specific URL, just as a web browser does when you visit a website. The request is usually an HTTP GET request, which retrieves the content of the web page.
- Web Page Retrieval: The retrieved content includes not only the visible text and media elements but also the underlying HTML structure that defines the page's layout.
- HTML Parsing: Parsing breaks the HTML structure down into its individual components, such as tags, attributes, and text content, and builds a structured representation that can be easily navigated and manipulated.
- Data Extraction: Scrapers locate the data by searching for relevant HTML tags, attributes, and patterns in the HTML structure.
- Data Transformation: Extracted data may need further processing and transformation. This step ensures the data is ready for analysis or other use cases.
- Storage: The choice of storage format depends on the specific project's requirements.
- Automation: Automation tools allow recurring data extraction from multiple web pages or websites. Automated scraping is especially useful for collecting data from dynamic websites that regularly update their content.
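The sketch below strings these steps together; the URL and output file name are placeholders of my choosing, not from the reading:
import csv
import requests
from bs4 import BeautifulSoup

# 1-2. HTTP request and page retrieval (example.com is a stand-in URL)
html = requests.get('https://example.com').text
# 3. Parse the HTML into a navigable tree
soup = BeautifulSoup(html, 'html.parser')
# 4. Extract data: here, every link's text and target
rows = [(a.get_text(strip=True), a.get('href')) for a in soup.find_all('a')]
# 5-6. Transform to tabular form and store as CSV
with open('links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(('text', 'href'))
    writer.writerows(rows)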
HTML Structure
- <html> is the root element of an HTML page
- <head> contains meta-information about the HTML page
- <body> displays the content on the web page, often the data of interest
- <h3> tags are type 3 headings, making text larger and bold, typically used for player names
- <p> tags represent paragraphs and contain player salary information
Composition of an HTML Tag
- An HTML tag consists of an opening (start) tag and a closing (end) tag
- Tags have names (e.g., <a> for an anchor tag)
- Tags may contain attributes with an attribute name and value, providing additional information to the tag (see the sketch after this list)
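One quick way to see these pieces is to parse a single anchor tag with Beautiful Soup (a sketch of mine, using the standard .name, .attrs, and .string accessors):
from bs4 import BeautifulSoup

# the anchor tag's name is 'a'; it carries one attribute (href) and string content
tag = BeautifulSoup('<a href="https://www.ibm.com">IBM</a>', 'html.parser').a
tag.name     # 'a'
tag.attrs    # {'href': 'https://www.ibm.com'}
tag.string   # 'IBM'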
HTML Document Tree
- Tags can contain strings and other tags, making them the tag's children
- Tags within the same parent tag are considered siblings
- For example, the <html> tag contains both the <head> and <body> tags, making them children (and descendants) of <html>; <head> and <body> are siblings of each other
HTML Tables
- Define an HTML table using the <table> tag
- Each table row is defined with a <tr> tag
- The first row often uses the table header tag, typically <th>
- The table cell is represented by <td> tags, defining individual cells in a row
Web Scraping
- Required Tools: Requests and Beautiful Soup
# Import Beautiful Soup to parse web page content
from bs4 import BeautifulSoup
- Fetching and Parsing HTML
import requests
from bs4 import BeautifulSoup
# Specify the URL of the webpage you want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'
# Send an HTTP GET request to the webpage
response = requests.get(url)
# Store the HTML content in a variable
html_content = response.text
# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')
# Display a snippet of the HTML content
print(html_content[:500])
- Navigating the HTML Structure
# Find all <a> tags (anchor tags) in the HTML
links = soup.find_all('a')
# Iterate through the list of links and print their text
for link in links:
    print(link.text)
- Custom Data Extraction: Web scraping lets you navigate the HTML structure and extract specific information based on your requirements. This may involve finding specific tags, attributes, or text content within the HTML document.
- Using BeautifulSoup for HTML Parsing: BeautifulSoup lets you find elements based on their tags, attributes, or text, making it easier to extract the information you're interested in.
- Using pandas read_html for Table Extraction: pandas.read_html can automatically extract data from HTML tables and present it in a format suitable for analysis. It's similar to taking a table from a webpage and importing it into a spreadsheet for further analysis; a minimal sketch follows.
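A minimal sketch of read_html, assuming the page contains at least one <table> (pandas also needs an HTML parser such as lxml or html5lib installed):
import pandas as pd

# read_html returns a list of DataFrames, one per <table> found on the page
tables = pd.read_html('https://en.wikipedia.org/wiki/IBM')
print(len(tables))       # how many tables were found
print(tables[0].head())  # preview the first table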