
Python for Data Science, AI & Development - APIs & Data Collection (2)

Coursera/IBM Data Science


떼닝 2024. 3. 14. 14:44

APIs & Data Collection

REST APIs, Webscraping, and Working with Files

REST APIs & HTTP Requests - Part 1

HTTP Protocol

HTTP Request

 

Overview of HTTP

- Scheme : the protocol; for this lab it will always be http://

- Internet address or Base URL : used to find the location, for example: www.ibm.com and www.gitlab.com

- Route : location on the web server, for example: /images/IDSNlogo.png
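
The three URL components above can be illustrated with Python's standard-library `urllib.parse` (a small sketch, not part of the original lab):

```python
from urllib.parse import urlparse

# Split a full URL into its components
parsed = urlparse("http://www.ibm.com/images/IDSNlogo.png")

print(parsed.scheme)   # 'http' -- the scheme (protocol)
print(parsed.netloc)   # 'www.ibm.com' -- the base URL / internet address
print(parsed.path)     # '/images/IDSNlogo.png' -- the route on the web server
```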

 

Request Message

Response Message

Status Code

HTTP Methods

 

REST APIs & HTTP Requests - Part 2

Request Module in Python - Requests

import requests

url = 'https://www.ibm.com/'
r = requests.get(url)
r.status_code             # 200

r.request.headers         # view the request headers
r.request.body            # None -- a GET request has no body
header = r.headers        # HTTP response headers as a dict-like object

header['date']            # 'Thu, 19 Nov 2020 15:21:47 GMT'
header['Content-Type']    # 'text/html; charset=UTF-8'
r.encoding                # 'UTF-8'
r.text[0:100]             # display the first 100 characters of the response body

 

GET Request with URL Parameters - GET Request

 

GET Request with URL Parameters - Query string

GET Request with URL Parameters - Create Query string

url_get = 'http://httpbin.org/get'
payload = {"name": "Joseph", "ID": "123"}
r = requests.get(url_get, params=payload)
r.url                       # 'http://httpbin.org/get?name=Joseph&ID=123'
r.request.body              # None
r.status_code               # 200
r.text                      # full response body as text
r.headers['Content-Type']   # 'application/json'
r.json()                    # decode the JSON response into a dict
r.json()['args']            # {'ID': '123', 'name': 'Joseph'}

 

POST Requests - POST

url_post = "http://httpbin.org/post"
payload = {"name": "Joseph", "ID": "123"}
r_post = requests.post(url_post, data=payload)

 

POST Requests - Compare POST and GET

print("POST request URL:", r_post.url)	# POST request URL : http://httpbin.org/post
print("GET request URL:", r.url)	# GET request URL : http://httpbin.org/get?name=Joseph&ID=123

print("POST request body:", r_post.request.body)	# POST request body : name=Joseph&ID=123
print("GET request body:", r.request.body)	# GET request body : None

r_post.json()['form']	# {'ID':'123', 'name':'Joseph'}

 

HTML for Webscraping

HTML Tags

<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b id='boldest'>Lebron James</b></h3>
<p> Salary : $ 92,000,000 </p>
<h3> Stephen Curry </h3>
<p> Salary : $ 85,000,000 </p>
<h3> Kevin Durant </h3>
<p> Salary : $ 73,200,000 </p>
</body>
</html>

 

Composition of an HTML Tag - HTML Anchor Tag

 

Composition of an HTML Tag - Hyperlink Tag

 

Composition of an HTML Tag - Opening and End Tags

 

Composition of an HTML Tag -  Hyperlink Content

 

Composition of an HTML Tag -  Attributes

 

HTML Trees - Document Tree

HTML Tables

<table>
    <tr>
        <td>Pizza Place</td>
        <td>Orders</td>
        <td>Slices</td>
    </tr>
    <tr>
        <td>Domino Pizza</td>
        <td>10</td>
        <td>100</td>
    </tr>
    <tr>
        <td>Little Caesars</td>
        <td>12</td>
        <td>144</td>
    </tr>
</table>

 

Webscraping

What is Webscraping?

- process that can be used to automatically extract information from a website

- can easily be accomplished within a matter of minutes

 

Beautiful Soup

from bs4 import BeautifulSoup

html = ("<!DOCTYPE html><html><head><title>Page Title</title></head>"
        "<body><h3><b id='boldest'>Lebron James</b></h3><p> Salary: $92,000,000 </p>"
        "<h3>Stephen Curry</h3><p> Salary: $85,000,000 </p><h3>Kevin Durant</h3>"
        "<p> Salary: $73,200,000 </p></body></html>")

soup = BeautifulSoup(html, 'html5lib')

 

Beautiful Soup Objects - Tag Object

 

Beautiful Soup Objects - HTML Tree

 

Beautiful Soup Objects - Parent attribute

 

Beautiful Soup Objects - Next-sibling attribute

 

Beautiful Soup Objects - Navigable string
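
The object attributes named in the slides above (tag object, parent, next sibling, navigable string) can be sketched with a small self-contained example; the HTML is the salary page from earlier:

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>Page Title</title></head><body>"
        "<h3><b id='boldest'>Lebron James</b></h3><p> Salary: $92,000,000 </p>"
        "<h3>Stephen Curry</h3><p> Salary: $85,000,000 </p></body></html>")
soup = BeautifulSoup(html, 'html.parser')

# Tag object: the first <h3> tag in the document
tag_object = soup.h3

# Navigating the HTML tree
tag_child = tag_object.b           # child tag <b id='boldest'>
parent = tag_child.parent          # back up to the enclosing <h3>
sibling = tag_object.next_sibling  # the <p> tag that follows the first <h3>

# Navigable string: the text inside a tag
tag_string = tag_child.string

print(sibling)      # <p> Salary: $92,000,000 </p>
print(tag_string)   # Lebron James
```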

find_all - Python iterable

find_all - Tag Object

first_row = table_rows[0]
first_row      # <tr><td>Pizza Place</td><td>Orders</td><td>Slices</td></tr>

first_row.td   # <td>Pizza Place</td>
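
The `table_rows` variable used above comes from `find_all`, which returns a Python iterable of Tag objects; a minimal self-contained sketch, using the pizza table from earlier:

```python
from bs4 import BeautifulSoup

table_html = ("<table><tr><td>Pizza Place</td><td>Orders</td><td>Slices</td></tr>"
              "<tr><td>Domino Pizza</td><td>10</td><td>100</td></tr>"
              "<tr><td>Little Caesars</td><td>12</td><td>144</td></tr></table>")
table_bs = BeautifulSoup(table_html, 'html.parser')

# find_all returns a ResultSet -- a Python iterable of Tag objects
table_rows = table_bs.find_all('tr')

first_row = table_rows[0]   # the first <tr> Tag object
print(first_row.td)         # <td>Pizza Place</td>
```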

 

find_all - Variable Row

for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all("td")

    for j, cell in enumerate(cells):
        print("column", j, "cell", cell)

 

find_all - Elements

for i, row in enumerate(table_rows):
    print("row", i)
    cells = row.find_all("td")

    for j, cell in enumerate(cells):
        print("column", j, "cell", cell)

 

 

A Webpage Example - Overview Python Program

import requests
from bs4 import BeautifulSoup

page = requests.get("http://EnterWebsiteURL...").text

# Creates a BeautifulSoup Object
soup = BeautifulSoup(page, "html.parser")

# Pulls all instances of <a> tag
artists = soup.find_all('a')

# Clears data of all tags
for artist in artists:
    names = artist.contents[0]
    fullLink = artist.get('href')
    print(names)
    print(fullLink)

 

Reading : Webscraping

How web scraping works:

- HTTP Request : A web scraper sends an HTTP request to a specific URL, similar to how a web browser would when you visit a website. The request is usually an HTTP GET request, which retrieves the content of the web page.

- Web Page Retrieval : This content includes not only the visible text and media elements but also the underlying HTML structure that defines the page's layout

- HTML Parsing : Parsing involves breaking down the HTML structure into its individual components, such as tags, attributes, and text content. It creates a structured representation of the HTML content that can be easily navigated and manipulated

- Data Extraction : Scrapers locate the data by searching for relevant HTML tags, attributes, and patterns in the HTML structure

- Data Transformation : Extracted data may need further processing and transformation. This step ensures that the data is ready for analysis or other use cases.

- Storage : The choice of storage format depends on the specific project's requirements

- Automation : Automation tools allow for recurring data extraction from multiple web pages or websites. Automated scraping is especially useful for collecting data from dynamic websites that regularly update their content.
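
The steps above can be sketched end to end. To stay runnable offline, this illustrative example replaces the live HTTP request/retrieval steps with a hard-coded page (the salary figures are the sample data from earlier; the record and file structure is hypothetical):

```python
import json
from bs4 import BeautifulSoup

# Steps 1-2 (HTTP request / web page retrieval) are simulated with a hard-coded page
page = ("<html><body><h3>Lebron James</h3><p>$92,000,000</p>"
        "<h3>Stephen Curry</h3><p>$85,000,000</p></body></html>")

# Step 3: HTML parsing
soup = BeautifulSoup(page, 'html.parser')

# Step 4: data extraction -- pair each <h3> name with the <p> that follows it
records = []
for h3 in soup.find_all('h3'):
    salary_tag = h3.find_next_sibling('p')
    # Step 5: data transformation -- strip '$' and ',' to get an integer
    salary = int(salary_tag.string.replace('$', '').replace(',', ''))
    records.append({'name': str(h3.string), 'salary': salary})

# Step 6: storage -- serialize the cleaned records to JSON
stored = json.dumps(records)
print(stored)
```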

 

HTML Structure

- <html> is the root element of an HTML page

- <head> contains meta-information about the HTML page

- <body> displays the content on the web page, often the data of interest

- <h3> tags are type 3 headings, making text larger and bold, typically used for player names

- <p> tags represent paragraphs and contain player salary information

 

Composition of an HTML Tag

- An HTML tag consists of an opening (start) tag and a closing (end) tag

- Tags have names (e.g., <a> for an anchor tag)

- Tags may contain attributes with an attribute name and value, providing additional information to the tag
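
A short illustration of these parts, using a hypothetical anchor tag parsed with Beautiful Soup:

```python
from bs4 import BeautifulSoup

# A hypothetical anchor tag: tag name 'a', attribute 'href' with a value,
# and content between the opening and closing tags
tag_html = "<a href='http://www.ibm.com'>IBM</a>"
tag = BeautifulSoup(tag_html, 'html.parser').a

print(tag.name)      # 'a' -- the tag name
print(tag.attrs)     # {'href': 'http://www.ibm.com'} -- attribute name and value
print(tag['href'])   # access a single attribute value
print(tag.string)    # 'IBM' -- the tag's content
```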

 

HTML Document Tree

- Tags can contain strings and other tags, making them the tag's children

- Tags within the same parent tag are considered siblings

- For example, the <html> tag contains both <head> and <body> tags, making them children of <html>, and <head> and <body> are siblings

 

HTML Tables

- Define an HTML table using the <table> tag

- Each table row is defined with a <tr> tag

- The first row often uses the table header tag, typically <th>

- The table cell is represented by <td> tags, defining individual cells in a row

 

Web Scraping

- Required Tools : Requests and Beautiful Soup

# Import Beautiful Soup to parse web page content
from bs4 import BeautifulSoup

 

- Fetching and Parsing HTML

import requests
from bs4 import BeautifulSoup

# Specify the URL of the webpage you want to scrape
url = 'https://en.wikipedia.org/wiki/IBM'

# Send an HTTP GET request to the webpage
response = requests.get(url)

# Store the HTML content in a variable
html_content = response.text

# Create a BeautifulSoup object to parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Display a snippet of the HTML content
print(html_content[:500])

 

- Navigating the HTML Structure

# Find all <a> tags (anchor tags) in the HTML
links = soup.find_all('a')

# Iterate through the list of links and print their text
for link in links:
	print(link.text)

 

- Custom Data Extraction : Web scraping allows you to navigate the HTML structure and extract specific information based on your requirements. This may involve finding specific tags, attributes, or text content within the HTML document

- Using BeautifulSoup for HTML Parsing : BeautifulSoup allows you to find elements based on their tags, attributes, or text, making it easier to extract the information you're interested in.

- Using pandas read_html for Table Extraction : pandas.read_html can automatically extract data from these tables and present it in a format suitable for analysis. It's similar to taking a table from a webpage and importing it into a spreadsheet for further analysis
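
As a sketch of the pandas approach (assuming pandas and a parser backend such as lxml are installed), `read_html` can lift the pizza table from earlier straight into a DataFrame:

```python
from io import StringIO
import pandas as pd

table_html = """<table>
  <tr><th>Pizza Place</th><th>Orders</th><th>Slices</th></tr>
  <tr><td>Domino Pizza</td><td>10</td><td>100</td></tr>
  <tr><td>Little Caesars</td><td>12</td><td>144</td></tr>
</table>"""

# read_html returns a list of DataFrames, one per <table> found in the HTML
tables = pd.read_html(StringIO(table_html))
df = tables[0]
print(df)
```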

 
