Coursera/IBM Data Science

Python for Data Science, AI & Development - Working with Data in Python (2)

떼닝 2024. 1. 22. 22:35

Python for Data Science

Pandas

Loading Data with Pandas

Importing Pandas

import pandas
csv_path = 'file1.csv'
df = pandas.read_csv(csv_path)

 

Importing with an alias

import pandas as pd
csv_path = 'file1.csv'
df = pd.read_csv(csv_path)

 

Dataframes (CSV)

csv_path = 'file.csv'
df = pd.read_csv(csv_path)
df.head()

- df is short for dataframe

- head() returns the first five rows by default
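To try head() without a file on disk, here is a minimal sketch that reads a small CSV from an in-memory string. The CSV text and its column names are made up for illustration; only read_csv and head come from the notes.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'file.csv'
csv_text = """Album,Released
Thriller,1982
Back in Black,1980
The Dark Side of the Moon,1973
The Bodyguard,1992
Bat Out of Hell,1977
Saturday Night Fever,1977
"""

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

first_five = df.head()   # default: the first 5 rows
first_two = df.head(2)   # pass n to change the number of rows

print(first_five)
print(first_two)
```

Passing a number to head() is handy when five rows are more (or fewer) than you want to see.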

 

Dataframes (XLSX)

xlsx_path = 'file1.xlsx'
df = pd.read_excel(xlsx_path)
df.head()

 

Dataframes

 

- A dataframe can also be created from a dictionary

songs = {'Album': ['Thriller', 'Back in Black', 'The Dark Side of the Moon',
                   'The Bodyguard', 'Bat Out of Hell'],
         'Released': [1982, 1980, 1973, 1992, 1977],
         'Length': ['00:42:19', '00:42:11', '00:42:49', '00:57:44', '00:46:33']}

songs_frame = pd.DataFrame(songs)

 

- The syntax below selects a single column (attribute) from the dataframe

x=df[['Length']]

 

- Several columns can be selected at once as well

y=df[['Artist', 'Length', 'Genre']]
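The snippets above assume a df that has 'Artist' and 'Genre' columns, which the songs dictionary does not have. As a self-contained sketch, the same selections can be tried on a shortened version of the songs dictionary (three albums instead of five, purely for brevity):

```python
import pandas as pd

# Same dictionary shape as in the notes, shortened to three albums
songs = {
    'Album': ['Thriller', 'Back in Black', 'The Dark Side of the Moon'],
    'Released': [1982, 1980, 1973],
    'Length': ['00:42:19', '00:42:11', '00:42:49'],
}
songs_frame = pd.DataFrame(songs)

x = songs_frame[['Length']]             # double brackets -> DataFrame
y = songs_frame[['Album', 'Released']]  # list of names -> DataFrame
z = songs_frame['Length']               # single brackets -> Series

print(type(x).__name__, x.shape)
print(type(y).__name__, y.shape)
print(type(z).__name__, z.shape)
```

The difference between single and double brackets matters later: a Series and a one-column DataFrame print similarly but support different methods.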

 

 

Pandas : Working with and Saving Data

List Unique Values

- The pandas method unique() returns the distinct values of a column, with duplicates removed

 

- Applying a comparison expression to a column returns a True/False value for each row

 

- The rows for which the expression is True can be collected into a new dataframe
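The three points above can be sketched as follows. The dataframe reuses the album data from earlier in these notes; the threshold 1980 and the names years, mask, and df1 are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Album': ['Thriller', 'Back in Black', 'The Dark Side of the Moon',
              'The Bodyguard', 'Bat Out of Hell'],
    'Released': [1982, 1980, 1973, 1992, 1977],
})

# Distinct values of one column, in order of first appearance
years = df['Released'].unique()

# A comparison on a column yields a boolean Series (one True/False per row)
mask = df['Released'] >= 1980

# Indexing with the mask keeps only the rows where it is True
df1 = df[mask]

print(years)
print(mask.tolist())
print(df1)
```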

 

Save as CSV

df1.to_csv('new_songs.csv')
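By default to_csv also writes the row labels as an extra first column. A minimal sketch of a round trip, assuming you do not want that column: the small df1 below and the temporary directory are stand-ins so the example leaves no files behind.

```python
import os
import tempfile
import pandas as pd

# Hypothetical filtered dataframe standing in for df1
df1 = pd.DataFrame({'Album': ['Thriller', 'The Bodyguard'],
                    'Released': [1982, 1992]})

# Write into a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'new_songs.csv')

    # index=False omits the row labels from the file
    df1.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip
    reloaded = pd.read_csv(out_path)

print(reloaded)
```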

 

Reading: Pandas

What is Pandas?

- Data Structures : Pandas offers two primary data structures - DataFrame and Series

    1. A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

    2. A Series is a one-dimensional labeled array, essentially a single column or row of data

- Data Import and Export : Pandas makes it easy to read from various sources, including CSV files, Excel spreadsheets, SQL databases, and more. It can also export data to these formats, enabling seamless data exchange.

- Data Merging and Joining : You can combine multiple DataFrames using methods like merge and join, similar to SQL operations, to create more complex datasets from different sources.

- Efficient Indexing : Pandas provides efficient indexing and selection methods, allowing you to access specific rows and columns of data quickly

- Custom Data Structures : You can create custom data structures and manipulate data in ways that suit your specific needs, extending Pandas' capabilities
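As a small illustration of the merging point above, here is a sketch with two made-up tables that share an 'Album' column. The table names and contents are assumptions for the example; merge with the on and how parameters is the standard pandas call.

```python
import pandas as pd

# Two hypothetical tables that share the 'Album' column
albums = pd.DataFrame({
    'Album': ['Thriller', 'Back in Black', 'The Bodyguard'],
    'Released': [1982, 1980, 1992],
})
artists = pd.DataFrame({
    'Album': ['Thriller', 'Back in Black', 'Bat Out of Hell'],
    'Artist': ['Michael Jackson', 'AC/DC', 'Meat Loaf'],
})

# Inner join: keep only albums present in both tables
inner = albums.merge(artists, on='Album', how='inner')

# Left join: keep every row of albums; a missing artist becomes NaN
left = albums.merge(artists, on='Album', how='left')

print(inner)
print(left)
```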

 

What is a Series?

- A Series is a one-dimensional labeled array in Pandas.

- It can be thought of as a single column of data with a label (index) for each element.

- You can create a Series from various data sources, such as lists, NumPy arrays, or dictionaries.

import pandas as pd

# Create a Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data)

print(s)

- pandas automatically assigns integer labels (0, 1, 2, 3, 4) to the elements; custom labels can be specified if needed
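A short sketch of the custom-label case, using the index argument of pd.Series and a dictionary as the data source. The labels and values are made up for illustration.

```python
import pandas as pd

# Custom labels via the index argument
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# From a dictionary: the keys become the labels
prices = pd.Series({'apple': 1.5, 'pear': 2.0})

print(s)
print(prices)
```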

 

Accessing Elements in a Series

- You can access elements in a Series using the index labels or integer positions.

Accessing by label

print(s[2])	# access the element with label 2 (value 30)

 

Accessing by position

print(s.iloc[3])	# access the element at position 3 (value 40)

 

Accessing multiple elements

print(s[1:4])	# slice by position: elements 1 through 3 (values 20, 30, 40)
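With the default integer index, labels and positions coincide, which hides the difference between the two access styles. A sketch with string labels (chosen here for illustration) makes it visible, using the explicit .loc and .iloc accessors:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['c']          # element labeled 'c'
by_position = s.iloc[3]        # element at position 3

# Label slices include the end label; position slices exclude the end
label_slice = s.loc['b':'d']
position_slice = s.iloc[1:4]

print(by_label, by_position)
print(label_slice.tolist())
print(position_slice.tolist())
```

Both slices return the same three elements here, even though one is written with an inclusive end and the other with an exclusive end.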

 

Series Attributes and Methods

- values : returns the Series data as a NumPy array

- index : returns the index (labels) of the Series

- shape : returns a tuple representing the dimensions of the Series

- size : returns the number of elements in the Series

- mean(), sum(), min(), max() : calculate summary statistics of the data

- unique(), nunique() : get unique values or the number of unique values

- sort_values(), sort_index() : sort the Series by values or index labels

- isnull(), notnull() : check for missing (NaN) or non-missing values

- apply() : apply a custom function to each element of the Series
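A few of the attributes and methods listed above, tried on a small made-up Series with one missing value:

```python
import pandas as pd

# Hypothetical data; None becomes NaN, so the Series holds floats
s = pd.Series([30, 10, 20, 10, None])

print(s.shape)            # tuple of dimensions
print(s.size)             # number of elements, NaN included
print(s.mean())           # NaN is skipped by default
print(s.nunique())        # number of distinct non-missing values
print(s.isnull().sum())   # number of missing values

# sort_values places NaN last by default
sorted_s = s.sort_values()

# apply runs a function on each element
doubled = s.apply(lambda v: v * 2)

print(sorted_s.tolist())
print(doubled.tolist())
```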

 

What is a DataFrame?

- a DataFrame is a two-dimensional labeled data structure with columns of potentially different data types

- think of it as a table where each column represents a variable, and each row represents an observation or data point

- DataFrames are suitable for a wide range of data, including structured data from CSV files, Excel spreadsheets, SQL databases, and more

 

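Column Selection:

- you can select a single column from a DataFrame by putting the column name in square brackets: single brackets (df['Name']) return a Series, double brackets (df[['Name']]) return a one-column DataFrame

- multiple columns can be selected by passing a list of column names inside the brackets, which creates a new DataFrame

print(df['Name'])	# access the 'Name' column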

 

Accessing rows:

You can access rows by integer position using .iloc[] or by index label using .loc[]

print(df.iloc[2])	# access the third row by position
print(df.loc[1])	# access the row labeled 1 (the second row under the default index)

 

Slicing:

You can slice DataFrames to select specific rows and columns

print(df[['Name', 'Age']])	# Select specific columns
print(df[1:3])	# Select the rows at positions 1 and 2
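The df used in these snippets is not defined in the notes. A minimal sketch with a made-up 'Name'/'Age' table follows; the index deliberately starts at 1 so that labels and positions differ.

```python
import pandas as pd

# Hypothetical data; index starts at 1 so labels and positions differ
df = pd.DataFrame(
    {'Name': ['Ana', 'Ben', 'Cy', 'Dee'], 'Age': [31, 25, 47, 38]},
    index=[1, 2, 3, 4],
)

third_row = df.iloc[2]         # position 2 -> the row for Cy
label_one = df.loc[1]          # label 1 -> the row for Ana
subset = df[['Name', 'Age']]   # column selection -> DataFrame
rows = df[1:3]                 # positions 1 and 2 -> Ben and Cy

print(third_row['Name'], label_one['Name'])
print(rows)
```

With the default 0-based index, df.loc[1] would return the second row instead, which is why the two accessors are easy to confuse.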

 

DataFrame Attributes and Methods

- shape : returns the dimensions (number of rows and columns) of the DataFrame

- info() : provides a summary of the DataFrame, including data types and non-null counts

- describe() : generates summary statistics for numerical columns

- head(), tail() : displays the first or last n rows of the DataFrame

- mean(), sum(), min(), max() : calculate summary statistics for columns

- sort_values() : sort the DataFrame by one or more columns

- groupby() : group data based on specific columns for aggregation

- fillna(), drop(), rename() : handle missing values, drop columns, or rename columns

- apply() : apply a function to each element, row, or column of the DataFrame
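Several of the methods listed above, chained on a small made-up table. The column names and numbers are assumptions for illustration, not data from the course.

```python
import pandas as pd

# Hypothetical data with one missing sales figure
df = pd.DataFrame({
    'Genre': ['rock', 'pop', 'rock', 'pop'],
    'Sales': [20.0, 60.0, None, 30.0],
    'Released': [1980, 1982, 1973, 1992],
})

print(df.shape)        # (rows, columns)
df.info()              # dtypes and non-null counts
print(df.describe())   # summary statistics for numeric columns

filled = df.fillna({'Sales': 0.0})               # replace NaN in 'Sales'
by_year = filled.sort_values('Released')         # sort rows by one column
totals = filled.groupby('Genre')['Sales'].sum()  # aggregate per group
renamed = filled.rename(columns={'Sales': 'Sales (millions)'})
trimmed = renamed.drop(columns=['Released'])     # drop a column

print(totals)
print(trimmed)
```

None of these calls modify df in place; each returns a new object, which is why the results are assigned to new names.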

 

Practice Quiz

Q. Which Python object do you cast to a dataframe?

A. dictionary

 

Q. How would you access the first-row and first column in the dataframe df?

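A. df.iloc[0, 0] (the quiz answer was df.ix[0, 0], but .ix was removed in pandas 1.0; .iloc is the positional equivalent)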

 

Q. What is the proper way to load a CSV file using pandas?

A. pandas.read_csv('data.csv')

 

 

Q. How would you select the Genre disco? Select all that apply

A. df.iloc[6, 4]

 

Q. Which will NOT evaluate to 20.6? Select all that apply.

A. df.loc[4, 'Music Recording Sales']

    df.iloc[6, 'Music Recording Sales (millions)']

 

Q. How do we select Albums The Dark Side of the Moon to Their Greatest Hits (1971-1975)? Select all that apply

A. df.loc[2:5, 'Album']

     df.iloc[2:6, 1]
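Both answers to the last question select the same rows because .loc slices include the end label while .iloc slices exclude the end position. The quiz table is not reproduced in these notes, so the sketch below uses a stand-in table with 'Artist' as column 0 and 'Album' as column 1; that layout is an assumption.

```python
import pandas as pd

# Stand-in for the quiz table (assumed layout: Artist, then Album)
df = pd.DataFrame({
    'Artist': ['Michael Jackson', 'AC/DC', 'Pink Floyd',
               'Whitney Houston', 'Meat Loaf', 'Eagles'],
    'Album': ['Thriller', 'Back in Black', 'The Dark Side of the Moon',
              'The Bodyguard', 'Bat Out of Hell',
              'Their Greatest Hits (1971-1975)'],
})

by_label = df.loc[2:5, 'Album']   # labels 2 to 5, end included
by_position = df.iloc[2:6, 1]     # positions 2 to 5, end excluded

print(by_label.tolist())
print(by_position.tolist())
```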