Python for Data Science, AI & Development - Working with Data in Python (2)
Python for Data Science
Pandas
Loading Data with Pandas
Importing Pandas
import pandas
csv_path = 'file1.csv'
df = pandas.read_csv(csv_path)
Importing with an alias
import pandas as pd
csv_path = 'file1.csv'
df = pd.read_csv(csv_path)
Dataframes (CSV)
csv_path = 'file.csv'
df = pd.read_csv(csv_path)
df.head()
- df is short for dataframe
- head() returns the first five rows from the top by default
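A minimal sketch of read_csv and head(), assuming an in-memory CSV so that no file on disk is needed. The CSV text and the 'Example Album' row are hypothetical placeholders, not data from the course.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'file.csv'
csv_text = """Album,Released
Thriller,1982
Back in Black,1980
The Dark Side of the Moon,1973
The Bodyguard,1992
Bat Out of Hell,1977
Example Album,2000
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

first_five = df.head()   # no argument: first five rows
first_two = df.head(2)   # explicit row count

print(first_five)
print(first_two)
```

Passing a number to head() changes how many rows come back; the frame itself is not modified.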
Dataframes (XLSX)
xlsx_path = 'file1.xlsx'
df = pd.read_excel(xlsx_path)
df.head()
Dataframes
- a dataframe can be created from a dictionary
songs = {'Album': ['Thriller', 'Back in Black', 'The Dark Side of the Moon',
                   'The Bodyguard', 'Bat Out of Hell'],
         'Released': [1982, 1980, 1973, 1992, 1977],
         'Length': ['00:42:19', '00:42:11', '00:42:49', '00:57:44', '00:46:33']}
songs_frame = pd.DataFrame(songs)
- a single column can be selected like this:
x = df[['Length']]
- several columns can be selected at once too:
y = df[['Artist', 'Length', 'Genre']]
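A short sketch of the dictionary-to-dataframe step and column selection, using the songs dictionary from these notes. The variable names length_frame, length_series and subset are made up for illustration.

```python
import pandas as pd

songs = {
    'Album': ['Thriller', 'Back in Black', 'The Dark Side of the Moon',
              'The Bodyguard', 'Bat Out of Hell'],
    'Released': [1982, 1980, 1973, 1992, 1977],
    'Length': ['00:42:19', '00:42:11', '00:42:49', '00:57:44', '00:46:33'],
}
songs_frame = pd.DataFrame(songs)

length_frame = songs_frame[['Length']]        # double brackets: DataFrame
length_series = songs_frame['Length']         # single brackets: Series
subset = songs_frame[['Album', 'Released']]   # several columns: DataFrame

print(type(length_frame).__name__)
print(type(length_series).__name__)
print(subset.shape)
```

The bracket style decides the return type: a list of names gives back a DataFrame even when the list holds only one name.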
Pandas : Working with and Saving Data
List Unique Values
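- the pandas method unique() returns the distinct values of a column, without duplicates
- a comparison expression on a column (e.g. a condition on the release year) yields True/False for each row
- the rows for which the expression is True can be collected into a new dataframe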
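The three points above can be sketched as follows. The 'Decade' column and the 1980 cutoff are assumptions chosen only to make the example show duplicates and a filter; they are not from the course material.

```python
import pandas as pd

df = pd.DataFrame({
    'Album': ['Thriller', 'Back in Black', 'The Dark Side of the Moon',
              'The Bodyguard', 'Bat Out of Hell'],
    'Released': [1982, 1980, 1973, 1992, 1977],
})

# Derived column with repeated values, so unique() has something to deduplicate
df['Decade'] = df['Released'] // 10 * 10
decades = df['Decade'].unique()          # distinct values, in order of appearance

# A comparison on a column gives one True/False per row
released_1980_or_later = df['Released'] >= 1980

# Indexing with that boolean Series keeps only the True rows
df1 = df[released_1980_or_later]

print(decades)
print(released_1980_or_later)
print(df1)
```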
Save as CSV
df1.to_csv('new_songs.csv')
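A small sketch of what to_csv writes, assuming a two-row df1 built here for illustration. Calling to_csv without a path returns the CSV text instead of writing a file, which makes the effect of the index parameter easy to see.

```python
import pandas as pd

df1 = pd.DataFrame({'Album': ['Thriller', 'Back in Black'],
                    'Released': [1982, 1980]})

with_index = df1.to_csv()                # row index written as an unnamed first column
without_index = df1.to_csv(index=False)  # row index left out

print(with_index)
print(without_index)
```

When the saved file will be read back with read_csv, index=False avoids an extra unnamed column.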
Reading: Pandas
What is Pandas?
- Data Structures : Pandas offers two primary data structures - DataFrame and Series
1. A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
2. A Series is a one-dimensional labeled array, essentially a single column or row of data
- Data Import and Export : Pandas makes it easy to read from various sources, including CSV files, Excel spreadsheets, SQL databases, and more. It can also export data to these formats, enabling seamless data exchange.
- Data Merging and Joining : You can combine multiple DataFrames using methods like merge and join, similar to SQL operations, to create more complex datasets from different sources.
- Efficient Indexing : Pandas provides efficient indexing and selection methods, allowing you to access specific rows and columns of data quickly
- Custom Data Structures : You can create custom data structures and manipulate data in ways that suit your specific needs, extending Pandas' capabilities
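As an illustration of the merging point above, here is a minimal sketch with two small hypothetical frames that share an 'Album' column. The frame names and the choice of join types are assumptions for the example.

```python
import pandas as pd

albums = pd.DataFrame({'Album': ['Thriller', 'Back in Black'],
                       'Released': [1982, 1980]})
lengths = pd.DataFrame({'Album': ['Thriller', 'The Bodyguard'],
                        'Length': ['00:42:19', '00:57:44']})

# Inner join: only albums present in both frames
inner = albums.merge(lengths, on='Album', how='inner')

# Left join: every row of albums, with NaN where lengths has no match
left = albums.merge(lengths, on='Album', how='left')

print(inner)
print(left)
```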
What is a Series?
- Series is a one-dimensional labeled array in Pandas.
- can be thought of as a single column of data with labels or indices for each element.
- can create a Series from various data sources, such as lists, NumPy arrays, or dictionaries.
import pandas as pd
# Create a Series from a list
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(s)
- pandas automatically assigns numerical indices (0, 1, 2, 3, 4) to the elements; custom labels can be specified if needed
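A brief sketch of default versus custom labels. The letter labels and the dictionary keys are arbitrary choices for illustration.

```python
import pandas as pd

data = [10, 20, 30, 40, 50]

s_default = pd.Series(data)                                   # labels 0..4
s_labeled = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])  # custom labels
s_from_dict = pd.Series({'x': 1, 'y': 2})                     # dict keys become labels

print(s_default.index)
print(s_labeled)
print(s_from_dict)
```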
Accessing Elements in a Series
- You can access elements in a Series using the index labels or integer positions.
Accessing by label
print(s[2]) # access the element with label 2 (value 30)
Accessing by position
print(s.iloc[3]) # access the element at position 3 (value 40)
Accessing multiple elements
print(s[1:4]) # slice positions 1 through 3 (values 20, 30, 40); the end position is excluded
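With the default 0..4 index, labels and positions coincide, so the difference is hard to see. A sketch with hypothetical letter labels makes it explicit; .loc and .iloc are used so that each line states which kind of lookup it performs.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['c']          # label lookup
by_position = s.iloc[3]        # position lookup

label_slice = s.loc['b':'d']   # label slice includes the end label
position_slice = s.iloc[1:4]   # position slice excludes the end position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Both slices return the same three elements here, but for different reasons: 'd' is included by .loc, while position 4 is excluded by .iloc.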
Series Attributes and Methods
- values : returns the Series data as a NumPy array
- index : returns the index (labels) of the Series
- shape : returns a tuple representing the dimensions of the Series
- size : returns the number of elements in the Series
- mean(), sum(), min(), max() : calculate summary statistics of the data
- unique(), nunique() : get unique values or the number of unique values
- sort_values(), sort_index() : sort the Series by values or index labels
- isnull(), notnull() : check for missing (NaN) or non-missing values
- apply() : apply a custom function to each element of the Series
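A compact sketch exercising several of the attributes and methods listed above, on a made-up Series with one duplicate and one missing value.

```python
import pandas as pd

# Hypothetical data: 10 appears twice, None becomes NaN
s = pd.Series([30, 10, 20, 10, None])

dims = s.shape                      # one-element tuple
count = s.size                      # number of elements, NaN included
average = s.mean()                  # NaN is skipped by default
distinct_count = s.nunique()        # NaN is not counted by default
missing_count = s.isnull().sum()    # number of NaN entries
ordered = s.sort_values()           # ascending, NaN placed last
doubled = s.apply(lambda v: v * 2)  # element-wise custom function

print(dims, count, average, distinct_count, missing_count)
print(ordered)
print(doubled)
```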
What is a DataFrame?
- a DataFrame is a two-dimensional labeled data structure with columns of potentially different data types
- think of it as a table where each column represents a variable, and each row represents an observation or data point
- DataFrames are suitable for a wide range of data, including structured data from CSV files, Excel spreadsheets, SQL databases, and more
Column Selection:
- you can select a single column from a DataFrame by specifying the column name within brackets; single brackets return a Series, double brackets return a one-column DataFrame
- multiple columns can be selected by passing a list of names in double brackets, creating a new DataFrame
print(df['Name']) # access the 'Name' column (returned as a Series)
Accessing rows:
You can access rows by their index using .iloc[] or by label using .loc[]
print(df.iloc[2]) # access the third row by position
print(df.loc[1]) # access the second row by label
Slicing:
You can slice DataFrames to select specific rows and columns
print(df[['Name', 'Age']]) # Select specific columns
print(df[1:3]) # Select specific rows
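The notes do not show the df used in these snippets, so here is a minimal sketch with a hypothetical 'Name'/'Age' frame that makes each access runnable. The names and ages are invented.

```python
import pandas as pd

# Hypothetical frame; the default row labels are 0, 1, 2, 3
df = pd.DataFrame({'Name': ['Ana', 'Ben', 'Cho', 'Dev'],
                   'Age': [31, 25, 40, 28]})

name_column = df['Name']       # single column as a Series
third_row = df.iloc[2]         # third row by position
second_row = df.loc[1]         # row whose label is 1
two_columns = df[['Name', 'Age']]
middle_rows = df[1:3]          # rows at positions 1 and 2

print(third_row)
print(second_row)
print(middle_rows)
```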
DataFrame Attributes and Methods
- shape : returns the dimensions (number of rows and columns) of the DataFrame
- info() : provides a summary of the DataFrame, including data types and non-null counts
- describe() : generates summary statistics for numerical columns
- head(), tail() : displays the first or last n rows of the DataFrame
- mean(), sum(), min(), max() : calculate summary statistics for columns
- sort_values() : sort the DataFrame by one or more columns
- groupby() : group data based on specific columns for aggregation
- fillna(), drop(), rename() : fill in missing values, drop rows or columns, or rename row/column labels
- apply() : apply a function to each element, row, or column of the DataFrame
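A sketch chaining a few of the methods listed above. The 'Genre' and 'Sales' values are invented, and filling missing sales with 0.0 is only an assumption for the example, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    'Genre': ['rock', 'pop', 'rock', 'pop'],
    'Sales': [20.0, None, 30.0, 10.0],
})

dims = df.shape                                   # (rows, columns)
filled = df.fillna({'Sales': 0.0})                # replace NaN in 'Sales'
renamed = filled.rename(columns={'Sales': 'Sales (millions)'})
by_genre = renamed.groupby('Genre')['Sales (millions)'].sum()
ranked = renamed.sort_values('Sales (millions)', ascending=False)

print(dims)
print(by_genre)
print(ranked)
```

Each call returns a new object, so the original df still contains the missing value.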
Practice Quiz
Q. What python object do you cast to a dataframe?
A. dictionary
Q. How would you access the first-row and first column in the dataframe df?
A. df.ix[0, 0] (.ix was removed in pandas 1.0; the current equivalent is df.iloc[0, 0])
Q. What is the proper way to load a CSV file using pandas?
A. pandas.read_csv('data.csv')
Q. How would you select the Genre disco? Select all that apply
A. df.iloc[6, 4]
Q. Which will NOT evaluate to 20.6? Select all that apply.
A. df.loc[4, 'Music Recording Sales']
df.iloc[6, 'Music Recording Sales (millions)']
Q. How do we select Albums The Dark Side of the Moon to Their Greatest Hits (1971-1975)? Select all that apply
A. df.loc[2:5, 'Album']
df.iloc[2:6, 1]
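The last answer relies on .loc slices including the end label while .iloc slices exclude the end position. A sketch on a hypothetical seven-row frame with placeholder names (the quiz dataframe itself is not reproduced in these notes):

```python
import pandas as pd

# Placeholder data; 'Album' sits at column position 1 as in the quiz answer
df = pd.DataFrame({
    'Artist': [f'Artist {i}' for i in range(7)],
    'Album': [f'Album {i}' for i in range(7)],
})

by_label = df.loc[2:5, 'Album']    # labels 2, 3, 4, 5 (end included)
by_position = df.iloc[2:6, 1]      # positions 2, 3, 4, 5 (end excluded)

first_cell = df.iloc[0, 0]         # first row, first column by position

print(by_label)
print(by_position.equals(by_label))
print(first_cell)
```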