Tools for Data Science

Coursera/IBM Data Science

Tools for Data Science - Data Science Tools

떼닝 2023. 12. 12. 05:58

Data Science Tools

Categories of Data Science Tools

Data Science Categories

- raw data must pass through several Data Science task categories : Data Management, Data Integration and Transformation, Data Visualization, Model Building, Model Deployment, Model Monitoring and Assessment

- to do these tasks, data asset management and code asset management, execution environments, and development environments are needed

Data Management

- Collecting, persisting, and retrieving data securely, efficiently, and cost-effectively

- Data is collected from many sources

Data Integration and Transformation

- Extract, Transform, and Load (ETL)

- Extract the data and save it in a central repository

- Data Transformation is the process of transforming the values, structure and format of data

- Transformed data is loaded back to the data warehouse
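The three ETL steps above can be sketched in a few lines. This is only a minimal illustration using the Python standard library; the CSV contents, the `customers` table, and the function names are made up for the example, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a source system.
RAW_CSV = """name,signup_date,spend
 Alice ,2023-12-01,10.50
BOB,2023-12-02,7
"""


def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean up values and convert types."""
    return [
        (row["name"].strip().title(), row["signup_date"], float(row["spend"]))
        for row in rows
    ]


def load(records, conn):
    """Load: write the transformed records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(name TEXT, signup_date TEXT, spend REAL)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()


# An in-memory database stands in for the data warehouse (assumption).
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT name, spend FROM customers ORDER BY name").fetchall()
print(rows)
```

Real ETL tools add scheduling, connectors, and error handling on top of this basic extract, transform, load shape.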

Data Visualization

- Graphical representation of data and information

- In the form of charts, plots, maps, and animations

- Data visualization conveys data more effectively

Data Visualization Examples

- Bar chart : compares the size of each component

- Treemap : displays hierarchical data

- Line chart : plots a series of data points over time

- Map chart : displays data by location; can also be applied to other kinds of locations, such as websites

Model Building

- step where you train a model on the data and analyze patterns using suitable machine learning algorithms

- create machine learning models using IBM Watson machine learning

Model Deployment

- process of integrating a model into a production environment

- uses APIs to enable data-based decisions

- eg. SPSS Collaboration and Deployment Services
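To make "uses APIs to enable data-based decisions" more concrete, here is a rough sketch of a model wrapped behind a JSON request/response function. The weights, feature names, and `handle_request` function are all hypothetical; a real deployment would sit behind a web server or a managed service rather than a plain function call.

```python
import json

# Hypothetical "trained model": a linear score with fixed weights (assumption).
WEIGHTS = {"age": 0.5, "income": 0.25}
BIAS = -10.0


def predict(features):
    """Score one record and turn the score into a decision."""
    score = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return {"score": score, "approved": score > 0}


def handle_request(body):
    """Parse a JSON request body and return a JSON response string."""
    try:
        features = json.loads(body)
        return json.dumps(predict(features))
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed JSON or missing features end up here.
        return json.dumps({"error": f"bad request: {exc!r}"})


response = handle_request('{"age": 30, "income": 20}')
print(response)
```

The caller only sees JSON in and JSON out, which is what lets other applications consume the model without knowing how it was built.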

Model Monitoring and Assessment

- Model Monitoring tracks deployed models

- Model Assessment checks for accuracy, fairness, and robustness

- IBM Watson OpenScale is a popular Model Monitoring and Assessment tool
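As a rough idea of what tracking a deployed model could look like, the sketch below computes accuracy per batch of predictions and flags batches that fall under a threshold. The batches, the 0.8 threshold, and the function names are assumptions for illustration; this is not how Watson OpenScale works internally, and it covers accuracy only, not fairness or robustness.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def monitor(batches, threshold=0.8):
    """Return (batch_index, accuracy, needs_review) for each batch."""
    report = []
    for index, (predictions, labels) in enumerate(batches):
        acc = accuracy(predictions, labels)
        report.append((index, acc, acc < threshold))
    return report


# Hypothetical batches of (predictions, true labels) collected over time.
batches = [
    ([1, 0, 1, 1, 0], [1, 0, 1, 1, 0]),  # all five correct
    ([1, 1, 1, 0, 0], [1, 0, 1, 1, 0]),  # three of five correct
]
report = monitor(batches)
print(report)
```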

Code Asset Management

- is a unified view where you manage an inventory of assets

- developers use versioning to track and manage changes to a software project's code

- collaboration allows diverse people to share and update the same project together

- eg. GitHub

Data Asset Management

- Platform for organizing and managing the data

- supports replication, backup, and access right management

Development Environments

- Integrated development environments (IDEs) provide a workspace and tools to work on source code

- IDEs like IBM Watson Studio provide testing and simulation tools to emulate the real world so you can see how your code will behave after it is deployed

Execution Environment

- has libraries for code compiling and system resources to execute and verify code

- cloud-based execution environments aren't tied to specific hardware or software

- IBM Watson Studio has tools for data preprocessing, model training, and deployment

Fully-Integrated visual tools

- cover all tooling components, and can be used to develop deep learning and machine learning models

- eg. Watson Studio by IBM, IBM Cognos Dashboard Embedded by IBM

Recap

- Data Science task categories : data management, data integration and transformation, data visualization, model building, model deployment, model monitoring and assessment

- data science tasks are supported by : data asset management, code asset management, execution environments, development environments

Some of the bolded parts do mean "this is important!!", but

I also tend to bold things that are just... new to me?

Open Source Tools for Data Science - Part 1

Data Management

- Open-source data management tools include relational databases like PostgreSQL, NoSQL databases like MongoDB, search engines like Elasticsearch, and file-based storage like Hadoop HDFS, ...

Data integration and transformation

- data integration and transformation in the classic data warehousing world is for ETL or ELT

- also termed Data Refinery and Cleansing

- eg. Apache Airflow, Kubeflow, Kafka, NodeRED

Data Visualization

open source tools :

- supported by programming libraries where you need to use code

- containing a user interface

Model monitoring and assessment

- tools to keep track of a machine learning model's prediction performance as new data arrives, so that outdated models can be detected and updated

- eg. ModelDB, Prometheus, IBM AI Fairness 360 Open Source Toolkit

Code Asset Management

- tools for code asset management, also known as version management or version control

- eg. Git, GitHub, GitLab, Bitbucket

Data Asset Management

- tools for data asset management, also known as data governance or data lineage

- eg. Apache Atlas, ODPi EGERIA

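This was basically just a list of "these tools exist"...

I probably could have skipped it

It's kind of funny how the explanations get so much longer for the IBM tools than for anything else lol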

Open Source Tools for Data Science - Part 2

Jupyter

- supports more than a hundred different programming languages through "kernels"

- kernels encapsulate the execution environment for the different programming languages

- key property of Jupyter Notebook is to unify documentation, code, output from code, shell commands, and visualizations in a single document

RStudio

- one of the oldest development environments for statistics and data science

- built to run R and its associated libraries, although Python development is also possible

- R is tightly integrated into the tool to provide an optimal user experience

- unifies : programming, execution, debugging, remote data access, data exploration, visualization

Apache Spark

- provides cluster execution environment

- provides linear scalability : the more servers in a cluster, the higher the performance
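The idea behind cluster execution is that data is split into partitions, each partition is processed independently, and the partial results are combined. The sketch below shows only that pattern with the Python standard library; it is not Spark code, the function names are made up, and Python threads will not give a real speedup for this kind of CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n):
    """Split items into n chunks of roughly equal size."""
    return [items[i::n] for i in range(n)]


def partial_sum_of_squares(chunk):
    """Work done independently on one partition."""
    return sum(x * x for x in chunk)


def distributed_sum_of_squares(items, workers):
    """Process each partition in its own worker, then combine the results."""
    chunks = partition(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))


data = list(range(1, 101))
total = distributed_sum_of_squares(data, workers=4)
print(total)
```

Because each partition is independent, the result is the same no matter how many workers are used, which is what makes adding more servers possible in the first place.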

This one is also just a list of this and that...

I guess I won't really know until I actually try them ㅠ

Commercial Tools for Data Science

Data Management

- eg. Oracle Database, MSSQL, IBM DB2

- commercial support is delivered by : software vendors, influential partners, support networks

Data Integration and Transformation

- Extract, Transform, and Load (ETL) tools : Informatica, IBM InfoSphere DataStage

- support the design and deployment of ETL data processing pipelines through a graphical interface

- bring along connectors to most of the commercial and open-source target information systems

Data Visualization

- focus of these tools is to create visual reports and live dashboards

- eg. Tableau, MS PowerBI, IBM Cognos Analytics

- visualization can show relationships between different columns in a table

Model Building

- to build a machine learning model with a commercial tool, consider using a data mining product

- eg. SPSS Modeler, SAS Enterprise Miner

Model Deployment

- tightly integrated into the model-building process

- commercial software can export models in an open format

Model monitoring and code asset management

- for both model monitoring and code asset management, open source tools are the first choice

Data asset management

- functions include : data governance, data versioning and annotation, data dictionary, data lineage, data privacy and retention

Development environments (Watson Studio Desktop)

- fully integrated development environment for data scientists

- combines Jupyter Notebooks with graphical tools

- fully integrated visual tools

Cloud Based Tools for Data Science

Fully integrated visual tools and platforms

Large-scale execution of data science workflows happens in compute clusters :

- composed of multiple server machines

- Watson Studio and Watson OpenScale cover the complete development life cycle for all data science, machine learning, and AI tasks

- fully cloud-hosted offering supports the complete development life cycle of all data science, machine learning, and AI tasks

H2O Driverless AI:

- a product that you download and install

- one-click deployment for the common cloud service providers

- cloud provider does not do operations and maintenance

Data Management

- comprises software as a service (SaaS) versions of existing open source and commercial tools

- eg. Amazon DynamoDB : allows storing and retrieving data in a key-value or a document store format
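To picture what "key-value or document store format" means, here is a tiny in-memory stand-in. The `TinyStore` class and the `user#1` key are hypothetical and have nothing to do with the actual DynamoDB API; the point is only that each item is looked up by a key and the value can be a whole nested document.

```python
import json


class TinyStore:
    """In-memory stand-in for a key-value / document store (not a real API)."""

    def __init__(self):
        self._items = {}

    def put(self, key, document):
        # Store the document as JSON text, the way a document store might.
        self._items[key] = json.dumps(document)

    def get(self, key):
        raw = self._items.get(key)
        return None if raw is None else json.loads(raw)


store = TinyStore()
store.put("user#1", {"name": "Alice", "tags": ["r", "python"]})
doc = store.get("user#1")
print(doc)
```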

Data Integration and Transformation

- Comprises ETL and ELT tools

- transformation steps are pushed toward the domain of the data scientist or data engineer

Data Refinery:

- is part of IBM Watson Studio

- allows transforming large amounts of raw data into consumable, quality information in a spreadsheet

Data Visualization

- 3D bar chart : visualizes a target value on the vertical dimension, which is dependent on two other values in the horizontal dimensions

- Hierarchical edge bundling : depicts correlations and affiliations between entities

- 2D scatter plot with a heat map : shows two dependent data fields on the y-axis with different color intensities

- Tree map : shows the distribution of subsets within a set

- Word cloud : pops out significant terms in a document corpus

Model Deployment

- tightly integrated into the model-building process

- commercial software can export models in an open format, such as Predictive Model Markup Language (PMML)

Model monitoring and assessment

- a cloud tool to monitor deployed machine learning and deep learning models continuously

- eg. Amazon SageMaker Model Monitor, IBM Watson OpenScale

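There was more overlap with the sections above than I expected...

And it kind of feels like everything is steered toward the conclusion that IBM's own software is the best!!!

Well, it is a course they made themselves after all...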

Practice Quiz - Data Science Tools

Q. In which Data Science category do you extract, transform, and load data?

A. Data Integration and Transformation

Q. Which open-source tool is the standard used for code asset management?

A. Git

Q. Which open source tool supports more than a hundred different programming languages through "kernels"?

A. Jupyter

Q. Which commercial tool can be used to define and execute data integration processes in a spreadsheet style?

A. Watson Studio Desktop

Q. Which service offers a document data structure format?

A. JSON
