What is Data Science - Data Literacy for Data Science (1)
Data Literacy for Data Science
Understanding Data
What is Data?
- Data is unorganized information that is processed to make it meaningful
- data comprises facts, observations, perceptions, numbers, characters, symbols, and images that can be interpreted to derive meaning
Types of Data
- one of the ways in which data can be categorized is by its structure
- data can be structured, semi-structured, or unstructured
Structured Data
- has a well-defined structure or adheres to a specified data model (adhere : to be attached, to stick to)
- can be stored in well-defined schemas such as databases
- can be represented in a tabular manner with rows and columns (tabular : shown as a table)
- objective facts and numbers that can be collected, exported, stored, and organized
- sources : SQL databases, online transaction processing (OLTP) systems, spreadsheets, online forms, sensors, GPS and RFID, network and web server logs
- can be examined easily with standard data analysis methods and tools
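A minimal sketch of what "well-defined schema" and "standard analysis tools" can look like in practice, using Python's built-in sqlite3 module. The table name, column names, and rows are made up for illustration.

```python
import sqlite3

# In-memory database; the "sales" table and its rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, product TEXT NOT NULL, quantity INTEGER NOT NULL)"
)
rows = [(1, "pen", 10), (2, "notebook", 4), (3, "pen", 6)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Every row follows the same schema, so a plain SQL query can summarize it.
totals = conn.execute(
    "SELECT product, SUM(quantity) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(totals)
conn.close()
```

Each row has the same columns with the same types, which is what lets a generic query language work without any custom parsing.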
Semi-Structured Data
- data that has some organizational properties but lacks a fixed or rigid schema
- cannot be stored in the form of rows and columns as in databases
- contains tags and elements, or metadata, which is used to group data and organize it in a hierarchy
- sources : e-mails, XML and other markup languages, binary executables, TCP/IP packets, zipped files, integration of data from different sources
- XML and JSON allow users to define tags and attributes to store data in a hierarchical form and are used widely to store and exchange semi-structured data
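A small sketch of why semi-structured data does not fit neatly into fixed rows and columns, using Python's built-in json module. The two records below are hypothetical; they share a "name" key but otherwise have different fields, one of them nested.

```python
import json

# Hypothetical records: the first has "phone", the second has a nested "address".
raw = """
[
  {"name": "Kim", "phone": "555-0100"},
  {"name": "Lee", "address": {"city": "Seoul", "zip": "04524"}}
]
"""
records = json.loads(raw)

# .get() with a default copes with keys that exist in some records only.
cities = [r.get("address", {}).get("city", "unknown") for r in records]

for record, city in zip(records, cities):
    print(record["name"], city)
```

The keys act as the tags that describe each value, so the structure travels with the data instead of living in a separate schema.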
Unstructured Data
- does not have an easily identifiable structure
- cannot be organized in a mainstream relational database in the form of rows and columns
- does not follow any particular format, sequence, semantics, or rules
- sources : web pages, social media feeds, images in varied file formats (such as JPEG, GIF, PNG), video and audio files, document and PDF files, PowerPoint presentations, media logs, surveys
- can be stored in files and documents for manual analysis or in NoSQL databases that have their own analysis tools for examining this type of data
To Summarize
- Structured data is data that is well organized in formats that can be stored in databases and lends itself to standard data analysis methods and tools
- Semi-structured data is data that is somewhat organized and relies on meta tags for grouping and hierarchy
- Unstructured data is data that is not conventionally organized in the form of rows and columns in a particular format (conventionally : by convention, in the ordinary way)
Data Sources
Relational Databases
- MSSQL, Oracle, MySQL, IBM DB2
- store structured data that can be leveraged for analysis
Flat File and XML Datasets
- eg. government organizations releasing demographic and economic datasets on an ongoing basis
- companies that sell specific data (eg. point-of-sale data or financial data, weather data)
- which businesses can use to define strategy, predict demand, and make distribution decisions
- typically made available as flat files, spreadsheet files, or XML documents
Flat Files
- store data in plain text format
- each line, or row, is one record
- each value is separated by a delimiter
- all of the data in a flat file maps to a single table
- most common flat file format is .CSV
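A minimal sketch of reading a delimited flat file with Python's built-in csv module. The CSV content is made up, and io.StringIO stands in for a file on disk.

```python
import csv
import io

# Hypothetical CSV text: one header line, then one record per line.
text = "product,price\npen,1.20\nnotebook,3.50\n"

# DictReader uses the first line as column names; comma is the default delimiter.
reader = csv.DictReader(io.StringIO(text))
records = list(reader)

for record in records:
    # Values arrive as strings, so numeric fields need explicit conversion.
    print(record["product"], float(record["price"]))
```

For a file that uses a different separator, such as tabs, the same reader accepts a delimiter argument, for example csv.DictReader(f, delimiter="\t").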
Spreadsheet files
- special type of flat files
- organize data in a tabular format
- can contain multiple worksheets
- .XLS or .XLSX are common spreadsheet formats
- other formats include Google Sheets, Apple Numbers, and LibreOffice Calc
XML files
- contain data values that are identified or marked up using tags
- can support complex data structures (such as hierarchical ones)
- common uses include online surveys, bank statements, and other unstructured data sets
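A short sketch of reading tagged values from an XML document with Python's built-in xml.etree.ElementTree module. The survey document, its tag names, and its attribute names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical survey export: each <response> nests a question and an answer.
xml_text = """
<survey>
  <response id="1">
    <question>satisfaction</question>
    <answer>4</answer>
  </response>
  <response id="2">
    <question>satisfaction</question>
    <answer>5</answer>
  </response>
</survey>
"""
root = ET.fromstring(xml_text)

# Map each response's id attribute to its numeric answer.
answers = {
    response.get("id"): int(response.findtext("answer"))
    for response in root.findall("response")
}
print(answers)
```

The nesting of tags is what carries the hierarchy: each answer belongs to a response, and each response belongs to the survey.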
APIs and Web Services
- APIs (Application Programming Interfaces) and web services provide an interface that multiple users or applications can interact with to obtain data for processing or analysis
- typically listen for incoming requests, and return data in plain text
Popular examples of APIs
- Twitter and Facebook APIs for customer sentiment analysis
- Stock Market APIs for trading and analysis
- Data Lookup and Validation APIs for cleaning and correlating data
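The request and response cycle of a data API can be sketched without touching the network. Everything specific here is an assumption: the endpoint URL, the parameter names, and the shape of the JSON response are hypothetical and do not describe any real service.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint; no real API is being described.
BASE_URL = "https://api.example.com/v1/quotes"


def build_request_url(symbol, limit):
    """Encode query parameters into the URL a client would request."""
    return BASE_URL + "?" + urlencode({"symbol": symbol, "limit": limit})


def parse_response(body):
    """Extract prices from a JSON body of the assumed shape."""
    payload = json.loads(body)
    return [item["price"] for item in payload["quotes"]]


url = build_request_url("ACME", 2)

# A canned body stands in for what such a service might return.
sample_body = '{"quotes": [{"price": 10.5}, {"price": 11.0}]}'
prices = parse_response(sample_body)

print(url)
print(prices)
```

In a real client the canned body would be replaced by the text returned from an HTTP request to the provider's documented endpoint, usually with an API key attached.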
Web scraping
- extract relevant data from unstructured sources
- also known as screen scraping, web harvesting, and web data extraction
- downloads specific data based on defined parameters
- can extract text, contact information, images, videos, product items, and more...
- popular uses : collecting product details from retailers, generating sales leads through public data sources, extracting data from posts and authors, ...
- popular web scraping tools : BeautifulSoup, pandas, Selenium, Scrapy
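A rough sketch of the "download specific data based on defined parameters" idea, using only Python's built-in html.parser instead of the tools listed above. The HTML snippet and its class names are made up; a real page would be fetched first and would need its own selection rules.

```python
from html.parser import HTMLParser


class ProductNameParser(HTMLParser):
    """Collect the text inside <span class="name"> elements."""

    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs.
        if tag == "span" and ("class", "name") in attrs:
            self.in_name = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_name = False

    def handle_data(self, data):
        if self.in_name and data.strip():
            self.names.append(data.strip())


# Hypothetical product listing markup.
html = """
<ul>
  <li><span class="name">Pen</span> <span class="price">1.20</span></li>
  <li><span class="name">Notebook</span> <span class="price">3.50</span></li>
</ul>
"""
parser = ProductNameParser()
parser.feed(html)
print(parser.names)
```

Libraries such as BeautifulSoup wrap this kind of tag matching in a friendlier query interface, which is one reason they are preferred for larger jobs.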
Data Streams and feeds
- aggregated streams of data flowing from instruments, IoT devices and applications, GPS data from cars, computer programs, websites, and social media posts (aggregate : to gather into a total)
- data streams that can be leveraged include : stock and market tickers for financial trading, retail transaction streams for predicting demand and supply chain management, surveillance and video feeds for threat detection, social media feeds for sentiment analysis, sensor data feeds for monitoring industrial or farming machinery, ...
- popular technologies used to process data streams include : Apache Kafka, Apache Spark, Apache Storm
- RSS (Really Simple Syndication) feeds : capture updated data from online forums and news sites where data is refreshed on an ongoing basis
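Stream processing differs from batch analysis in that values are handled as they arrive. The sketch below imitates that with a plain Python generator and a rolling mean; the readings are made up, and a production system would use one of the streaming technologies named above instead.

```python
from collections import deque


def sensor_readings():
    """Stand-in for a live feed; yields hypothetical temperature readings."""
    yield from [20.0, 21.0, 23.0, 22.0, 26.0]


def rolling_mean(stream, window):
    """Yield the mean of the most recent `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # old readings drop off automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


means = list(rolling_mean(sensor_readings(), window=3))
print(means)
```

Only the last few readings are kept in memory, which is the property that lets this pattern keep up with a feed that never ends.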
Viewpoints : Working with Varied Data Sources and Types
Taking a break here...
I'll work this out by trying it hands-on...
For now, please just hire me...
Reading : Metadata
What is Metadata?
- metadata is data that provides information about other data
Technical Metadata
- technical metadata is metadata which defines the data structures in data repositories or platforms, primarily from a technical perspective
- eg. tables that record information about the tables stored in a database like : each table's name, the number of columns and rows each table has
- is typically stored in specialized tables in the database called the System Catalog
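As one concrete case, SQLite keeps its catalog in a table named sqlite_master. The sketch below, using Python's built-in sqlite3 module, reads table names and column counts from it; the two tables created here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables, created only so the catalog has something to describe.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)

# sqlite_master holds one row per table, index, view, or trigger.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]

# PRAGMA table_info returns one row per column of the given table.
column_counts = {
    name: len(conn.execute(f"PRAGMA table_info({name})").fetchall())
    for name in table_names
}
print(column_counts)
conn.close()
```

Other database systems expose the same kind of information under different names, so the queries above apply to SQLite only.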
Process Metadata
- describes the processes that operate behind business systems such as data warehouse, accounting systems, or customer relationship management tools
- process metadata for such systems includes tracking things like : process start and end times, disk usage, where data was moved from and to, how many users access the system at any given time
- invaluable for troubleshooting and optimizing workflows and ad hoc queries (ad hoc : improvised, on the spot)
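A small sketch of how a pipeline step might record process metadata such as start and end times. The function name run_with_metadata, the record fields, and the example step are all hypothetical choices, not a standard.

```python
import time
from datetime import datetime, timezone


def run_with_metadata(step_name, func, *args):
    """Run one step and return its result plus a record describing the run."""
    started = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - t0
    record = {
        "step": step_name,
        "started_at": started.isoformat(),
        "ended_at": datetime.now(timezone.utc).isoformat(),
        "elapsed_seconds": elapsed,
        "rows_out": len(result),
    }
    return result, record


# Hypothetical step: drop empty strings from a small batch.
cleaned, log_record = run_with_metadata(
    "drop_empty", lambda rows: [r for r in rows if r], ["a", "", "b"]
)
print(log_record)
```

Collected over many runs, records like this are what make it possible to spot a step that has become slow or has started dropping rows.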
Business metadata
- information about the data described in readily interpretable ways (interpretable : able to be understood)
- eg. how the data is acquired, what the data is measuring or describing, the connection between the data and other data sources
- serves as documentation for the entire data warehouse system
Managing Metadata
- includes developing and administering policies and processes to ensure that information can be accessed and integrated from various sources and appropriately shared across the entire enterprise
- creation of a reliable, user-friendly data catalog is a primary objective of a metadata management model
- the data catalog is a core component of a modern metadata management system, serving as the main asset around which metadata management is administered
Why is Metadata Management Important?
- having access to a well implemented data catalog greatly enhances data discovery, repeatability, governance, and can also facilitate access to data
- well-managed metadata helps you to understand both the business context associated with the enterprise data and the data lineage, which helps to improve data governance (governance : management, way of managing)
- data governance is a data management concept concerning the capability that enables an organization to ensure that high data quality exists throughout the complete lifecycle of the data, and data controls are implemented that support business objectives
- key focus areas of data governance include availability, usability, consistency, data integrity and data security and includes establishing processes to ensure effective data management throughout the enterprise
Popular tools for metadata management
- IBM InfoSphere Information Server
- CA Erwin Data Modeler
- Oracle Warehouse Builder
Summary
- metadata is data that provides information about other data, and includes three main types : technical, process, and business metadata
- the technical metadata for relational databases is typically stored in specialized tables in the database called the system catalog
- a primary objective of business metadata management modelling is the creation and maintenance of a reliable, user-friendly data catalog
- having access to a well-implemented data catalog greatly enhances data discovery, repeatability, governance, and can also facilitate access to data
Practice Quiz : Metadata
Q. What is the main purpose of business metadata?
A. To facilitate data discovery for business-minded users
Q. Where is the technical metadata for relational databases typically stored?
A. In specialized tables in the database called the System Catalog
Q. What is one of the key focus areas of data governance?
A. Availability of data
Lesson Summary : Understanding Data
Structured data
- adheres to a model
- stored in well-defined schemas
- represented in tables
- schemas define relationships between tables
Semi-structured data
- lacks fixed structure
- uses metadata to organize it
- metadata provides information about the data
Metadata
- also needs managing
- usually stored in a data catalog
- a data catalog enhances data discovery, repeatability, governance, and access
Unstructured data
- heterogeneous (heterogeneous : made up of several different kinds)
- comes from a broad range of sources
- variety of business intelligence and analytics applications
- analyzing unstructured data often requires artificial intelligence to gain insights
Data sources
- anything with an electronic footprint
- automatic or manual storage
- older analog records need converting to electronic formats for effective mining and processing
Finding useful data for analysis
- organization's internal applications
- databases
- public and private data sets
- proprietary data sets
- usually provided as a flat file such as CSV
Flat file Formats
- XML, JSON
- readable by humans and machines
- do not have a predefined schema
- easily transferred between evolving data structures
Application Programming interface (API)
- how you access data in cloud-based software
- JSON data often requires RESTful APIs
- data providers provide API to access their data
- Twitter and Facebook both provide APIs to source data from posts for performing tasks such as customer satisfaction analysis
Data transfer
- gathering and managing data often done by data engineers
- data sets data scientists use can be massive
- large data sets stem from IoT applications, sensor data, and social media
Data Specialists
- data scientists need to know the modern-day data ecosystem for : organizing, storage, manipulation, retrieval
Practice Quiz - Understanding Data
Q. Which of the following is an example of structured data?
A. Spreadsheets such as Excel
Q. What is one common characteristic of flat files and spreadsheet files?
A. They store data in a structured way using tables.
Q. What is one common use of web scraping?
A. Collecting training and testing datasets for machine learning models