Protopia AI Glossary
Automatic differentiation is a method used to efficiently and accurately compute the gradient of a function with respect to its inputs. The gradient provides information about the rate of change of the function in relation to its inputs and is used in optimization algorithms to update the parameters of a model.
Automatic differentiation is used in machine learning and artificial intelligence to compute gradients for training models. It is beneficial for training deep neural networks, where the computation graph can be complex, and the gradients are required for many optimization algorithms.
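As a minimal sketch of the idea, forward-mode automatic differentiation can be written with "dual numbers" that carry a value together with its derivative. The `Dual` class and the `derivative` helper below are hypothetical names chosen for illustration; production frameworks typically use reverse mode and far more complete operator coverage.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    value: float
    deriv: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: (u + v)' = u' + v'
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (u * v)' = u' * v + u * v'
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants, whose derivative is 0."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def sin(x):
    # Chain rule: sin(u)' = cos(u) * u'
    return Dual(math.sin(x.value), math.cos(x.value) * x.deriv)


def derivative(f, x):
    """Evaluate df/dx at x by seeding the input's derivative with 1."""
    return f(Dual(x, 1.0)).deriv


def f(x):
    return 3 * x * x + 2 * x + 1  # analytically, f'(x) = 6x + 2


print(derivative(f, 2.0))  # 14.0
```

The derivative is exact up to floating-point rounding because each arithmetic operation applies its differentiation rule directly, rather than approximating with finite differences.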
Computational cost refers to the computing resources required to complete a specific task. These resources can be memory, computation time, bandwidth, or any other resource required to carry out any computation.
Solutions can be evaluated using computational costs to determine their efficiency and effectiveness. Computational cost is often combined with the run time complexity of software to determine the overall efficiency of the software design and implementation.
Confidential Computing offers a hardware-based security solution designed to protect data during use with application-isolation technology. The Confidential Computing Consortium was formed in 2019, with participation from major companies like Google, Intel, Meta, and Microsoft, to advocate for hardware-based Trusted Execution Environments (TEEs). These TEEs isolate computations, providing security for the data being used. However, the main drawback of this approach is the increased cost for companies to migrate their machine learning services onto platforms that offer these specialized hardware infrastructures. It’s also worth noting that while it provides security benefits, confidential computing is not entirely without risks.
Data access is the ability for users to access their data given physical, software, or legal and policy-driven constraints. Data access can be fine-grained and allows users to carry out various operations such as creating, replacing, updating, deleting, or moving data for ML and AI operations.
The level of access granted to a user can vary depending on the data’s sensitivity and the user’s role in the organization. Data access can be managed by using authentication and authorization mechanisms, such as user accounts, passwords, and access control lists, to ensure that only authorized users can access sensitive data. Effective data access management helps to ensure the confidentiality, integrity, and availability of data and is critical for protecting sensitive information and maintaining the security of an organization’s information systems.
Data anonymization protects private or sensitive information by erasing or encrypting identifiers that connect an individual to the data.
For example, you can run Personally Identifiable Information (PII) such as names and addresses through a data anonymization process that retains the data but keeps the source anonymous. Data masking and data perturbation are some techniques to anonymize data.
Data governance is the overall management of the availability, usability, integrity, and security of the data used in an organization. It involves the development of policies, procedures, and standards for collecting, storing, processing, and sharing data, in addition to ensuring that the data is of high quality and protected from unauthorized access.
The goal of data governance is to ensure that data is effectively used to support decision-making, meet regulatory obligations, and achieve the organization’s goals. Data governance involves many stakeholders, including IT, business leaders, and data stewards, who collaborate to manage and govern data as a critical organizational asset.
Data labeling identifies objects in raw data such as images, text, videos, and audio. The goal is to attach one or more informative labels that give the data context so that a machine learning model can learn from it.
For example, labels might indicate whether a photo contains a train or a horse, which words were said in a voice recording, or if a lab image contains a tumor. Data labeling is required for various use cases, including computer vision, natural language processing, and speech recognition. Supervised machine learning models are built on large volumes of high-quality labeled data.
Data masking obscures or replaces sensitive information in a dataset to minimize exposure while maintaining the data’s functional value. The purpose is to prevent AI models from learning and incorporating sensitive information, such as personal identification numbers, financial data, or medical records, which could be used for malicious purposes or violate privacy laws.
Masked data can be used for training and analytics purposes while maintaining the confidentiality and security of the original data. Different data masking techniques include substitution and encryption.
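The following is a minimal sketch of two masking approaches: partial obscuring and substitution. The function names and the list of replacement names are hypothetical, and the hash-based substitution is only one assumed way to keep the mapping consistent across records.

```python
import hashlib


def mask_card_number(card_number):
    """Keep the last four digits and replace every other digit with '*'."""
    digits = [c for c in card_number if c.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def substitute_name(name, replacements):
    """Swap a real name for a fictitious one, consistently for the same input."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return replacements[digest[0] % len(replacements)]


FAKE_NAMES = ["Alex Doe", "Sam Roe", "Pat Poe"]  # illustrative values only

print(mask_card_number("4111 1111 1111 1234"))  # ************1234
print(substitute_name("Jane Smith", FAKE_NAMES))
```

Because the substitution is deterministic, the same original name always maps to the same replacement, which preserves joins and counts in the masked dataset. A small replacement list like this one will map many different names to the same substitute.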
Data ownership refers to the rights and responsibilities of individuals or organizations in the collection, storage, use, and distribution of data. It defines who has the legal right to control the access to and use of specific data and who is responsible for ensuring its security, accuracy, and compliance with relevant laws and regulations.
In organizations, data ownership is typically assigned to specific individuals or departments, such as the chief information security officer (CISO) or the data custodian, who is responsible for managing and safeguarding the data. Data ownership can also be shared between multiple individuals or organizations, such as in the case of partnerships.
Clearly defining data ownership is important for ensuring the security, accuracy, and privacy of data, as well as for managing the risk associated with the collection, storage, and use of data. It also helps to ensure that data is used ethically and transparently and that any conflicts over data access or use are resolved fairly and efficiently.
Data perturbation changes an original dataset by applying techniques such as rounding numbers and adding random noise. The rounding base or noise range needs to be in proportion to the range of values in the dataset. A small base may lead to weak anonymization, while a large base can reduce the dataset’s utility.
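The two techniques can be sketched as follows. This is an illustration only: the helper names, the base of 5, and the noise scale of 2.0 are assumptions chosen for the example, not recommended settings.

```python
import random


def round_to_base(value, base):
    """Round a value to the nearest multiple of `base`."""
    return base * round(value / base)


def add_noise(value, scale, rng):
    """Add uniform random noise in the range [-scale, +scale]."""
    return value + rng.uniform(-scale, scale)


ages = [23, 37, 41, 58]
rng = random.Random(0)  # seeded only so the example is repeatable

rounded = [round_to_base(a, 5) for a in ages]
noisy = [add_noise(a, 2.0, rng) for a in ages]

print(rounded)  # [25, 35, 40, 60]
print(noisy)
```

A larger base or noise scale hides individual values more thoroughly but moves the perturbed data further from the original, which is the trade-off between anonymization strength and utility.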
Data control refers to the measures and processes put in place to manage the access, use, and dissemination of data within an organization. It involves defining who can access and use specific data, what actions they can perform, and under what circumstances. Data control is important for maintaining the confidentiality, integrity, and security of data, as well as for ensuring that data is used responsibly and lawfully.
Data control can be implemented through a combination of technical, administrative, and physical controls, such as access control lists, firewalls, encryption, and data backup and recovery processes.
Data redaction refers to removing certain pieces of information from data, designed to keep that data from being linked to individuals or used for wrongdoing. With data redaction, sensitive, confidential, or personally identifiable information is removed, or in some instances, blacked out.
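A minimal sketch of pattern-based redaction is shown below. The two regular expressions are simplified assumptions (a US Social Security number format and a basic email shape) and would miss many real-world variants.

```python
import re

# Simplified, illustrative patterns; real redaction needs broader coverage.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text):
    """Replace matched identifiers with a placeholder."""
    text = EMAIL_PATTERN.sub("[REDACTED EMAIL]", text)
    text = SSN_PATTERN.sub("[REDACTED SSN]", text)
    return text


print(redact("Contact jane@example.com about SSN 123-45-6789."))
# Contact [REDACTED EMAIL] about SSN [REDACTED SSN].
```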
Data sharing is making data available to other individuals or organizations. It involves data exchange between individuals, groups, or organizations, either within the same organization or across multiple organizations. Data sharing can occur in many different forms, such as file sharing, database sharing, cloud sharing, and API sharing.
Data tokenization is a process by which sensitive data is replaced by a non-sensitive substitute known as a token. The token corresponds to the sensitive data through an external data tokenization system. Data can be tokenized and de-tokenized as often as needed with approved access to the system.
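The relationship between a token and its original value can be sketched with an in-memory lookup table. `TokenVault` is a hypothetical stand-in for the external tokenization system; a real system would add access control, persistence, and auditing.

```python
import secrets


class TokenVault:
    """In-memory stand-in for an external tokenization system (hypothetical)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return the existing token for a value, or issue a new random one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Look up the original value; only the vault can do this."""
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("123-45-6789")
print(token)                    # a random token such as tok_...
print(vault.detokenize(token))  # 123-45-6789
```

Unlike encryption, the token here is random and has no mathematical relationship to the original value, so it cannot be reversed without access to the vault.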
Data-type agnostic is the property of a system, process, or algorithm that can handle and process different types of data without any bias towards any particular type of data. This means that the system, method, or algorithm can handle both structured and unstructured data, as well as data in various formats, such as text, images, audio, and video, without requiring any special treatment or changes.
Data-type agnostic systems are becoming increasingly crucial when organizations deal with large and diverse data sets containing a wide range of data types. Data-type agnostic systems are designed to be flexible and adaptable and to handle large and complex data sets with ease, making them well-suited for modern data-driven organizations.
In data engineering and machine learning, a data pipeline refers to the steps involved in extracting, transforming, loading, and processing data. The goal of a data pipeline is to automate the flow of data from raw data sources to the final storage location, enabling organizations to quickly and easily access and analyze their data.
Data pipelines can be built using various tools and technologies, including batch processing frameworks, data processing engines, and cloud-based data platforms. Having an effective data pipeline can help organizations improve the quality and accuracy of their data, gain insights and make better decisions, and increase the efficiency and speed of their data processing.
Deployment data refers to the data used to deploy and run a machine-learning model in a production environment. This data includes information about the hardware and software resources the model will be deployed on, the input and output data formats, and any pre-processing or post-processing steps that must be performed.
Deployment data for machine learning models and AI also includes information about how the model will be integrated into a larger system or application. This may include information about the APIs or services that will be used to access the model, as well as any security or access controls that need to be put in place.
Having accurate and complete deployment data is important for ensuring machine learning models are deployed correctly and perform well in production. Effective management of deployment data helps to ensure that machine learning models and AI can be deployed and used effectively in real-world applications.
Differential privacy is a system for sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals. Differential privacy bounds how much a single data record from a training dataset can contribute to a machine learning model. This limits what a membership test on the training data records can reveal: if a single data record is removed from the dataset, the output should not change beyond a certain threshold.
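One common way to achieve this property for simple queries is the Laplace mechanism, which the entry above does not name. The sketch below applies it to a counting query. The function names are hypothetical, and `epsilon` is the privacy parameter that plays the role of the threshold: smaller values add more noise.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that satisfy the predicate."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity is 1 and the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1.0 / epsilon, rng)


ages = [34, 45, 29, 61, 52]
rng = random.Random(0)  # seeded only so the example is repeatable
print(private_count(ages, lambda age: age > 40, epsilon=1.0, rng=rng))
```

The true count here is 3, and each call returns a different value scattered around it, so no single record's presence can be inferred with confidence from the output.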
Differentiable programming is a technique in machine learning and artificial intelligence that involves training models using gradient-based optimization. Differentiable programs are programs that rewrite at least one of their components by optimizing along a gradient, as neural networks do, using optimization algorithms such as gradient descent.
Differentiable programming makes it possible to train models with large amounts of data and complex architectures, such as deep neural networks.
An edge system is a computing system that is located at the “edge” of a network, close to the source of the data. Edge systems are built to process, analyze, and act on data in real-time, without transmitting all of the data to a centralized processing system. Edge systems are used in various applications, including the Internet of Things (IoT), where devices at the edge of the network generate and collect large amounts of data. By processing this data at the edge, edge systems can reduce the amount of data that needs to be transmitted to centralized systems, reducing latency, improving response times, and reducing the demands on centralized computing resources.
Edge systems may include a range of technologies and devices, including sensors, gateways, embedded computers, and smart devices. They may be powered by various processors and platforms, ranging from microcontrollers and single-board computers to powerful servers and cloud computing platforms. By processing data at the edge, edge systems can help organizations to make decisions in real-time, to respond quickly to changing conditions, and to provide a more responsive and effective service to their customers.
Encryption is the process of converting plaintext data into a coded form of data that can only be unlocked by someone who has the appropriate key. The purpose of encryption is to protect sensitive information from unauthorized access and to ensure that data is transmitted securely across networks or stored safely on devices. Encryption algorithms use mathematical formulas to scramble the data, and the key is used to unscramble the data when needed.
Encryption is used in various applications, including digital communications, file storage, and secure online transactions. It is also used to protect sensitive data both in transit, such as financial information, and at rest, such as data stored on smartphones. Encryption is an essential tool for protecting privacy and data security and is a critical component of many security and privacy protocols. However, encryption can also be a source of complexity, and it is vital to understand the strengths and weaknesses of different encryption algorithms and to use encryption appropriately to protect sensitive information.
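As a toy illustration of scrambling data with a key and unscrambling it with the same key, the sketch below uses a one-time pad (XOR with a random key as long as the message). This is an assumption chosen for simplicity, not a recommendation: real systems should rely on vetted algorithms such as AES through an established cryptography library, and a one-time-pad key must never be reused.

```python
import secrets


def generate_key(length):
    """Create a random key with one key byte per message byte."""
    return secrets.token_bytes(length)


def xor_bytes(data, key):
    """XOR data with the key; the same call both encrypts and decrypts."""
    if len(key) != len(data):
        raise ValueError("key must be the same length as the data")
    return bytes(d ^ k for d, k in zip(data, key))


message = b"account balance: 1200"
key = generate_key(len(message))

ciphertext = xor_bytes(message, key)    # scrambled form
recovered = xor_bytes(ciphertext, key)  # requires the same key

print(ciphertext.hex())
print(recovered)  # b'account balance: 1200'
```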
Federated Learning is a machine learning technique that enables training on a decentralized dataset distributed across multiple devices. Instead of sending data to a central server for processing, the training occurs locally on each device, and only model updates are transmitted to a central server.
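The training loop can be sketched for a one-parameter model `y ≈ w * x`. Each simulated device trains on its own data and returns only its updated weight, which the server combines by a sample-weighted average. The averaging rule, the function names, and the learning rate are assumptions made for illustration; the entry above does not specify an aggregation method.

```python
def local_update(w, data, lr=0.05, epochs=5):
    """Run a few gradient-descent steps on one device's private data."""
    for _ in range(epochs):
        # Gradient of the mean squared error for y ≈ w * x
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_w, client_datasets):
    """Collect locally trained weights and average them by sample count."""
    updates = [(local_update(global_w, data), len(data)) for data in client_datasets]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Each inner list stays on its own device; all points follow y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]

w = 0.0
for _ in range(20):
    w = federated_round(w, clients)

print(round(w, 4))  # 2.0
```

Only the weight `w` crosses the device boundary in each round; the `(x, y)` pairs are never sent to the server.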
Gradient-based optimization is a method used in machine learning and artificial intelligence to update the parameters of a model to minimize a loss function. The loss function measures the error between the model’s predictions and the actual outputs and is used in the optimization process.
In gradient-based optimization, the gradient of the loss function with respect to the model parameters is computed using automatic differentiation. The gradient provides information about how the model parameters should be updated to reduce the loss.
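A minimal sketch of the update loop for a linear model `y ≈ w * x + b` with a mean-squared-error loss follows. To keep the example dependency-free, the gradients are derived by hand rather than by automatic differentiation, and the learning rate and step count are assumed values.

```python
def loss_and_gradients(w, b, data):
    """Mean squared error and its partial derivatives with respect to w and b."""
    n = len(data)
    loss = sum((w * x + b - y) ** 2 for x, y in data) / n
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return loss, grad_w, grad_b


def gradient_descent(data, lr=0.1, steps=1000):
    """Repeatedly move the parameters against the gradient to reduce the loss."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        _, grad_w, grad_b = loss_and_gradients(w, b, data)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # generated from y = 2x + 1
w, b = gradient_descent(data)
print(round(w, 4), round(b, 4))  # 2.0 1.0
```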
Generative AI is a type of artificial intelligence that creates new data or content based on the patterns and structures it has learned from existing data. Generative AI produces new instances that share similarities with the input data and can include text, art, music, and design. By understanding the underlying patterns and structures in the input data, generative AI can develop new and unique output.
Homomorphic encryption is a form of encryption that allows computations to be performed directly on ciphertext without first decrypting the data. With homomorphic encryption, data remains encrypted throughout the computation, and the result of the computation is also encrypted. Homomorphic encryption enables computations on encrypted data without exposing the underlying sensitive information.
There are two main types of homomorphic encryption: fully homomorphic encryption and partially or somewhat homomorphic encryption. The first allows for any computations on encrypted data, while the latter only supports limited computations.
Homomorphic encryption can be slow and computationally expensive, to the point that it is often not yet commercially practical.
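The limited (partially homomorphic) case can be illustrated with textbook RSA, where multiplying two ciphertexts yields an encryption of the product of the plaintexts. The parameters below are a well-known toy example and are far too small, and lack the padding, to be secure. This is a sketch of the property only, not of a fully homomorphic scheme.

```python
# Toy textbook-RSA parameters (insecure; for illustration only).
p, q = 61, 53
n = p * q  # 3233
e = 17     # public exponent
d = 2753   # private exponent: e * d ≡ 1 (mod (p - 1) * (q - 1))


def encrypt(m):
    return pow(m, e, n)


def decrypt(c):
    return pow(c, d, n)


a, b = 7, 6

# Multiply the ciphertexts without ever decrypting a or b.
product_cipher = (encrypt(a) * encrypt(b)) % n

print(decrypt(product_cipher))  # 42
```

The party performing the multiplication sees only ciphertexts; the result is readable only by the holder of the private exponent `d`.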
Inference data is used to make predictions or inferences with a trained machine learning (ML) model. It is the input data that is fed into the trained model, which then produces a prediction based on the relationships it has learned from the training data.
Inference data can differ from the training data used to train the ML model. For example, the training data might be historical data from a certain period, while the inference data might be real-time data from the present.
Jupyter Notebook is an open-source interactive computing platform that allows users to create and share documents that contain live code, equations, visualizations, and narrative text. Jupyter Notebook supports various programming languages, including Python, R, Julia, and Scala.
Jupyter Notebooks provide a convenient way to perform data analysis and scientific computing, allowing users to mix code, output, and explanations in a single document. They are widely used in data science, machine learning, and scientific computing for tasks such as data cleaning and transformation, and statistical modeling.
Labeled data refers to data that has already been annotated or categorized with labels or tags that describe the content of the data. This data type is commonly used to train machine learning models for specific tasks, such as image classification, object detection, and natural language processing.
Machine learning algorithms use labels to learn the relationships between the inputs and outputs and to make predictions about new, unseen data. Labeled data is typically collected and annotated by human experts, who assign labels to each data sample based on its content.
Large language model (LLM)
A large language model (LLM) is a form of artificial intelligence that relies on massive amounts of textual data to understand and produce language similar to humans. LLMs learn from the data by identifying patterns, structures, and connections among words and phrases, enabling them to answer questions, compose text, and perform a variety of other language-related tasks. They are used for tasks such as text generation, translation, and summarization. ChatGPT is an example of an LLM-based application.
Machine Learning Models
A machine learning model is a mathematical representation of a system capable of learning from data and making predictions or decisions. A machine learning model aims to capture patterns and relationships in data in a way that allows it to make accurate predictions on new and unseen data.
Common types of machine learning models include linear and logistic regression, decision trees, random forests, support vector machines, and neural networks. The choice of model depends on the type of problem being solved and the data’s attributes.
A machine learning model consists of a set of parameters learned from training data and an algorithm that uses the parameters to make predictions. Once a model has been trained, it can be used to make predictions on new, real-world data. Machine learning models can be used for various applications, including image classification, speech recognition, and natural language processing (NLP). Machine learning models are being adopted as tools for automating decision-making and for discovering patterns and insights in data.
Model drift, also known as concept drift, refers to a phenomenon in machine learning where the distribution of the data changes over time, and the trained model’s performance degrades. This happens when the relationship between the input features and the target variable changes – causing the model to make incorrect predictions.
For example, consider a model trained to predict the performance of a football team based on its historical data. Over time, the team managers and players might change, causing the relationship between the team’s data and its performance to change. In this case, the model trained on the historical data might not perform well on new data and would require retraining to maintain its accuracy.
Model drift can lead to decreased accuracy, false predictions, and incorrect decisions. To mitigate model drift, it is important to continuously monitor the performance of a model and retrain it as necessary. Techniques such as drift detection can be used to detect model drift.
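As a minimal sketch of drift detection, the check below flags a shift in the mean of a single input feature between a reference window (for example, training data) and a recent window. The function name, the z-score test, and the threshold of 3.0 are assumptions chosen for illustration; practical monitoring usually tracks many features and the model's accuracy as well.

```python
from statistics import mean, stdev


def mean_shift_detected(reference, recent, threshold=3.0):
    """Flag drift when the recent mean is far from the reference mean."""
    ref_std = stdev(reference)
    if ref_std == 0:
        return mean(recent) != mean(reference)
    # Standard error of the recent window's mean, assuming the reference spread
    standard_error = ref_std / len(recent) ** 0.5
    z_score = abs(mean(recent) - mean(reference)) / standard_error
    return z_score > threshold


reference = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
recent_stable = [10.0, 9.9, 10.1, 10.2]
recent_shifted = [11.0, 11.2, 10.9, 11.1]

print(mean_shift_detected(reference, recent_stable))   # False
print(mean_shift_detected(reference, recent_shifted))  # True
```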
MLOps, an abbreviation for Machine Learning Operations, is a set of practices and processes for managing the end-to-end lifecycle of machine learning models. It includes tasks such as model development, testing, deployment, monitoring, and maintenance. The goal is to ensure the reliable and efficient delivery of machine learning models into production.
MLOps aims to address some of the challenges associated with deploying and maintaining machine learning models in production, such as model governance, model monitoring, automating the deployment of models into production and updating them as needed, and model lifecycle management.
The machine learning (ML) life cycle refers to the stages in building, deploying, and maintaining a machine learning model. The problem that the ML model will solve needs to be clearly defined and understood. This includes identifying the type of data that will be used, the desired outcome, and the metrics that will be used to evaluate the model’s performance.
The ML life cycle can consist of, but is not limited to, the following stages: data collection, data labeling, model selection, training, evaluation, deployment, monitoring, and maintenance. The goal of the ML life cycle is to develop and deploy machine learning models that are accurate, reliable, and can scale to deliver value to businesses or society.
Model deployment refers to making a trained ML model accessible and usable in a real-world production environment by integrating it into a production system and monitoring its performance. The goal of model deployment is to ensure that the ML model can be used to make accurate predictions on new data in a reliable, scalable, and efficient manner.
The steps involved in deploying an ML model can vary but typically include:
- ML model packaging: The ML model is packaged in a format easily integrated into the production system, such as a REST API or Python package.
- Integration with production system: The packaged ML model is integrated into the production system to receive input data and produce predictions.
- Performance monitoring: The ML model is monitored in real-world conditions to ensure that it is making accurate predictions and functioning correctly.
- Maintaining and updating: The deployed ML model may require updates and maintenance over time to address issues such as model drift, changes in the data distribution, or changes in the underlying technology.
Natural Language Processing (NLP)
Natural Language Processing (NLP) concerns interactions between computers and human (natural) languages. The goal is to enable computers to process, understand, and generate human language. NLP involves various tasks, such as text classification, sentiment analysis, translation, question-answering, and text summarization. These tasks are typically achieved through machine learning algorithms and deep learning models, such as neural networks.
Examples of NLP applications include chatbots, virtual assistants, language translation services, and text-to-speech systems.
Neural networks are a type of machine learning model loosely inspired by the structure and function of the human brain. Neural networks are designed to recognize patterns and relationships in data and make predictions based on this information. A neural network comprises collections of interconnected nodes called artificial neurons, organized into layers, which process and transmit information. Each artificial neuron receives inputs, performs a computation, and produces an output that is passed to the subsequent layers of neurons. The connections between the neurons are associated with weights and biases, which are adjusted during the training process to improve the accuracy of the predictions.
Neural networks are widely used in various applications of AI, such as image recognition, speech recognition, and natural language processing. They are helpful in solving complex, non-linear problems and can be trained on large and diverse datasets, allowing them to learn from a wide range of data.
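The computation performed by neurons and layers can be sketched as a forward pass through a tiny network. The weights, biases, layer sizes, and the choice of a sigmoid activation are arbitrary assumptions for illustration; in practice the weights and biases are learned during training.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1 / (1 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)


def layer(inputs, weight_rows, biases):
    """A layer is a list of neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


def forward(inputs):
    """Two inputs -> hidden layer of two neurons -> one output neuron."""
    hidden = layer(inputs, [[0.5, -0.2], [0.3, 0.8]], [0.0, -0.1])
    output = layer(hidden, [[1.0, -1.0]], [0.0])
    return output[0]


print(forward([1.0, 2.0]))
```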
The optimization step is the iteration process of finding the best set of parameters or weights for a machine learning model to predict the outputs based on the inputs accurately. The process involves adjusting the model parameters to minimize a loss function, which measures the difference between the predicted outputs and the actual outputs. The goal is to find the set of parameters that minimizes the loss function and results in the most accurate predictions.
Several optimization algorithms can be used for this purpose, including gradient descent and stochastic gradient descent. The choice of optimization algorithm depends on the specific requirements of the machine learning problem and the type of model being used. The optimization step is a key part of the machine learning life cycle, as it determines the best parameters for the model to achieve good performance and accuracy.
Open banking is a financial services model that allows third-party providers, such as fintech companies, to access bank customers’ financial data with their consent. The goal of open banking is to promote competition and innovation in the financial services industry by enabling customers to securely share their financial data with trusted providers who can offer a broader range of financial products and services.
Open banking is typically implemented through application programming interfaces (APIs), which allow third-party providers to access bank customers’ financial data in a secure and controlled environment. Banks must make their APIs available to third-party providers, who can develop new financial products and services that leverage this data. Open banking raises concerns about the security and privacy of customer data and the potential for financial fraud and abuse. Open banking is typically subject to strong regulatory oversight and requires banks and third-party providers to implement robust security and privacy measures.
Personally Identifiable Information (PII)
Personally Identifiable Information, abbreviated as PII, refers to any information that can be used to identify a specific individual, such as name, address, driver’s license number, etc. When training AI models, it is important to protect PII as it is sensitive information and must be handled in compliance with privacy regulations, such as the General Data Protection Regulation (GDPR) or the Health Insurance Portability and Accountability Act (HIPAA).
PyTorch is an open-source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. PyTorch builds its computational graph dynamically, which means that the user can change the graph on the fly at runtime. Many users find this easier to work with than static computational graph frameworks such as TensorFlow 1.x.
PyTorch is known for being user-friendly and flexible. It provides APIs for building and training neural networks, as well as tools for data loading and preprocessing. PyTorch also supports a wide range of features for computer vision, natural language processing, and reinforcement learning. PyTorch has a growing ecosystem of tools and libraries.
Responsible AI is a set of practices and principles designed to ensure and promote the safe and ethical use of AI, accounting for aspects such as the safety and benefit of users and organizations, compliance with legal regulations, fairness, transparency, privacy, and accountability, to name a few.
Sensitive data is confidential, private, or protected by regulations. This may include personal information, financial information, medical information, and other types of data that must be protected from unauthorized access or misuse.
Examples of sensitive data include:
- Personal information such as names, addresses, government ID numbers, and dates of birth.
- Financial information such as bank account numbers and credit card numbers.
- Behavioral data such as internet search history and purchasing patterns.
- Medical data that contains diagnoses and test results.
The handling and protection of sensitive data is a key issue in the development and deployment of AI systems because any unauthorized access or misuse of this information can have catastrophic consequences, such as identity theft, financial fraud, and loss of privacy. To protect sensitive data, AI practitioners employ techniques such as Stained Glass Transformation™, data anonymization, data perturbation, or other methods to reduce the risk of unauthorized use. They may also implement strict data access controls and monitor the use of sensitive data to ensure that it is being used responsibly in accordance with legal requirements.
Structured data refers to data that is organized in a tabular format with well-defined columns and rows. It is often stored in databases, spreadsheets, or other data structures designed to be easily processed and analyzed. Examples of structured data include data from transactional systems, such as purchase histories and inventory records, and demographic data, such as age and income. Structured data can be easily input into machine learning algorithms for training and prediction. Machine learning algorithms are also optimized for structured data and can often achieve high accuracy and performance when trained on this data type.
Not all relevant information can be captured in structured data. Incorporating unstructured data, such as text, images, and audio can provide additional valuable insights and opportunities for machine learning models. Structured and unstructured data are often combined and used in machine learning for optimal results. Structured data can also be referred to as tabular data.
Synthetic data is artificially generated data that is used to mimic real-world data. Synthetic data is often used for testing and training machine learning models, for benchmarking and performance evaluation, and for protecting sensitive data in applications such as healthcare and finance.
It is generated using algorithms that model the statistical and structural properties of real-world data. This allows synthetic data to closely resemble real-world data in its distribution, relationships, and patterns while still being wholly artificial and not containing any sensitive information. Synthetic data generation can produce large amounts of data quickly and bypass the privacy and security risks associated with using real-world data. It is important to evaluate the use of synthetic data carefully and to understand its limitations in various applications.
Semi-structured data does not follow a strict data model because it does not have a fixed schema. Unlike structured (tabular) data, it lacks a rigid form. This type of data typically contains elements of both structured and unstructured data and is often found in databases and data warehouses. Examples of semi-structured data include JSON and XML files, which have both structured elements, such as key-value pairs, and unstructured elements, such as text and multimedia data. Processing and analyzing this type of data requires specialized techniques and solutions capable of handling both structured and unstructured data.
To effectively process and analyze semi-structured data for machine learning purposes, it is often necessary to first pre-process and convert the data into a format that can be used as input for machine learning algorithms. This may involve techniques such as data labeling. Semi-structured data can provide insights and opportunities for machine learning models, as it often contains rich and diverse information.
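One typical pre-processing step is flattening nested records into a single row of named columns. The sketch below does this for a JSON document; the `flatten` helper, the dotted column names, and the sample record are hypothetical.

```python
import json


def flatten(record, prefix=""):
    """Turn nested dictionaries into one flat dictionary with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


raw = '{"id": 7, "user": {"name": "Ada", "address": {"city": "London"}}, "note": "called twice"}'
row = flatten(json.loads(raw))

print(row)
# {'id': 7, 'user.name': 'Ada', 'user.address.city': 'London', 'note': 'called twice'}
```

The key-value pairs become table-like columns, while free-text fields such as `note` remain unstructured and may need further processing before they can be used as model input.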
Tabular data refers to data organized in a table format with rows and columns. Each row represents an instance or an observation, and each column represents a feature or an attribute of that observation. Tabular data is commonly used as input for machine learning algorithms. It is easy to process and analyze using traditional data management techniques, and machine learning algorithms are often optimized for this data type. Tabular data is synonymous with structured data.
Not all relevant information can be captured in tabular data. Incorporating other data types, such as text, images, and audio, can provide additional valuable insights and opportunities for machine learning models and AI.
TensorFlow is an open-source software library for dataflow and differentiable programming across various tasks. It is a platform for building machine learning models, focusing on the training and inference of deep neural networks.
TensorFlow allows users to define and execute computations as a graph of operations on tensors, which are multi-dimensional arrays. The library provides a range of tools for implementing and training machine learning models, as well as for deploying them in production environments. TensorFlow supports a wide range of platforms, from individual computers to cloud-based systems, and can be used for both research and production purposes.
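The define-then-execute idea can be illustrated with a toy graph in plain Python. This sketch deliberately does not use TensorFlow's API. The `Node` class and the `constant`, `placeholder`, `add`, `mul`, and `run` helpers are hypothetical names invented for this example, and the "tensors" are simple one-dimensional lists.

```python
class Node:
    """One vertex in a toy dataflow graph (illustrative, not TensorFlow)."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value


def constant(value):
    return Node("const", value=value)


def placeholder(name):
    return Node("placeholder", value=name)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(node, feed):
    """Evaluate the graph rooted at node; feed supplies placeholder values."""
    if node.op == "const":
        return node.value
    if node.op == "placeholder":
        return feed[node.value]
    left, right = (run(n, feed) for n in node.inputs)
    if node.op == "add":
        return [l + r for l, r in zip(left, right)]
    if node.op == "mul":
        return [l * r for l, r in zip(left, right)]
    raise ValueError(f"unknown op: {node.op}")


# Step 1: define the graph y = x * w + b. Nothing is computed yet.
x = placeholder("x")
w = constant([2.0, 3.0])
b = constant([1.0, 1.0])
y = add(mul(x, w), b)

# Step 2: execute the graph with concrete input values.
result = run(y, {"x": [10.0, 20.0]})
print(result)
```

Separating graph construction from execution is what lets a framework analyze, optimize, or differentiate the computation before running it.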
Torch is an open-source machine learning library and scientific computing framework built on the Lua programming language. It provides an easy-to-use and flexible platform for building and training machine learning models, and it offers several tools for constructing neural networks as well as for data loading and preprocessing.
Torch is focused on speed and efficiency, making it suitable for large-scale machine-learning tasks. Torch also provides a number of pre-trained models and a large ecosystem of packages and libraries, making it easier to build and train new models.
Training data is a set of data used to teach a machine learning model to make predictions or perform a specific task. It includes input and output data that the model uses to learn patterns and relationships in the data. During the training process, the model adjusts its parameters to minimize the error between its predictions and the actual outputs, with the goal of performing well on new, unseen data.
A training model refers to a quantitative representation of a problem that is used to learn patterns and relationships in training data. A training model is designed to optimize its parameters for accurate predictions on unseen data. The training process involves providing the model with a set of labeled examples and adjusting the parameters of the model based on the accuracy of its predictions.
Once the model is trained, it can make predictions on new, unseen data by applying the learned patterns and relationships to the input data. There are many different types of training models in machine learning, including linear models, decision trees, and neural networks. The choice of model will depend on the problem being solved, the nature of the data, and the desired level of accuracy. Training a model is a fundamental step in developing machine learning systems, enabling the model to learn from data and make accurate predictions on new data.
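The training loop described above can be sketched with the simplest case, a one-variable linear model fitted by gradient descent on mean squared error. The labeled examples, learning rate, and epoch count are made-up values chosen so the sketch converges. This is one possible training procedure, not the only one.

```python
def predict(w, b, x):
    """Linear model: a weight and a bias are the only parameters."""
    return w * x + b


def train(examples, lr=0.05, epochs=2000):
    """Adjust w and b to reduce mean squared error on labeled examples."""
    w, b = 0.0, 0.0
    n = len(examples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in examples) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in examples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up labeled examples (input, output) that follow y = 2x + 1.
examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = train(examples)

# Apply the learned parameters to an input the model has not seen.
prediction = predict(w, b, 10.0)
print(round(w, 3), round(b, 3), round(prediction, 3))
```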
Task-agnostic refers to algorithms or models that can be applied to a variety of tasks rather than being designed for one specific task. The same model or algorithm can be used for different problems, such as classification, regression, or generation. Task-agnostic approaches can be applied to a wider range of problems with little or no additional modification. They also make it easier to develop and implement machine learning systems, as they can be trained on multiple tasks and reused across different applications.
Examples of task-agnostic approaches in machine learning include transfer learning and multi-task learning. In transfer learning, a pre-trained model is fine-tuned on a new task, allowing the model to leverage its existing knowledge and reduce the amount of data required for training. In multi-task learning, a single model is trained on several tasks at once so that what it learns for one task can benefit the others.
A trust boundary refers to a clear distinction between the parts of a system that are trusted to behave correctly and securely and those that are not. This distinction is essential in AI systems, as it helps ensure that sensitive information is handled, and critical decisions are made, only by trusted components of the system, reducing the risk of bad actors compromising the system or causing it to make incorrect decisions.
Examples of trust boundaries in AI systems include:
- Data processing: Trust boundaries can be established to ensure that sensitive data is only processed by trusted system components and is not accessed or manipulated by external parties.
- Model training: Trust boundaries can be established to ensure that machine learning models are only trained on trusted data and that outside parties do not compromise the training process.
- Model inference: Trust boundaries can be established to ensure that machine learning models are only used for prediction and decision-making purposes by trusted components of the system and that untrusted parties do not manipulate the predictions made by the models.
Establishing and maintaining trust boundaries in AI systems is essential for ensuring their reliability and security and for maintaining the trust of users and stakeholders in the system.
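As a minimal sketch of the model inference case, the example below validates untrusted input at the boundary before it reaches a trusted component. The request format, the limits, and the names `validate`, `trusted_model`, and `handle_request` are hypothetical. A real system would also need authentication, logging, and other controls that are out of scope here.

```python
EXPECTED_FEATURES = 3          # assumed input size for this sketch
VALUE_RANGE = (0.0, 1.0)       # assumed valid range for each feature


def trusted_model(features):
    """Stand-in for a model that runs inside the trust boundary."""
    return sum(features) / len(features)


def validate(payload):
    """Check untrusted input at the boundary; raise ValueError if malformed."""
    if not isinstance(payload, dict):
        raise ValueError("payload must be a dict")
    features = payload.get("features")
    if not isinstance(features, list) or len(features) != EXPECTED_FEATURES:
        raise ValueError("features must be a list of the expected length")
    low, high = VALUE_RANGE
    for value in features:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("features must be numeric")
        if not (low <= value <= high):  # also rejects NaN
            raise ValueError("feature out of range")
    return [float(v) for v in features]


def handle_request(payload):
    """Only validated data crosses the boundary into the trusted model."""
    features = validate(payload)
    return {"score": trusted_model(features)}


print(handle_request({"features": [0.2, 0.4, 0.6]}))
```

Keeping validation in one function at the boundary makes it easier to audit exactly what untrusted data is allowed through.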
Unstructured data refers to data that cannot be easily processed and analyzed using traditional machine learning algorithms due to its lack of a predefined format. This type of data is often unorganized and difficult to work with, but it can also contain valuable information and insights that can be used to train machine learning models and inform decision-making. Examples of unstructured data in machine learning include text data, such as emails, documents, and social media posts, and multimedia data, such as images, audio, and video files. Processing and analyzing this data type often requires specialized techniques and tools, such as natural language processing (NLP) and computer vision, to extract meaningful information from the data.
To effectively process and analyze unstructured data for machine learning purposes, it is often necessary to pre-process and clean the data and then convert it into a format that can be used as input for machine learning algorithms. This may involve techniques such as text or image normalization, feature extraction, and data labeling.
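For text, normalization and feature extraction can be sketched as lowercasing, stripping punctuation, and counting words. The documents below are made up, and the bag-of-words representation is just one simple choice of features. The helper names `normalize` and `bag_of_words` are illustrative.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text, replace punctuation with spaces, split to tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word appears in the tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


# Made-up unstructured inputs.
documents = [
    "Great product, GREAT price!",
    "Terrible support... not great.",
]

tokenized = [normalize(doc) for doc in documents]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
vectors = [bag_of_words(tokens, vocabulary) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a fixed-length numeric vector, a form that typical machine learning algorithms can take as input.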
Unlabeled data is data that has not been tagged with labels identifying its characteristics, properties, or classifications. It has no labels or targets to predict, only the features that describe each example. Unlabeled data can still be useful in training a model.