Glossary

This glossary covers general definitions in the data and civic technology sector along with how they can be applied within an open data program and the California open data community. The goal is to create a standard vocabulary to enhance communication, foster collaboration and improve interoperability of California’s data sets.

Where appropriate, definitions include a label guiding users to the most relevant category on Data.ca.gov, where additional documentation is available.



Categories

General Open Data Terminology
Data Prioritization
Preparing Data for Publication
Metadata
Data Dictionary
Data Publishing Process
Open Data Platform Account Hierarchy

API (Application Programming Interface)

Software that allows machine-to-machine communication over the internet. For data, APIs let applications read just the data they need directly, without downloading an entire dataset, which saves bandwidth and helps ensure that the data used is the most up-to-date available.
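
As a minimal sketch of the "read just the data they need" idea, the snippet below builds a request URL that asks an API for a small, filtered slice of a dataset instead of the whole file. The endpoint `https://example.org/api/records` and the parameter names `resource_id`, `county`, and `limit` are hypothetical placeholders, not the actual Data.ca.gov API; the real interface is described in the portal's own API documentation.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://example.org/api/records"


def build_query_url(resource_id, county, limit=5):
    """Return a URL that requests only a few rows for one county."""
    params = {
        "resource_id": resource_id,  # which table to read (hypothetical name)
        "county": county,            # keep only matching rows (hypothetical name)
        "limit": limit,              # cap the number of rows returned
    }
    return BASE_URL + "?" + urlencode(params)


url = build_query_url("abc123", "Yolo", limit=5)
print(url)
```

An application would send this URL with an HTTP client and receive only the matching rows, typically as JSON.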

Category: General Open Data Terminology

CA Public Records Act Request (PRA Request)

The California Public Records Act (Statutes of 1968, Chapter 1473; currently codified as California Government Code §§ 6250 through 6276.48) is a law passed by the California State Legislature and signed by the Governor in 1968. The law defines which state and local government records are open to public inspection. A PRA (Public Records Act) request is a verbal or written request that a person makes to a government entity to obtain specific public records. State agencies generally have ten days to respond to a PRA request; the steps that follow depend on the state organization.

Category: General Open Data Terminology

CSV (comma separated values) File

A standard format for spreadsheets where data is stored in a plain text file, with each data row on a new line and commas separating the values on each row. As a simple open format it is easily read by computers and is widely used for publishing open data.
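
To illustrate the layout described above, here is a small sketch that parses CSV text with Python's standard `csv` module. The column names and values are made up for the example.

```python
import csv
import io

# A tiny CSV document: the first line is the header,
# and each later line is one data row.
CSV_TEXT = "county,year,wells\nYolo,2020,12\nKern,2020,30\n"


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(CSV_TEXT)
for row in rows:
    print(row["county"], row["wells"])
```

Note that every value is read as text; converting `wells` to a number is left to the program that uses the data.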

Category: Preparing Data for Publication

Data Anonymization

When data is too detailed, it may directly or indirectly reveal personally identifiable information about a person. To prevent this, it is important to anonymize data before it is published, typically by removing the confidential information from a dataset. See the California Health and Human Services de-identification guidelines (PDF). Also see De-Identification.

Category: Preparing Data for Publication

Data Coordinator

The Data Coordinator works under the authority and permission of the Executive Sponsor of their respective Group. The Data Coordinator identifies data, oversees the Data Stewards and has the authority to approve the publishing of datasets on the portal. In a combined role, the Data Coordinator can also serve as the Data Steward. See the visualization of the portal platform roles.

Category: Open Data Platform Account Hierarchy

Data Portal

A web platform for publishing data. A data portal provides a data catalog, which makes data not only available but also findable for data users, and offers a publishing workflow for organizations. Typical features are web interfaces for publishing and for searching and browsing the catalog, machine interfaces (APIs) to enable automatic publishing from other systems, and data preview and visualization tools or applications. Data.ca.gov is an example of a data portal.

Category: General Open Data Terminology

Data Steward

The Data Steward is the person most knowledgeable about the data, including its sources, collection methods, and limitations. The Data Steward should be able to write metadata describing the data and answer questions about the data or the program the data derives from. The Data Steward cleans, uploads, and edits the data of a Group on Data.ca.gov and communicates with the Data Coordinator for publishing approval. Sometimes, the Data Steward is also the Data Coordinator, who has authority to publish the data. See the visualization of the portal platform roles.

Category: Open Data Platform Account Hierarchy

Data Users

Any individual or organization that accesses, downloads, or analyzes data, or that uses data to develop apps, visualizations, reports, and other information products or services. Data users are important stakeholders for your open data projects.

Category: General Open Data Terminology

Dataset

A dataset is any organized collection of data. The most basic dataset is composed of data elements in a table. Each column represents a particular variable. Each row is one record, holding a value for each column's variable. A dataset may also present information in a variety of non-tabular formats, such as an Extensible Markup Language (XML) file, a geospatial data file, or an image file. Dataset is a flexible term and may refer to an entire database, a spreadsheet or other data file, or a related collection of data resources.

Category: Preparing Data for Publication

De-Identification

De-identification is the process of deleting or masking personal identifiers, such as name and Social Security number, and suppressing or generalizing quasi-identifiers, such as date of birth and ZIP code, prior to publishing the data on Data.ca.gov. This prevents published information from being connected back to the person it is about. See Data Anonymization.
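
The two steps named above (removing direct identifiers and generalizing quasi-identifiers) can be sketched as follows. The field names and the specific generalizations (birth year only, first three ZIP digits) are assumptions chosen for illustration; this is not a complete or approved de-identification method, and published guidelines should be followed for real data.

```python
# Fields treated as direct identifiers in this illustration (hypothetical names).
DIRECT_IDENTIFIERS = {"name", "ssn"}


def deidentify(record):
    """Return a copy with direct identifiers dropped and quasi-identifiers generalized."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Generalize a date of birth (YYYY-MM-DD) to the birth year only.
    if "date_of_birth" in out:
        out["birth_year"] = out.pop("date_of_birth")[:4]
    # Generalize a 5-digit ZIP code to its first three digits.
    if "zip_code" in out:
        out["zip3"] = out.pop("zip_code")[:3]
    return out


record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "date_of_birth": "1980-07-04",
    "zip_code": "95814",
    "visits": 3,
}
safe_record = deidentify(record)
print(safe_record)
```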

Category: Preparing Data for Publication

Executive Sponsor

Within the California Open Data Portal program, the Executive Sponsor role is a director-level position or an organization’s executive who has the authority to recommend, approve and publicize data collected by the state entity they represent. The Executive Sponsor identifies a Data Coordinator to carry out the vetting and publishing of their organization’s data.

Category: Open Data Platform Account Hierarchy

Flat Files

Flat file is an informal term for a single table of data from which all word processing or other structure characters or markup have been removed. A flat file stores data in plain text format. Because of their simple structure, flat files can only be read, stored and sent. CSV files are one of the most common types of flat files.

Category: Preparing Data for Publication

GitHub

GitHub is a code-hosting platform for version control and collaboration. It allows users to work together on projects from any location and supports open source development.

Category: General Open Data Terminology

Group

A Group is the representation of a state organization on Data.ca.gov. Each Group has a minimum of one Data Coordinator, who authorizes publishing of the data with approval from the Executive Sponsor of the organization. Groups can also have Data Stewards who upload the data, metadata and data dictionaries, but do not have the authority to publish the data. Both Data Stewards and Data Coordinators have permission to upload data and resources and to edit the Group's description. See the visualization of the portal platform roles.

Category: Open Data Platform Account Hierarchy

Machine Readable

Information or data that is in a format that can be easily processed by a computer without human intervention. To be machine-readable, data must be structured in an organized way. CSV, JSON, and XML, among others, are formats that contain structured data that a computer can automatically read and process.

Non-digital materials such as photos and handwritten documents are not machine-readable even when scanned. For example, a PDF document containing scanned tables of data is digital but is not machine-readable because the tables are still simply images.

Category: Preparing Data for Publication

Open Data

Data is open if it can be freely accessed, used, modified and shared by anyone for any purpose (http://opendefinition.org/). For Data.ca.gov, open data is regularly updated and comes from an authoritative source.

Category: General Open Data Terminology

Portal Owner

A Portal Owner is the entity that has direct ownership of an Open Data Portal. This can be at the statewide, agency, department or local government level.

Category: General Open Data Terminology

Shapefile

The shapefile format is a popular geospatial vector data format for geographic information system (GIS) software. The shapefile format can spatially describe vector features: points, lines, and polygons, representing, for example, water wells, rivers, and lakes.

Category: Preparing Data for Publication

Structured Data

Structured data refers to information with a high degree of organization, making the data readily searchable by search engines. For publishing consistency, a tall table layout (one row per observation) is preferred over a wide layout (one column per category or time period).
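
The difference between tall and wide layouts can be sketched as follows: a wide table has one column per year, while the tall version has one row per county and year. The function name `wide_to_tall` and the sample values are hypothetical and serve only to illustrate the two shapes.

```python
def wide_to_tall(rows, id_column, variable_name="year", value_name="value"):
    """Turn one-column-per-period rows into one-row-per-observation rows."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tall row
            tall.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tall


# Wide layout: each year is its own column.
wide = [
    {"county": "Yolo", "2019": 10, "2020": 12},
    {"county": "Kern", "2019": 28, "2020": 30},
]

# Tall layout: each row is one county-year observation.
tall = wide_to_tall(wide, id_column="county", value_name="wells")
for row in tall:
    print(row)
```

In the tall layout, adding a new year means adding rows rather than a new column, so the column headings stay the same from one release to the next.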

Category: Preparing Data for Publication

Topic

The main theme or category of a data resource. Differing from tags, a topic describes the data resource within Data.ca.gov. Examples are Water, Recycling, or Buildings.

Category: Data Publishing Process

Visualization

A visual representation of data, such as a chart, graph or dashboard. A visualization is often the easiest way of communicating with data and bringing out its key features. Many visualization tools exist, such as Google Charts, Excel, ArcGIS, Tableau, and PowerBI. Creating a dataset's visualization requires careful attention to the meaning of the variables, the relations between them and the stories inherent in the data, in order to design a visual representation that lets the message of the data shine through.

Category: General Open Data Terminology

Application

A piece of software, designed to run on the web or on mobile phones, that connects to large databases. Applications are a way of consuming open data, and can be real-time, personalized, and location-specific. Crowdsourcing applications can also be used to build, edit, or improve datasets. "App" is shorthand for "Application."

Category: General Open Data Terminology

Catalog

A catalog is a collection of datasets or data resources. Data.ca.gov has one catalog for all types of datasets at https://data.ca.gov/search/type/dataset. The catalog contains both geospatial and non-geospatial datasets.

Category: General Open Data Terminology

Data

Data include lists, tables, graphs, charts, and images. Data may be structured or unstructured, and can be geospatial or location-neutral. Data become "information" when analyzed and possibly combined in order to extract meaning and to provide context and insight.

Category: Preparing Data for Publication

Data Cleaning or Scrubbing

Various processes to make a data resource easier to use. Data cleaning may involve fixing inconsistencies and errors, removing formatting that is not machine-readable (such as visual styles), using standard labels for row and column headings, ensuring that numbers, dates, and other quantities are represented appropriately, or converting the table to a preferred file format.
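
A few of these cleaning steps can be sketched in code. The column names, the input date format (month/day/year), and the choice of ISO 8601 dates as the target are assumptions for illustration, not requirements stated in this glossary.

```python
from datetime import datetime


def clean_header(label):
    """Standardize a column label: trim, lowercase, and join words with underscores."""
    return "_".join(label.strip().lower().split())


def clean_number(text):
    """Convert a formatted quantity such as '$1,250.50' to a float."""
    return float(text.replace("$", "").replace(",", "").strip())


def clean_date(text, input_format="%m/%d/%Y"):
    """Rewrite a date as YYYY-MM-DD (ISO 8601)."""
    return datetime.strptime(text.strip(), input_format).date().isoformat()


def clean_record(raw):
    """Apply the helpers above to one row with hypothetical columns."""
    record = {clean_header(key): value for key, value in raw.items()}
    record["report_date"] = clean_date(record["report_date"])
    record["total_cost"] = clean_number(record["total_cost"])
    return record


raw_row = {" Report Date ": "7/4/2020", "Total  Cost": "$1,250.50"}
print(clean_record(raw_row))
```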

Category: Preparing Data for Publication

Data Dictionary

For Data.ca.gov, the data dictionary is a list of the variables in the data resource with their format (text, number, date, etc.), an explanation of what values the user will find there, and, if only certain values can be entered for that variable, a list of those values. Please refer to the CA Open Data Publisher Guide on how to create a data dictionary.
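
As an illustration only, a data dictionary with the parts described above (variable, format, explanation, and permitted values) could be represented and used to check a row like this. The structure, the key names, and the sample variables are hypothetical; the CA Open Data Publisher Guide remains the reference for the actual layout.

```python
# One entry per variable; "allowed_values" is None when any value is permitted.
DATA_DICTIONARY = [
    {"variable": "county", "format": "text",
     "description": "County name", "allowed_values": None},
    {"variable": "year", "format": "number",
     "description": "Reporting year", "allowed_values": None},
    {"variable": "status", "format": "text",
     "description": "Well status", "allowed_values": ["active", "inactive"]},
]


def find_problems(row, dictionary):
    """List variables that are missing or hold a value outside the allowed list."""
    problems = []
    for entry in dictionary:
        name = entry["variable"]
        if name not in row:
            problems.append(f"missing variable: {name}")
            continue
        allowed = entry["allowed_values"]
        if allowed is not None and row[name] not in allowed:
            problems.append(f"{name}: {row[name]!r} is not an allowed value")
    return problems


good_row = {"county": "Yolo", "year": 2020, "status": "active"}
bad_row = {"county": "Kern", "status": "unknown"}
print(find_problems(good_row, DATA_DICTIONARY))
print(find_problems(bad_row, DATA_DICTIONARY))
```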

Category: Data Dictionary

Data Resource

For Data.ca.gov, the individual tables, data dictionaries, and visualizations that comprise a dataset, along with the associated metadata to make them findable and usable.

Category: Data Dictionary

Data Story

Organizations often use open data to tell a story. By adding context, both history and implications, as well as visualizations and examples, to an analysis, organizations can make their open data more relevant, timely and actionable for stakeholders. As a result, data stories can help describe or influence policy, organizational or business decisions; greenbuildings.ca.gov is one example. Data stories can also be a narrative that explores and explains how and why data changes over time, usually through a series of linked visualizations.

Category: General Open Data Terminology

Database

A database can be a software system for processing and managing data, including features to update, transform and query the data. Examples are PostgreSQL (open source) and Microsoft Access (proprietary). A database can also refer to a set of data.

Category: General Open Data Terminology

Datastore

From the DKAN documentation: DKAN Datastore bundles a number of modules and configuration to allow users to upload CSV files, parse them and save them into the native database as flat tables, allowing users to query them through a public API. To get the fullest functionality possible out of your datasets, you should add your CSV resources to the datastore.

Category: Preparing Data for Publication

DKAN

DKAN is the Drupal-based open source data platform that Data.ca.gov uses for its open data efforts. DKAN allows governments to publish data to the public, provide visualizations and data stories and create internal analytics dashboards.

Category: Data Publishing Process

File Format

The file format refers to the internal arrangement (format) of the file, not how it is displayed to users. For example, CSV and XLS files are structured very differently, but may look similar or identical when opened in a spreadsheet program. The format is usually indicated by the last part of the filename, known as the extension.

Category: Preparing Data for Publication

GIS (Geographic Information System)

A system designed to capture, store, manipulate, analyze, manage, and present all types of geographic location data, allowing the user to question, analyze, and interpret data to understand relationships, patterns, and trends. GIS information is stored in layers of spatial data in a format that can be stored, manipulated, analyzed, and mapped.

Category: General Open Data Terminology

Granularity

Granularity is the degree of specificity or scale of the data in a data resource. The more granularity, the greater the level of detail in the data, allowing more flexible processing of the data by users. For example, census blocks are smaller than census tracts. There is more data available about a census tract, but the data for a census block is more detailed because it represents a much smaller geographic area.
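
One practical consequence of granularity is that detailed data can be rolled up to a coarser level, while coarse data cannot be split back into its parts. The sketch below sums block-level counts into tract-level totals, relying on the fact that a 15-digit census block GEOID begins with the 11-digit GEOID of its tract. The GEOIDs and counts are made-up sample values.

```python
from collections import defaultdict


def tract_of(block_geoid):
    """Return the 11-digit tract GEOID that prefixes a 15-digit block GEOID."""
    return block_geoid[:11]


def aggregate_to_tracts(block_counts):
    """Sum block-level counts into tract-level totals."""
    totals = defaultdict(int)
    for block_geoid, count in block_counts.items():
        totals[tract_of(block_geoid)] += count
    return dict(totals)


# Made-up block-level values for illustration.
block_counts = {
    "061130101011000": 40,
    "061130101011001": 25,
    "061130102031000": 60,
}
tract_counts = aggregate_to_tracts(block_counts)
print(tract_counts)
```

Going the other way is not possible: a tract total of 65 does not reveal how the count was distributed across its blocks.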

Category: General Open Data Terminology

JSON (JavaScript Object Notation)

A simple format for data that can describe complex data structures, is both machine-readable and somewhat human-readable, is independent of platform and programming language, and has become a format for data exchange between apps, programs and computer systems.
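
As a brief illustration, the sketch below reads a nested JSON document with Python's standard `json` module and writes it back out as text. The record and its field names are made up for the example.

```python
import json

# A JSON document with a list and a nested object.
JSON_TEXT = (
    '{"dataset": "Water Wells", '
    '"tags": ["water", "wells"], '
    '"publisher": {"name": "Example Department", "contact": null}}'
)

# Parse the text into Python dictionaries, lists, strings, and None.
record = json.loads(JSON_TEXT)
print(record["tags"][1])
print(record["publisher"]["name"])

# Serialize the structure back to JSON text for exchange with another system.
round_trip = json.dumps(record, sort_keys=True)
print(round_trip)
```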

Category: Preparing Data for Publication

Metadata

Metadata is information about a dataset that makes the data easier to find or identify. Metadata includes the title and description, method of collection, limitations, author, publisher, area and time period covered, license, and date and frequency of release. Metadata describes the dataset's structure, data elements, creation, access, format, and content.

Category: Metadata

PDF (Portable Document Format)

PDF is a multi-platform file format. A PDF file provides an electronic image of text, or text and graphics, that looks like a printed document and can be viewed, printed, and electronically transmitted. A PDF is not machine-readable and is not considered open data. See Machine Readable.

Category: Data Publishing Process

Resources

Within Data.ca.gov, resources are the actual files, APIs or links that are being shared through the portal. Resource types include csv, html, xls, json, xlsx, doc, docx, rdf, txt, jpg, png, gif, tiff, pdf, odf, ods, odt, tsv, geojson and xml files. If the resource is an API, it can be used as a live source of information for building a site or application.

Category: Data Publishing Process

Spatial

Spatial is a metadata term that means the dataset has locational (geographic) information such as coordinates, address, city, or ZIP code.

Category: Metadata

Tag

A tag is a keyword or term assigned to a piece of information or a file. This type of metadata helps describe an item and allows it to be found by browsing or searching.

Category: Data Publishing Process

Unstructured Data

Unstructured data (or unstructured information) is information that either does not have a predefined data model or is not organized in a pre-defined manner, such as a flat file. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

Category: Preparing Data for Publication

XML

Extensible Markup Language (XML) is a flexible file format designed to store, transport and share data over the Internet. XML is both human- and machine-readable.
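
To show what "machine-readable" means in practice for XML, the sketch below parses a small document with Python's standard `xml.etree.ElementTree` module. The element and attribute names are made up for the example.

```python
import xml.etree.ElementTree as ET

# A small XML document: one root element containing two records.
XML_TEXT = """<dataset title="Water Wells">
  <record><county>Yolo</county><wells>12</wells></record>
  <record><county>Kern</county><wells>30</wells></record>
</dataset>"""


def wells_by_county(xml_text):
    """Return a mapping of county name to well count from the XML text."""
    root = ET.fromstring(xml_text)
    return {
        record.findtext("county"): int(record.findtext("wells"))
        for record in root.findall("record")
    }


counts = wells_by_county(XML_TEXT)
print(counts)
```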

Category: Preparing Data for Publication
