BIM and Data Science
BIM, Data Warehouse, and Data Science Workflow.
Key Questions
Data science involves extracting information from various data sources and formats (in the context of BIM, for example, multidisciplinary BIM models, 4D, 5D, and COBie) to compile the data needed to solve real-world problems in a specific domain. This step is often referred to as the Extract Transform Load (ETL) process. The key in data science is to ask questions that the data can answer. There are five key questions that data science answers, as shown in Figure 2.
Five questions that data science can answer.
Extract Transform Load (ETL)
ETL is the process used to extract information from various resources and formats and transform that raw data into a uniform data structure, which is then suitable for data warehousing and data integration. Figure 3 shows examples of building-related information that can undergo the ETL process before being stored in a data warehouse. BIM databases usually contain three types of data: structured, semi-structured, and unstructured. Each of these needs to go through the ETL process, described below:
Extract: The initial part of the ETL process involves extracting valuable BIM data from a variety of data sources such as BIM models (e.g., model elements such as doors, levels, and spaces), Excel spreadsheets, HTML files, and flat files. The extraction step may also need to draw on dark data (data that is not used for other purposes).
Transform: The next step is transformation, one of the most important steps in preparing proper data for the target data warehouse. A range of functions is applied to the extracted data, including identifying the data types, cleaning the data, finding missing values, and designating the desired columns. These functions ensure the quality of the data before it is loaded into the data warehouse in the uniform data structure.
Examples of building-related information that can be extracted, transformed, and loaded into a data warehouse.
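The extract and transform steps described above can be sketched in a few lines of Python. This is a minimal illustration only: the CSV content, the column names (`element_id`, `category`, `level`, `width_mm`), and the in-memory list standing in for a data warehouse are all assumptions made for the example, not part of any particular BIM tool or export format.

```python
import csv
import io

# Hypothetical flat-file export: inconsistent casing, stray whitespace, missing values.
RAW_CSV = """element_id,category,level,width_mm
101, Door ,Level 1,900
102,door,Level 2,
103,Space,level 2,n/a
"""

def extract(text):
    """Extract: read rows from a flat-file export into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def to_number(value):
    """Return a float, or None when the value is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def transform(rows):
    """Transform: trim text, normalize casing, and type the numeric column."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "element_id": int(row["element_id"]),
            "category": row["category"].strip().title(),
            "level": row["level"].strip().title(),
            "width_mm": to_number(row["width_mm"]),
        })
    return cleaned

def load(rows, warehouse):
    """Load: append uniform records to an in-memory stand-in for a warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(extract(RAW_CSV)), [])

# Elements whose width could not be read are flagged rather than silently dropped.
missing_width = [r["element_id"] for r in warehouse if r["width_mm"] is None]
print(missing_width)
```

Keeping missing values as `None` instead of discarding the rows means later steps can still report them as missing information.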
Machine Learning Algorithms
Machine learning is a subfield of data science (see Figure 1). Information that has undergone the ETL process and is stored in a data warehouse facilitates machine learning. Machine learning depends on a dataset and algorithms to predict answers to questions without explicit programming. As a simplified description, machine learning involves training models on datasets by using various kinds of machine learning algorithms.
For example, assume a user wants to predict whether information in a BIM model is true, false, or missing as compared with the BIM standards, project standards, the BIM execution plan, or other BIM documents. Machine learning algorithms can be used to make predictions based on given element features (e.g., line weight, color, and line pattern) as input data. Responses (output) may be true, false, or missing based on the algorithms applied to the BIM dataset. The trained model can also produce predictions for new data that is processed by ETL. A simple example to illustrate this concept is an email filtering algorithm that identifies whether a given email is spam or not spam. The machine can be trained repeatedly until the desired level of accuracy is achieved.
Two commonly used kinds of machine learning algorithms are supervised learning algorithms and unsupervised algorithms. They are described below:
Supervised Learning Algorithms. Supervised learning algorithms are the most commonly used and the most significant for BIM because elements are already categorized in the system. Both input and output data are therefore available for training, so the system can generate accurate insights to resolve problems. The following are commonly used supervised learning algorithms:
- Classification
- Regression
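As a rough illustration of supervised classification, the sketch below predicts a true/false label for an element from its line weight, color, and line pattern using a small k-nearest-neighbors vote. The training examples, the feature values, and the function names are hypothetical. The "missing" response is handled here by a simple rule rather than learned, which is an assumption of this sketch and not a description of any production tool.

```python
from collections import Counter

# Hypothetical labeled examples: (line_weight, color, line_pattern) -> label
TRAINING = [
    ((2, "black", "solid"), "true"),
    ((2, "black", "solid"), "true"),
    ((1, "gray", "dashed"), "true"),
    ((5, "red", "solid"), "false"),
    ((4, "red", "dashed"), "false"),
]

def distance(a, b):
    """Count the features on which two elements differ (Hamming distance)."""
    return sum(1 for x, y in zip(a, b) if x != y)

def classify(features, training=TRAINING, k=3):
    """Predict a label by majority vote among the k nearest training examples."""
    # Rule-based shortcut (not learned): any absent feature is reported as missing.
    if any(value is None for value in features):
        return "missing"
    nearest = sorted(training, key=lambda item: distance(features, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(classify((2, "black", "solid")))
print(classify((5, "red", "dashed")))
print(classify((2, None, "solid")))
```

A real dataset would need far more examples and a held-out test set to measure accuracy before the predictions could be trusted.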
Unsupervised Learning Algorithms. Unsupervised learning algorithms work with data that has no category information. Additional element features (for example, element shape and size) must be added in order to categorize the elements. The purpose of unsupervised learning algorithms is to identify patterns in the data and use them to group the elements for output. The following are commonly used unsupervised learning algorithms:
- Clustering
- Dimensionality reduction
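Clustering can be illustrated with a plain k-means loop that groups unlabeled elements by size. The element dimensions below are invented for the example, and starting from the first k points is a simplification chosen to keep the result deterministic; practical implementations usually use a more careful initialization.

```python
def kmeans(points, k, iterations=10):
    """Group 2-D points into k clusters with plain k-means (Lloyd's algorithm)."""
    centroids = list(points[:k])  # deterministic start: the first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        assignment = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return assignment, centroids

# Hypothetical (width_m, height_m) sizes of elements with no category attached.
sizes = [(0.9, 2.1), (3.0, 1.2), (1.0, 2.0), (2.8, 1.5), (0.8, 2.1), (3.2, 1.4)]
labels, centers = kmeans(sizes, k=2)
print(labels)
```

The algorithm only returns group numbers; deciding what each group represents (for instance, tall narrow elements versus wide low ones) is still left to the user.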
The following two examples illustrate possible problems in the BIM data. Figure 4 shows machine learning output with one false result and five true results. Figure 5 shows a trained dataset and the machine learning output for two models, Model A and Model B. Model A has a false Level 3 because it is inconsistent with the trained dataset (left-hand column), and Model B is missing Level 4.
A supervised learning classification algorithm identifying a false output.
A supervised learning classification algorithm identifying false and missing information.
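The true/false/missing output described for Model A and Model B can be mimicked with a direct comparison against a reference set of levels. This is a plain lookup, not a learning algorithm, and it is meant only to show the shape of the output. The level names, the elevations, the tolerance, and the assumption that Level 3 is inconsistent because of its elevation are all hypothetical.

```python
# Hypothetical reference levels (name -> elevation in metres).
REFERENCE = {"Level 1": 0.0, "Level 2": 4.0, "Level 3": 8.0, "Level 4": 12.0}

def check_levels(model, reference=REFERENCE, tolerance=0.001):
    """Label each reference level as true, false, or missing for one model."""
    report = {}
    for name, expected in reference.items():
        actual = model.get(name)
        if actual is None:
            report[name] = "missing"   # level absent from the model
        elif abs(actual - expected) <= tolerance:
            report[name] = "true"      # level matches the reference
        else:
            report[name] = "false"     # level present but inconsistent
    return report

# Model A carries an inconsistent Level 3; Model B has no Level 4.
model_a = {"Level 1": 0.0, "Level 2": 4.0, "Level 3": 8.5, "Level 4": 12.0}
model_b = {"Level 1": 0.0, "Level 2": 4.0, "Level 3": 8.0}

print(check_levels(model_a))
print(check_levels(model_b))
```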
Data Visualization
Machine learning output can be presented in graphic form to help users identify where the data is inconsistent or incomplete. For example, AECOM has internally developed a tool that executes the ETL process and machine learning algorithms to predict problems in the BIM data and provides reports and insights through the following visualizations. Data visualization can be generated at several levels as needed to support a project team. It can be generated at the project level (Figure 6), model level (Figure 7), and element level (Figures 8 and 9). Providing visualization at these levels allows users to view the same information at different levels of detail.
Data level visualization showing the overall project status of information
Data visualization showing the status of model level information
Data level visualization showing the status of element information and the machine-generated disciplines Task Manager used to resolve problems.
Data level visualization showing the status of element information; the results indicate missing grid elements
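Before any chart is drawn, the same element-level results can be rolled up to the model and project levels. The sketch below shows one way to do that with standard-library counters. The model names, element names, and statuses are made up for illustration and do not describe the output format of the tool mentioned above.

```python
from collections import Counter, defaultdict

# Hypothetical element-level results: (model, element, status)
RESULTS = [
    ("Architecture", "Level 1", "true"),
    ("Architecture", "Grid A", "true"),
    ("Structure", "Level 1", "true"),
    ("Structure", "Grid A", "missing"),
    ("MEP", "Level 1", "false"),
    ("MEP", "Grid A", "missing"),
]

def summarize(results):
    """Roll element-level statuses up to model-level and project-level counts."""
    project = Counter()
    by_model = defaultdict(Counter)
    issues = []
    for model, element, status in results:
        project[status] += 1           # project-level totals
        by_model[model][status] += 1   # model-level totals
        if status != "true":
            issues.append((model, element, status))  # element-level detail
    return project, by_model, issues

project, by_model, issues = summarize(RESULTS)
print(dict(project))
print(issues)
```

Each of the three return values corresponds to one level of detail, so a dashboard could plot `project` as an overall status chart and list `issues` as the items a discipline needs to resolve.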
Figures 10 and 11 show visualizations that illustrate a successful outcome for data science using ETL and machine learning algorithms. These examples demonstrate that the information is consistent across all the models.
Example showing model level elevations that are consistent in all the models.
Example showing column grids that are consistent in all the models.
Conclusion
The use of advanced machine learning algorithms can significantly reduce the time people spend identifying BIM data discrepancies and verifying information manually. Data science helps generate data visualizations and reports within minutes and makes it possible to share information over the cloud, where it can be accessed from anywhere and on any secure device. Visualizations and reports can be generated on an hourly, daily, or weekly basis. The Architecture, Engineering, Construction and Owner/Operator industry has invested heavily in BIM technology and BIM standards, and has developed execution plans, policies, and procedures related to it. Using BIM data effectively is key to making better business decisions. The examples included in this article reflect the basic concepts that can be applied. The use of machine learning algorithms has only just begun: our task is to explore the use of advanced algorithms to improve the use of BIM technology in AEC.