
RoomPlan

The RoomPlan paper is a fascinating read. The result may seem simple, but it is the output of carefully designed and optimized steps that open the door to improved semantic understanding of physical contexts.
iPad Pro displaying RoomPlan
Paper: 3D Parametric Room Representation with RoomPlan (Apple Machine Learning Research)
Screenshot displaying a canvas with the extracted bits and annotations for the RoomPlan paper

RoomPlan uses the camera and LiDAR scanner to build a 3D floor plan of a room, including dimensions and types of furniture.

The system is made up of two parts (their outputs are sketched in code below):

  1. 3D room layout estimation (RLE)
  2. 3D object-detection pipeline (3DOD)
Timeline chart describing the sequential steps necessary for creating the RoomPlan “dollhouse” output
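On device, both outputs surface through the RoomPlan Swift framework as a single CapturedRoom value. As a rough illustration of what the two pipelines produce, the sketch below reads the layout surfaces (from RLE) and the furniture boxes (from 3DOD); the summarize helper is invented here, but walls, doors, windows, openings, and objects are real CapturedRoom properties.

```swift
import RoomPlan
import simd

// Hypothetical helper: inspect the two kinds of output RoomPlan produces.
// `room` would come from a finished RoomCaptureSession scan.
func summarize(_ room: CapturedRoom) {
    // Layout surfaces produced by room layout estimation (RLE).
    print("walls: \(room.walls.count), doors: \(room.doors.count), " +
          "windows: \(room.windows.count), openings: \(room.openings.count)")

    // Oriented bounding boxes produced by 3D object detection (3DOD).
    for object in room.objects {
        let size = object.dimensions             // width, height, depth (meters)
        let center = object.transform.columns.3  // world-space position
        print("\(object.category): \(size.x) x \(size.y) x \(size.z) m at \(center)")
    }
}
```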

Room Layout Estimation

RoomPlan first detects the walls and openings, and then uses that result to classify each opening as a door or a window.

Figure 1: The figure shows the room layout estimation (RLE) process, starting with the walls and openings pipeline. RoomPlan predicts the walls and openings as 2D lines via our end-to-end line detector neural network. The lines are then lifted to 3D using our postprocessing pipeline, which leverages the estimated wall height.
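The lifting step is simple geometry. Here is a minimal sketch under assumed conventions (floor in the xz-plane, y up; the DetectedLine type and all names are invented for illustration): each 2D wall line is extruded upward by the estimated wall height into a 3D wall quad.

```swift
import simd

struct DetectedLine {
    var start: SIMD2<Float>   // (x, z) endpoint on the floor, in meters
    var end: SIMD2<Float>
}

/// Extrude a detected 2D wall line upward by the estimated wall height,
/// producing the four corners of a 3D wall quad.
func liftToWall(_ line: DetectedLine, wallHeight: Float) -> [SIMD3<Float>] {
    let a = SIMD3<Float>(line.start.x, 0, line.start.y)
    let b = SIMD3<Float>(line.end.x, 0, line.end.y)
    let up = SIMD3<Float>(0, wallHeight, 0)
    return [a, b, b + up, a + up]   // floor edge first, then ceiling edge
}

// Example: a 4 m wall lifted with a 2.4 m estimated height.
let corners = liftToWall(DetectedLine(start: [0, 0], end: [4, 0]), wallHeight: 2.4)
```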
Figure 2: We formulate the problem as 2D detection. The wall information is used to project the input into wall planes. This is later fed to our neural network, called the 2D orthographic detector, that predicts the location of the doors and windows in 2D planes. This output is later lifted into 3D along with the wall and camera information.
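As a rough sketch of that projection, a wall plane can be parameterized by an origin, a unit direction along its length, and the world up vector; 3D points drop orthographically into (u, v) wall-plane coordinates, and a 2D door or window detection lifts back into 3D the same way. The WallPlane type and its conventions are assumptions for illustration, not the paper's code.

```swift
import simd

struct WallPlane {
    var origin: SIMD3<Float>
    var along: SIMD3<Float>            // unit vector along the wall's length
    let up = SIMD3<Float>(0, 1, 0)     // world up

    // Orthographic projection: world point -> (u, v) in the wall plane.
    func project(_ p: SIMD3<Float>) -> SIMD2<Float> {
        let d = p - origin
        return SIMD2(simd_dot(d, along), simd_dot(d, up))
    }

    // Inverse: lift a 2D detection (e.g. a window corner) back into 3D.
    func lift(_ uv: SIMD2<Float>) -> SIMD3<Float> {
        origin + uv.x * along + uv.y * up
    }
}

// Example: a wall running along the x-axis.
let wall = WallPlane(origin: [0, 0, 0], along: [1, 0, 0])
let uv = wall.project([2.0, 1.0, 0.1])   // ~(2.0, 1.0) in the wall plane
let corner = wall.lift(uv)               // back to 3D, depth discarded
```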

3D Object Detection

3DOD is a three-step pipeline: it first detects and categorizes objects locally, then refines them globally with scene-level context, and finally applies a box fusion process to create the dollhouse result, a 3D representation of the whole room.

Figure 3: We create a wider frustum view by accumulating semantic point clouds. The local detector detects the oriented bounding boxes in a frustum, and aggregates them. Global detection then works with scene contextual information to detect large furniture. The accumulated results and boxes are combined to generate the final output.
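The paper does not spell out the fusion algorithm, but its core idea, merging overlapping same-category boxes coming from the local and global detectors, can be approximated with a greedy IoU-based merge. The toy sketch below uses axis-aligned 2D footprints and an invented threshold; RoomPlan itself fuses oriented 3D boxes.

```swift
import simd

struct Box2D {
    var min: SIMD2<Float>   // footprint corner (x, z), meters
    var max: SIMD2<Float>
    var category: String
}

// Intersection-over-union of two axis-aligned footprints.
func iou(_ a: Box2D, _ b: Box2D) -> Float {
    let lo = simd_max(a.min, b.min)
    let hi = simd_min(a.max, b.max)
    let inter = max(hi.x - lo.x, 0) * max(hi.y - lo.y, 0)
    let areaA = (a.max.x - a.min.x) * (a.max.y - a.min.y)
    let areaB = (b.max.x - b.min.x) * (b.max.y - b.min.y)
    return inter / (areaA + areaB - inter)
}

// Greedy fusion: merge every pair of same-category boxes whose IoU exceeds
// the threshold into their union, repeating until nothing changes.
func fuse(_ boxes: [Box2D], threshold: Float = 0.3) -> [Box2D] {
    var result = boxes
    var merged = true
    while merged {
        merged = false
        outer: for i in result.indices {
            for j in result.indices where j > i {
                guard result[i].category == result[j].category,
                      iou(result[i], result[j]) > threshold else { continue }
                result[i].min = simd_min(result[i].min, result[j].min)
                result[i].max = simd_max(result[i].max, result[j].max)
                result.remove(at: j)
                merged = true
                break outer
            }
        }
    }
    return result
}

// Two overlapping detections of the same sofa merge into one box.
let fused = fuse([
    Box2D(min: [0, 0],   max: [1, 1],   category: "sofa"),
    Box2D(min: [0.5, 0], max: [1.5, 1], category: "sofa"),
])
```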

Merging two boxes into an L-shaped sofa is an example of a relationship between objects exploited during the fusion step. Also, if an object does not intersect any wall, its orientation is estimated by aligning it with the closest one (sketched below).
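Here is a minimal sketch of that alignment heuristic, with walls simplified to 2D segments and every name invented for illustration: find the wall closest to the object's footprint and adopt its yaw.

```swift
import Foundation
import simd

struct WallSegment {
    var start: SIMD2<Float>   // 2D footprint endpoints (x, z)
    var end: SIMD2<Float>
    var yaw: Float { atan2(end.y - start.y, end.x - start.x) }
}

// Distance from a point to a wall segment.
func distance(from point: SIMD2<Float>, to wall: WallSegment) -> Float {
    let ab = wall.end - wall.start
    let raw = simd_dot(point - wall.start, ab) / simd_dot(ab, ab)
    let t = min(max(raw, 0), 1)   // clamp to the segment
    return simd_distance(point, wall.start + t * ab)
}

// The yaw a free-standing object should adopt: the closest wall's orientation.
func alignedYaw(objectCenter: SIMD2<Float>, walls: [WallSegment]) -> Float? {
    walls.min(by: { distance(from: objectCenter, to: $0) <
                    distance(from: objectCenter, to: $1) })?.yaw
}

// Example: a chair at (1, 0.5) snaps to the x-axis wall, so yaw == 0.
let walls = [WallSegment(start: [0, 0], end: [4, 0]),
             WallSegment(start: [0, 0], end: [0, 3])]
let yaw = alignedYaw(objectCenter: [1, 0.5], walls: walls)
```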

An animation showing the fusion step of a whole-house scan, before and after. The most significant items are combined and several walls are fixed.

🪴🪑📦 The chair category seems to be the most difficult to identify (83% precision), due to heavy occlusion or crowded arrangements.

Screenshot of the sample capture video failing to recognize a chair