Proptech and Real Estate Document Reading: It’s Complicated

AI satisfies more of the industry’s demand for accurate and timely data abstraction — but humans are for now still essential


Digitalizing the real estate paper chase is proving to be a difficult task.

Increasingly effective use of artificial intelligence (AI) and machine learning is improving efforts to decipher, abstract, and make more accessible the industry’s arcane, complex and seemingly endless amount of data in documents, but 100 percent accuracy remains an elusive goal, according to experts in the field.

And humans very much remain in the loop.

“Advancements in generative AI can help you be smarter and faster when it comes to document reading in lease and insurance data abstraction, but it’s not going to totally replace human expertise,” said Omri Stern, co-founder and CEO at Jones, a Manhattan-based proptech firm that uses such technology to help property management and construction companies reduce insurance risk.

“We see generative AI as having the potential to give us a high degree of confidence in how we are parsing and extracting data from documents,” said Stern. “But it is, at the moment, not able to completely replace humans, because, as it turns out, leases and specifically insurance language and contractual agreements are written in a relatively complex way that doesn’t actually produce the accuracy we expect to see in our work.”

AI comes in different forms, and generative AI such as ChatGPT can take what it has learned from the examples it’s been shown and create something entirely new based on that information. ChatGPT uses Large Language Models (LLMs), a form of machine learning that can generate novel combinations of text in the form of natural-sounding language.

“I actually spoke with a friend who works at OpenAI [an AI research and deployment company], and he said that it’s more than an invention — it’s more like a discovery,” said Michael Rudman, Jones’s chief technology officer and co-founder, referring to generative AI. “They add more and more parameters into models, and suddenly they start to realize that it’s starting to understand context. And they’re still discovering ways to work with it.”

San Francisco-based Prophia has been using AI for commercial real estate document reading and abstraction since its founding in 2018, said Cameron Steele, co-founder and CEO at Prophia. The proptech company reviews about 40 different types of documents, and client demand for real-time analysis of them is growing, said Steele.

“We target about 200 different what we call concepts, which are very detailed across all these different document types,” he said. “When we started, we thought there’s probably 30 concepts we wanted to target, but our customers have continued to ask us for more and more data. The challenge is because a lot of the governance and management of buildings is tied to contractual documents. All that data is trapped in these legal agreements.” 

To extract that data, the commercial real estate industry has historically relied on people reading things and putting them into spreadsheets, Steele added. “That’s still very much alive, but it’s going away.”

For Prophia, the key to customer satisfaction is providing real-time access to verified data, Steele said. “We’re seeing a lot of demand from our customers because they want access to this data real time and they want to trust it because they don’t trust a lot of the data they have today.” Client demand for more and more data puts companies like Prophia in a situation akin to walking the wrong way on a moving sidewalk, said Steele.

“Every tenant we onboard on behalf of our customers has about seven documents and about two-thirds of all of the annotations we do are through auto annotation,” he explained. “The remaining third, we’ll do both technical audits as well as human review to make sure the existing annotations are done accurately. That takes us about 30 to 40 minutes per tenant today. The bigger burden on us is the review, which is harder to build tools around, but we’re working on that.”

One particular pain point in real estate document abstraction is the mortgage origination sector. High interest rates and home prices have crushed the margins of that sector, said Nima Wedlake, principal at Thomvest Ventures. Thomvest is a San Francisco-based group of investment companies that makes investments across proptech, fintech and other industries on behalf of billionaire Peter J. Thomson and his family.

“We are investors in a company called Ocrolus, which is focused on extracting relevant information out of documents of any type,” said Wedlake. “They started in commercial lending, but they’ve since expanded to have quite a large set of customers in the mortgage space. They also are playing an important role in the tenant application space by automating income verification, checking for fraud, and extracting relevant details out of a credit application, for instance.”

Thomvest Ventures plans to invest in other proptech companies automating real estate data extraction, he added.

“Accuracy is so important in any sub-segment of financial services or real estate,” said Wedlake. “I think it will take a while before we can fully outsource a mortgage origination experience to AI. There will continue to be a human in the loop.”

Along with accuracy issues, there will be a greater regulatory component for companies abstracting data, he said.

“We’re far from the days of basically interfacing only with AI to get a loan or replace the loan officer or relationship manager at a bank,” Wedlake said. “The places where I think we’ll see the most action and where I’m seeing companies being formed is around the co-pilot concept — let’s take the work that a loan officer does and try to scale their effectiveness 10 times by augmenting them with some sort of AI that can either accelerate the processing of information or respond quickly to inbound customer requests.”

Following 20 years at EY, the last five of which he spent building AI products on a global scale, Vijay Anand joined Solon, Ohio-based MRI Software about 18 months ago as vice president for AI and data products to “jump-start AI” there, he said.

One of the biggest challenges Anand faced at MRI was the uniqueness and complexity of real estate leases, which make data extraction particularly difficult compared to other areas in which AI is used.

“It’s a lot more difficult to do than facial recognition that you do on your Facebook feed or your Google photos, because they have billions and billions of photos that they train on, and people tag, and so on,” Anand said. “Leases are written by lawyers and are different regionally. What one firm calls something — renewal rent or rent roll — is called something else by another.

“But we have implemented what I would call a state-of-the-art AI extraction for NLP, natural language processing, that is one of the, I want to say unglamorous elements of AI, that we have been using other than in the chat functionality.”

At this point, LLMs such as ChatGPT cannot accurately abstract real estate documents, Anand added.

“The only reason why very big companies with big investments have been able to [use LLMs] now is because they are training on trillions of data points that they have been able to do across any domain,” Anand said. “So they’re not focused on real estate. Even in our own experiments, trying it with the large language models, if you [submit] a real estate lease, even up until April, you could only send two pages of lease data into ChatGPT. Now they are able to accept, if you have access to the latest model, about 100 pages. But our typical leases are an average of 100 pages and more, so we could not even use the whole lease.

“And then there’s still a lot of room for error because domain knowledge is so important. You need to know real estate and you need to know real estate leases by region, the type of lease — whether it is commercial, residential, occupier-owner — all of those things change the language and the meaning across the world.”
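
To make the page-limit problem Anand describes concrete, here is a minimal sketch of one common workaround: splitting a long lease into overlapping windows that each fit under a model's page limit. It is an illustration only, not how MRI or any firm quoted here handles long leases. The function name, the 130-page lease and the five-page overlap are all hypothetical, and the 100-page limit is simply the figure Anand cites.

```python
from typing import List


def split_into_windows(pages: List[str], max_pages: int, overlap: int = 1) -> List[List[str]]:
    """Split page texts into windows of at most max_pages pages.

    Consecutive windows share `overlap` pages so a clause that straddles
    a boundary appears whole in at least one window (an assumption of
    this sketch, not a guarantee for clauses longer than the overlap).
    """
    if max_pages <= 0:
        raise ValueError("max_pages must be positive")
    if not 0 <= overlap < max_pages:
        raise ValueError("overlap must be at least 0 and less than max_pages")

    step = max_pages - overlap
    windows: List[List[str]] = []
    start = 0
    while start < len(pages):
        windows.append(pages[start:start + max_pages])
        if start + max_pages >= len(pages):
            break  # this window already reaches the last page
        start += step
    return windows


# Hypothetical 130-page lease, one string per page.
lease_pages = [f"page {i}" for i in range(1, 131)]
windows = split_into_windows(lease_pages, max_pages=100, overlap=5)
print([len(w) for w in windows])  # [100, 35]
```

Splitting this way gets every page in front of the model, but it does nothing for the domain-knowledge gap Anand raises next, and a term defined in one window may still be referenced in another.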

The accuracy rates of real estate data extraction vary from company to company depending on how many data points a client wants abstracted from documents. Generally, proptech firms focused on this function claim 80 to 95 percent accuracy.

“The volume of data that we extract depends on the language of the lease, the type of complexity of the lease, as well as what you’re trying to extract,” said Anand. “An average client extracts over 600 pieces of data from a lease — anywhere from a clause to co-tenancy, to the rent roll, which forms about 30 percent of the lease. We can get an extraction accuracy of well over 85 percent.”

In addition, MRI has introduced “forms intelligence” to complement real estate documents through the use of highly structured and more concise forms that yield between 90 and 95 percent accuracy, said Anand.

“I want to tell you that the human in the loop is the most critical element of AI accuracy,” he added. “What we do with our software is any time we have a human in the loop review, we have a managed services team that does that because our clients want our SLA [service level agreement] over 95 percent accuracy. In order to do that there’s a human element and that varies from time savings of over 50 to 70 percent, depending on what you’re trying to extract and how complex your leases are.”
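
One simple way to picture a human-in-the-loop step like the one Anand describes is a confidence threshold: extracted fields the model is sure about pass straight through, and the rest go to a reviewer. The sketch below is a hypothetical illustration, not MRI's software. The field names, values and confidence scores are invented, and reusing 0.95 as the cutoff only echoes the SLA figure in the quote, since a per-field confidence score is not the same thing as measured accuracy.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # hypothetical model score between 0 and 1


def route_for_review(
    fields: List[ExtractedField], threshold: float = 0.95
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Return (auto_accepted, needs_review), split on the confidence threshold."""
    auto_accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return auto_accepted, needs_review


# Invented example fields; real extractions would come from a model.
fields = [
    ExtractedField("renewal_rent", "hypothetical value A", 0.99),
    ExtractedField("co_tenancy_clause", "hypothetical value B", 0.72),
    ExtractedField("lease_start", "hypothetical value C", 0.96),
]
accepted, review = route_for_review(fields)
print([f.name for f in accepted])  # ['renewal_rent', 'lease_start']
print([f.name for f in review])    # ['co_tenancy_clause']
```

Raising the threshold sends more fields to people and cuts into the time savings, which is the trade-off behind the 50 to 70 percent range Anand cites.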

As Prophia’s Steele summarizes his company’s mission: “I tell our team every time we get together we’re in the trust business. [Clients] have to trust us with their data and that we are providing them with verifiably accurate data. It’s like, trust but verify. It’s fundamental to our business.”

Philip Russo can be reached at