AI-based Document Classification Firm Concentric Emerges From Stealth
Concentric Emerges from Stealth with AI Document Classification Product and $7.5 Million Seed Funding
Unstructured documents — especially those that have been given wrong or no sensitivity classification — are among the most difficult assets for any enterprise to track and secure. Problems come from staff inappropriately sharing and insecurely storing documents. Ensuing threats go beyond the compliance concern of leaking personal data, and include the danger of sensitive commercial data falling into the wrong hands.
San Jose, California-based Concentric has emerged from stealth with a new deep learning solution called Semantic Intelligence. It uses language analysis to determine the sensitivity of individual documents to help solve and prevent this problem. At the same time, Concentric has raised $7.5 million in seed funding from Clear Ventures, Engineering Capital, Homebrew and Core Ventures. Concentric was founded in 2018.
In a separate report (PDF) published January 29, 2020, Concentric provides the result of analyzing 26 million unstructured documents from companies in the technology, financial and healthcare sectors. It found that each company has just short of 10 million unstructured documents. Each employee owns almost 2,000 documents. Among these, each employee owns 253 business critical documents — and among these, 38 documents per employee are at risk. Over 627,000 source code files and over 1 million trading files were also found.
But Concentric did not simply find files that were at risk; it found files that had actually been exposed. Per employee, five business critical documents were erroneously shared with an external party. Twenty-one were improperly shared with other groups. Nine were erroneously shared with internal users. And three business critical documents were wrongly classified.
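The per-employee figures can be checked with simple arithmetic. The four categories in this paragraph sum to the 38 at-risk documents per employee cited from the report, which suggests (although this is an inference, not something the report summary states here) that they are a breakdown of that total. A minimal sketch in Python, with variable names chosen purely for illustration:

```python
# Figures quoted from the Concentric report, per employee.
business_critical_per_employee = 253
at_risk_per_employee = 38

# The four exposure categories quoted per employee.
# Treating them as a breakdown of the 38 is an assumption based on the sum.
exposure_breakdown = {
    "shared_with_external_party": 5,
    "shared_with_other_groups": 21,
    "shared_with_internal_users": 9,
    "wrongly_classified": 3,
}

breakdown_total = sum(exposure_breakdown.values())

# Fraction of each employee's business critical documents that are at risk.
at_risk_share = at_risk_per_employee / business_critical_per_employee

print(f"Breakdown total: {breakdown_total}")
print(f"At-risk share of business critical documents: {at_risk_share:.1%}")
```

On these numbers, roughly 15 percent of each employee's business critical documents are at risk.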
Manual classification of this volume of documents requires extensive staff training and is prone to error. Manual classification done in arrears is so costly and time-consuming that it is a project often delayed, sometimes indefinitely. Existing automated rule-based methods of searching documents for keywords or phrases lead to large numbers of false positives, causing many documents to be over-classified and reducing the general availability of data to the company.
Concentric brings deep learning language analysis that can analyze context. It can tell the difference, for example, between a personal email quoting the dollar-value of a home, and the dollar-figure quoted in sales or M&A documents.
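The weakness of pattern matching can be illustrated with a minimal sketch. The rule, the function name and the sample texts below are hypothetical and are not taken from any real product; they simply show how a rule that looks for dollar amounts treats a personal email and an M&A memo identically.

```python
import re

# Hypothetical rule: treat any document that mentions a dollar amount as sensitive.
DOLLAR_RULE = re.compile(
    r"\$\s?\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion))?",
    re.IGNORECASE,
)


def flag_by_rule(text: str) -> bool:
    """Return True if the text contains anything that looks like a dollar figure."""
    return DOLLAR_RULE.search(text) is not None


# Invented sample documents for illustration only.
documents = {
    "personal_email": "We finally closed on the house, it appraised at $450,000!",
    "m_and_a_memo": "The board approved an offer of $450 million for the acquisition target.",
    "lunch_note": "Team lunch is at noon on Friday.",
}

flags = {name: flag_by_rule(text) for name, text in documents.items()}

for name, flagged in flags.items():
    print(f"{name}: {'flagged' if flagged else 'not flagged'}")
```

Both the personal email and the M&A memo are flagged, because the rule sees only the pattern and not the context in which the figure appears. That is the false-positive problem a context-aware model is meant to reduce.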
“Discovering and protecting unstructured data is a huge problem,” Concentric CEO and founder Karthik Krishnan told SecurityWeek. “The challenge is that this data is complex: contracts, NDAs, source code, design documents, and so on. Traditional methods of discovery have relied on using word patterns, but this lacks the context to be able to accurately classify the document. The result is that most companies don’t know where their high value assets are.”
Meanwhile, he continued, “deep learning has progressed to the point where it can both solve problems at scale and do it with a degree of precision. What we have built is a system that uses a deep learning language model to develop a semantic level of understanding of the context. We can look at both the words and how they are used within the broader context of a document to understand the meaning. This allows us, in a completely unsupervised manner, to build thematic groups, putting contracts, design documents, NDAs into their own groups.”
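The idea of unsupervised thematic grouping can be sketched in a few lines of Python. This is not Concentric's method: the sketch swaps the deep learning language model for a simple bag-of-words cosine similarity, and the function names, threshold and sample documents are all assumptions made for illustration. What it shares with the description is that no labels, rules or policies are supplied up front.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a word-count vector (a crude stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_documents(docs, threshold=0.3):
    """Greedy grouping: a document joins the first group whose seed document
    it resembles closely enough, otherwise it starts a new group."""
    vectors = {name: vectorize(text) for name, text in docs.items()}
    groups = []
    for name, vec in vectors.items():
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Invented sample documents for illustration only.
docs = {
    "nda_1": "mutual nondisclosure agreement confidential information receiving party disclosing party",
    "nda_2": "nondisclosure agreement between disclosing party and receiving party covering confidential information",
    "design_1": "system design document architecture components api latency requirements",
    "design_2": "design document for storage service architecture api and latency requirements",
}

groups = group_documents(docs)
print(groups)
```

The two NDAs end up in one group and the two design documents in another, without any category names having been defined in advance. A real system would need a far richer representation of meaning than word counts to separate documents that share vocabulary but differ in context.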
By then analyzing and comparing documents within their groups, he explained, the Semantic Intelligence product can understand “how the data has been identified or classified or shared across the business units to provide a risk-based view over that data. The idea is that business-critical data combined with how it has been shared, whether it has been shared with the right sets of people, provides a view into the risk. We could compare a design document with another design document and look for signs of risky sharing where a document might have been shared inappropriately. This is all autonomously derived without a single rule or regular expression or a policy function that needs to be defined up front. It’s all driven by the thematic groupings that we build using our deep learning models. The goal is to help companies discover and protect their unstructured data.”
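The comparison Krishnan describes, checking how one document is shared against how its thematic peers are shared, can be sketched as a simple outlier test. Everything here is hypothetical: the function name, the `min_peer_fraction` parameter, the audience labels and the sample data are assumptions for illustration, not details of the Semantic Intelligence product.

```python
from collections import Counter


def flag_risky_sharing(group_shares, min_peer_fraction=0.5):
    """Flag documents shared with audiences that are unusual for their group.

    group_shares maps each document in ONE thematic group to the set of
    audiences it has been shared with. An audience is treated as unusual
    when fewer than min_peer_fraction of the group's documents are shared
    with it.
    """
    audience_counts = Counter(
        audience for audiences in group_shares.values() for audience in audiences
    )
    total = len(group_shares)
    flagged = {}
    for name, audiences in group_shares.items():
        unusual = {
            audience
            for audience in audiences
            if audience_counts[audience] / total < min_peer_fraction
        }
        if unusual:
            flagged[name] = unusual
    return flagged


# Invented sharing data for a group of design documents.
design_docs = {
    "design_a": {"engineering"},
    "design_b": {"engineering", "product"},
    "design_c": {"engineering", "product"},
    "design_d": {"engineering", "external:partner.example"},
}

flagged = flag_risky_sharing(design_docs)
print(flagged)
```

Only `design_d` is flagged, because it is the sole design document shared with an external audience. The baseline for what counts as normal sharing comes from the group itself, not from a rule written in advance.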
Semantic Intelligence uncovers, categorizes and classifies the documents, and allows IT and security teams to monitor data security with timely information and risk visualizations that drill down into the at-risk documents. The solution also integrates with major third-party security tools and data stores to help customers leverage the security investments they already have in place.
“Businesses understand the importance of protecting their critical assets, and yet, despite their best efforts, an extreme amount of data is left unsecured, unidentified, misclassified and at risk,” said Krishnan. “Unstructured data is currently copious and dispersed, and it includes an alarming amount of business-critical information. It’s a target for cybercriminals and can be a pitfall for regulatory compliance, but securing it is incredibly difficult. It’s the data challenge of our digital generation that we’re laser-focused on solving.”