Search Less. Discover More.
Today, after two years of hard work in silence, I am delighted to announce the launch of Sapiens, an online collaborative biomedical knowledge discovery tool. I’m also incredibly privileged to announce the names of investors who took an early bet on us, including Hike Ventures, OnDeck, Propelx and a few amazing angels including David Booth, Dr. Cassian Yee, Arthur de Garidel, Kevin Christoffersen, Dan Kemmer, Volker Rudolph among others.
Over the last two decades, biologists and bioengineers have generated large swathes of data in distinct and siloed areas of interest. One example is the generation of cellular systems that are more relevant to human disease than ever before: scientists today can transform an adult skin cell into an embryonic stem cell-like state using Yamanaka factors and then differentiate it into many different lineages and cell types resembling real human biology, a data-hungry and cross-disciplinary endeavor. Such advances are not limited to individual cells; they extend to organoids and multi-organ systems-on-chips (SOC) that allow us to replicate human biology in a scalable way.
Add to that the ability to perturb these systems using very fine-grained molecular scissors such as CRISPR, which allow scientists to turn genes on and off and to make edits at the level of individual base pairs, so that we can now start to see what happens to individual cells if one changes single nucleotides (A → C or T → G). Scientists measure these cellular systems in multiple modalities by harnessing advances in optics, mass spectrometry, sequencing and other techniques that generate hundreds of thousands of different readouts on each of these cellular systems. All of that data needs to be processed, relayed to other groups, and connected to the insights it supports, and breakthrough experiments need to be replicated.
Why? Part 2: More Data than ever; More scattered knowledge than ever
Roughly 95% of the world’s data have been generated in the last 5-10 years. The emergence of high-resolution, multi-modal data at scale has resulted in an enormous amount of human knowledge scattered across data silos. Querying all that knowledge has become remarkably complex. I experienced the challenge of searching through such scattered knowledge bases firsthand and continuously over the years. It became acutely painful in the summer of 2017, when I was founding my first techbio startup, Mekonos, now a fast-growing company pioneering delivery technologies to transform and scale gene-edited and cell therapies. To target potential customers for the Mekonos technology, I was trying to get answers to questions such as (a) which types of gene therapy programs have the greatest probability of success, (b) what obstacles might stand in the way of clinical and commercial success of the key gene and cell therapy programs, and (c) which leading labs and companies are developing T-cell therapies for solid tumors. The best I could do was to go on Google or PubMed, architect a search, scroll through the top dozen or so sources, pore over the references in these articles and go down a rabbit hole. I spent most of my time searching through different data sources instead of discovering new insights.
Why now? Breakthroughs in Natural Language AI + Serverless Cloud + Open access data
As a company, we were incredibly lucky: in late 2020, when we began our journey to organize and integrate the world’s biomedical knowledge from disparate sources and make it accessible, a parallel and potent set of tools emerged in Natural Language Processing (NLP) and in Software and Data Management.
NLP: Rapid advancements in training large-scale models efficiently, together with the abundance of text data on the internet, led to the success of Large Language Models (LLMs) that use the Transformer architecture. These new models appear to model language statistics (word co-occurrences) better than the conventional approaches based on RNN or LSTM models. It suddenly became possible to solve tasks that were more human than what previous NLP models were capable of: we could identify semantic relationships between concepts in the biomedical space, a core unit of the product that we are deploying today.
Software: Database management systems and services have wised up to the power of networks and network-based AI/ML, since they offer one way of creating an explainable infrastructure to generate insights or inferences. Recent advances on this front facilitated the construction of very large knowledge graphs (KG) that can be maintained and updated in milliseconds.
Data: There has been a tailwind for open-access data. Public policy initiatives to make previously private data repositories public have leveled the playing field to construct large knowledge-scapes that have never been explored before. As of 2018, the most recent year for which there is comprehensive data, nearly 50% of all new scientific publications were open access. This trend shows no sign of slowing.
During my time as a Visiting Scholar at UC Berkeley (Go Bears!), while running Mekonos in parallel, I started thinking more about how one could marry the advancements in Transformer-type large language models with the generational problem of biomedical knowledge scattered across disparate data silos. Discussing these issues over and over again with leading thinkers in this space, such as well-known VC and my teacher at Cal, Shomit Ghose; Prof. Vwani Roychowdhury at UCLA; Prof. Russ Altman and Crystal Mackall at Stanford; Prof. Krishnendu Roy at CMaT and others, convinced me to start a company to solve this problem.
Characteristically, I named it NExTNet, a portmanteau of NExT (my ‘next’ 0-to-1 startup) and Net (resembling the rapidly growing semantic knowledge network powering the Sapiens platform). In the summer of 2020, as the COVID pandemic was at its peak and I was charting out the blueprint for this new company, I was incredibly lucky to meet one of the rising stars in NLP, Prof. Roychowdhury’s student at UCLA, Pavan Holur, who would eventually become the CTO of NExTNet. In the Fall of 2020, the work on Project Sapiens officially began, and for the next couple of years we kept our heads down building probably the most sophisticated Natural Language AI platform of its kind. In the succeeding months, we built an exceptional team.
We also quickly realized that the biggest pain point scientists face on a day-to-day basis is collaboration with their teams. When talking to our early adopters at leading companies, we often hear how lonely it can be to work as a scientist at the cutting edge of science. These problems are further exacerbated because the role of scientists in biotech and biopharma organizations is more cross-functional than ever before. Scientists are at the center of these R&D-heavy organizations: on any given day they might find themselves collaborating on a multitude of projects. And while they may be focused on a specific part, they also need to understand or even direct the larger effort, much of which may lie outside their area of expertise. The ability to quickly understand a novel, complex field is a necessity.
Software development has long been familiar with this challenge. Modern software design involves a collaborative team of engineers who might be distributed across different time zones all around the world. To meet this demand, there are a multitude of tools to collaborate on code effectively. When new tools are required, engineers develop them. If these novel tools are good enough, then other engineers adopt them in the industry. This results in software becoming better, cheaper, faster and easier to use as more engineers get to participate in its development.
Compare that to the workflow of scientists who are developing life-saving therapies and climate-improving biomolecules. Much of this work is still confined to paper, email and Excel spreadsheets that just sit there and collect dust. While modern tools like Benchling have made cloud-based sharing and collaboration on files easier for scientists, when it comes to collaborative scientific discovery, hypothesis formation and knowledge sharing, scientists are still in the dark ages. From asking questions, to querying a myriad of disparate data sources, to commenting, sharing and storing, no single tool tackles the entire workflow of scientific discovery in one place.
When we started working on Sapiens, we knew it was possible to build a fast and stable knowledge querying tool in the browser that can scale with very low latency and high concurrency, but little did we know how hard it would be. From rendering the knowledge graph with an immersive and intuitive UI in the front-end to building sophisticated knowledge and relationship extraction NLP pipelines in the backend to the middleware for storing and delivering data for user queries to a multitude of performance edge cases, getting here has been challenging. Specifically, harmonizing the multitudes of biomedical data is an arduous task due to the unique characteristics of each biological database; however, using our NLP and KG, we’re unraveling novel relationships between biological databases which can’t be explicitly discovered with existing tools.
Scientists in the biotech and biopharma industry have high expectations for a tool that they rely on daily! After dogfooding Sapiens internally for the past eighteen months and partnering closely with our alpha and beta customers over the last quarter, I’m confident that our team has reached this high bar.
While the technological breakthrough of developing state-of-the-art NLP on multimedia data (text, images, molecular data, etc.), coupled with our Explainable Graphical User Interface (GUI) in the browser, is exciting, I’m even more excited by the ChatGPT-style natural language querying and the live collaboration possibilities we’re beginning to unlock. Whether you are asking scientific or commercial questions (e.g., What are the biomarkers for Disease X? What are the key activated pathways in my scRNASeq data? What are the growth areas for off-label use for drug Y? What patents have been filed recently for treatment X?), sharing your scientific discovery or hypothesis with a link, giving live contextual feedback to your peers, or seeing tagged content on the Graph that may have massive analytical value for your team, Sapiens makes it easy to collaborate, to ask and answer complex questions, and to discover hidden connections in disparate data, without having to master coding, query languages, or arcane statistics.
Today, we’re announcing our Free-tier release: a way for biotech teams to get early access to Sapiens and help shape our product roadmap. With the Basic Free-tier version (e.g., logging in with your Google or business email), you can discover insights across disparate molecular data sources. With the Academic Free-tier version (e.g., logging in with your .edu email), you can discover insights from both molecular and scientific literature (currently in beta) data. We are building the largest, human-curated reinforcement learning data set for determining the relevance of scientific topics to each other. Sapiens becomes smarter and faster with more data and human-in-the-loop learning. And we’re just getting started, so right now the only cost is your feedback and interaction with Sapiens.
Here are some specific things we’re really excited about in 2023:
- Ability to query Sapiens in Natural language. It’s like using ChatGPT in 3D (!!!)
- Live collaboration and simultaneous, multiplayer editing.
- Uploading your internal/experimental data and contextualizing.
- Bioentity relationship prediction, leveraging NLP and perpetually refreshed biomedical data.
And here are some of the features that you can access today. We also have a whole content library to get you started. Check it out! If you have suggestions or comments, please forward them to firstname.lastname@example.org
The possibilities of what you can do with Sapiens are endless! We can’t wait to see all the great things that you will discover. Huge thanks to Pavan Holur, Derek Park, Tyler Myers, Manish Singhal, Dan Goodman, Emmanuel Alofe and Valeriya Kalkina for reading drafts of this.