Skip links

Meet dStill

Technology needs an amigo or three.

“Park it right here, pal. You’re going to want to be sitting down for this.” Every corner I turn, I’m met with a buzz that feels more like a great awakening of ignorance in today’s real assets scene. What we’re celebrating as innovation is often just a plethora of mediocrity. While the buzz around Generative AI and large language models (LLMs) within real assets is partly justified, much of it is overhyped, and only a select few of these applications may truly revolutionize our sector. After all, language models aren’t new kids on the block; their lineage runs back to the 1960s.

Large language models are AI powerhouses trained on a plethora of datasets to decipher, generate, and tweak human language. Their attention-based architecture weighs each word against its context to craft contextually rich text. Like a literary parasitoid, these models gorge on text—books, articles, websites, all forms of the written word—learning patterns, structures, and the subtleties of language, eventually overwhelming their host in their lifecycle. Evolution has a twisted sense of humor. Half jokes aside, their downstream capabilities are boundless, from generating coherent responses and creating content, to mastering translation nuances, summarizing vast texts, answering questions, and yes, even waking my 14-year-old and making her Belgian waffles.
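Since we’re dating language models back to the ’60s: the granddaddy of the breed is the humble n-gram model, essentially a lookup table of which word tends to follow which. A toy sketch (all names here are ours, purely illustrative), just to show the family resemblance to modern next-word prediction:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Tally which word follows which: the crudest possible 'language model'.
    Modern LLMs do the same next-word job with transformers and billions of
    parameters instead of a lookup table."""
    follows = defaultdict(Counter)
    words = corpus.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent continuation seen in training, if any."""
    return follows[word].most_common(1)[0][0] if follows[word] else None

# Train on a suitably themed toy corpus
model = train_bigram("a plethora of pinatas a plethora of data a plethora of hype")
```

Ask it what follows “plethora” and it will dutifully answer “of.” Swap the lookup table for a few hundred billion parameters and you have today’s headline act.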

Indeed, a plethora of innovation has emerged. Let me illustrate this with a scene from the 1986 film “¡Three Amigos!” In this cleverly scripted moment, the main villain, El Guapo, is organizing his birthday and questions his sidekick, Jefe, about whether he has prepared a “plethora” of piñatas. Jefe, oblivious to the actual meaning of “plethora” but eager to impress, assures him that indeed he has. Skeptical, El Guapo delves deeper, triggering a comedic dialogue as Jefe’s confusion over the word “plethora” emerges, exposing a stark gap in understanding. This scene parallels the current technology landscape in real assets, where excitement often masks a substantial depth-of-data deficiency.

Roger Ebert lambasted this masterpiece as “a send-up that doesn’t care enough even to be a satire.” It appears that this sentiment resonates beyond the realm of film critique, Roger. Rest in peace.

“[LLMs have] a deep and abiding love for things that look true, [fostering] a passion for things that feel like facts”

Now, let’s explore the shifting sands of applying subject matter expertise (SME) in this digital epoch. The boundary between freely available knowledge and bona fide expertise is becoming ever more blurred. Nowadays, expert knowledge sprawls across digital realms—scholarly articles, industry reports, extensive databases—all up for grabs and scrutiny by our favorite digital moocher(s). This widespread accessibility democratizes information but also blurs the lines, challenging us to distinguish true expertise from simple data accumulation. 

As LLMs operate at ludicrous speed, synthesizing and cross-referencing information at a pace that leaves human experts in the dust, they craft an illusion of interdisciplinary expertise. Yet, despite this high-speed data processing, LLMs fall short of emulating the deep, experience-based wisdom that only humans possess. “When a true genius appears in the world, you may know him by this sign, that the dunces are all in confederacy against him” – Jonathan Swift. This vividly encapsulates the challenges faced by LLMs, pushing the boundaries yet unable to replicate the nuanced intelligence of human expertise.

Gary Klein and Harry Collins, luminaries in cognitive and sociological studies of expertise and decision-making, highlight that the essence of professional acumen derives from rich, nuanced understandings and direct, tactile experiences—qualities that elude full capture in written words and thus remain out of reach for LLMs. Collins, for example, delves into the realm of tacit knowledge, such as the art of discerning fine wines or the finesse of executing a musical performance, which depend profoundly on sensory immersion and personal rehearsal, qualities that simply cannot be distilled into text.

Moreover, expertise isn’t merely static knowledge but dynamically adapts to real-world conditions—a feat that LLMs’ incorporeal data cannot achieve alone. Situational awareness plays a crucial role here. It involves understanding the interplay of various factors in real-time, a capability that LLMs, without real-time feedback mechanisms, inherently lack.

"And I like you, don't wanna fight you. Know I'll always love you, but right now I got my life to live." Killer Mike

LLMs are valuable tools, but they are merely supplements to, not replacements for, the nuanced expertise of human professionals. LLMs do not encompass the depth, adaptiveness, or ethical dimensions of human expertise. These insights underline the indispensable role of human judgment and experience across enterprises. 

Ben Shneiderman’s advocacy for a “human-in-the-loop” approach to AI deployment reinforces this perspective. It emphasizes human oversight in leveraging AI’s capabilities while ensuring contextually informed decision-making.

The essence of expertise is vast and subtly intricate, requiring more than just factual recall but a deep, adaptive understanding that only human experience can provide. So we Osebergers believe…

  • Data extraction AI software requires human intervention to edit and QC the work. While it can make you more efficient, it can't do the whole job for you.
  • Our well-trained data operations team knows how to create structured data from unstructured data.
  • We leverage AI technology where practical to scale and improve our efficiency in creating data.
  • Our data analysts' data quality review process is a feedback loop that continually improves our machine learning models.
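The bullets above describe a triage-and-correct cycle. Here is a minimal, hypothetical sketch of what that feedback loop might look like; every name and threshold is ours for illustration, not a description of Oseberg’s actual platform:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    """One field the model proposes, with its confidence."""
    field_name: str
    value: str
    confidence: float

@dataclass
class TrainingExample:
    """An analyst-verified value, banked for the next model iteration."""
    field_name: str
    corrected_value: str

def triage(extractions, threshold=0.90):
    """Split model output into auto-accepted fields and an analyst review queue."""
    accepted = [e for e in extractions if e.confidence >= threshold]
    review_queue = [e for e in extractions if e.confidence < threshold]
    return accepted, review_queue

def apply_corrections(review_queue, corrections):
    """Analyst decisions become both final values and labeled training data;
    that second output is what makes it a feedback loop."""
    finalized, training_data = [], []
    for e in review_queue:
        value = corrections.get(e.field_name, e.value)
        finalized.append(Extraction(e.field_name, value, 1.0))
        training_data.append(TrainingExample(e.field_name, value))
    return finalized, training_data

# A lease with one confident field and one shaky one
proposed = [Extraction("royalty", "3/16", 0.95), Extraction("term_years", "3", 0.60)]
accepted, queue = triage(proposed)
finalized, training = apply_corrections(queue, {"term_years": "5"})
```

The point of the shape: the analyst’s keystroke does double duty as QC and as tomorrow’s training label.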

When you overly rely on a machine to do something…

grounded ambition

We really wanted to embrace unsupervised machine learning models (computers train themselves to identify the elements we want to extract from documents and extract them programmatically), but we found there wasn’t enough high-quality structured training data or documented SME knowledge (the underlying data needed to create machine learning models) for unsupervised model development to succeed. We needed humans, too. Dammit. But it would have been so kewl.

Cue the robot failing videos:

  • Oseberg: overhauling chaotic, unstructured documents into pristine, organized datasets because, honestly, someone had to do it. you think the AI beast is gonna feed itself? for the ones nodding yes, keep reading.
  • When the market's solutions were more patchwork or vendor lock-in than practical or desired, we didn't just make do with baling wire and duct tape; we developed our own state-of-the-art data extraction platform. that whole build vs. buy dilemma.
  • Embraced human-assisted machine learning tech over 15 years ago—yes, "human-in-the-loop," before it was cool—a gentle reminder: robots haven't taken over yet. Human judgment remains non-negotiable in our process because, frankly, AI still asks too many dumb questions.
  • We've spent 15 years ignoring marketing, which is impressive. Our approach to explaining our uniqueness has been so subtle that it might as well be a secret handshake. To the ones who saw us, we love you for it (secret handshake).

Our online, cloud-based data extraction platform allows us to:

dStill was created to overcome Rob Anott’s skepticism that text extraction technology could create the data needed to systematize most workflows. A ton of valuable data is trapped in PDF and TIFF files, and we needed text extraction technology to access it. This value was “hidden in plain sight,” but we needed a scalable solution to create the data. We believe we cracked the code for the workflow to create this data at scale:


Source Data

• Source data is ingested into dStill


Image Processing

• Image preparation (rotation correction, size resampling, etc.) to ensure OCR quality.
  • OCR extracts text from the document image and inserts it into a new, searchable PDF version of the original.


Data Extraction

• Document classification
• Supervised ML Models (select domains)
• Technology-assisted data extraction


Data Export

• Save to database
• Exportable to Excel format for easy viewing
• Data can be delivered in any format or integrated directly into applications
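The four stages above can be chained as a toy pipeline. Everything here is a hypothetical stand-in; the real dStill is proprietary, and the “OCR” step below just joins strings rather than calling an actual OCR engine:

```python
import csv
import io

def ingest(raw_pages):
    """Stage 1, Source Data: wrap raw pages (stand-in strings here)."""
    return {"pages": raw_pages, "text": None, "fields": {}}

def ocr(doc):
    """Stage 2, Image Processing: a real system deskews/resamples the image
    and runs an OCR engine; this sketch just joins the stand-in strings."""
    doc["text"] = "\n".join(page.strip() for page in doc["pages"])
    return doc

def extract(doc):
    """Stage 3, Data Extraction: real extraction uses document classification
    and supervised ML models; this sketch parses simple 'Key: Value' lines."""
    for line in doc["text"].splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            doc["fields"][key.strip()] = value.strip()
    return doc

def export(doc):
    """Stage 4, Data Export: emit structured rows (CSV here; the real
    platform saves to a database or delivers Excel)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["attribute", "value"])
    for key, value in doc["fields"].items():
        writer.writerow([key, value])
    return buf.getvalue()

# Chain the four stages on a toy "scanned" lease
structured = export(extract(ocr(ingest(["  Lessor: Jane Doe ", "Royalty: 3/16"]))))
```

The design point is the hand-off contract between stages: each stage enriches the same document record, so any stage can be swapped (a better OCR engine, a new extraction model) without touching the others.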

Oseberg’s proprietary unstructured data extraction platform was developed to create our commercial courthouse and regulatory data products.

Machine Learning models are combined with Oseberg analyst teams to build datasets efficiently.

Millions of documents, representing well over tens of millions of pages, have been processed.

Technology alone can’t do all the work.

High Variability of Real Asset Data
  • Poor Image Quality (Up to 20% of images)
  • Corrupt Files (2-3% of images)
  • Collated and de-collated data intermixed
  • Multi-document files
  • Duplicate images
  • Incomplete Scans
  • Strikethroughs and Addendums
  • Hand-written materials
Imperfect Predictive Models
  • ML models must be supervised; otherwise errors compound and proliferate
  • Even state-of-the-art models achieve F-scores of only 75%–90%
QA/QC and Interpretation
  • Analysts must clean up mistakes introduced by compounding errors
  • Quality analyst staff
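Since F-scores carry a lot of weight in this argument, here is what the metric actually computes, plus a back-of-envelope look at why unsupervised errors compound (all numbers illustrative):

```python
def f_score(true_positives, false_positives, false_negatives):
    """F1: the harmonic mean of precision and recall, the metric behind
    the 75%-90% figures quoted for state-of-the-art extraction models."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Illustrative only: a model that finds 85 of 100 real clauses
# while also flagging 10 bogus ones
score = f_score(true_positives=85, false_positives=10, false_negatives=15)

# Why unsupervised errors compound: even at 90% per-field accuracy,
# the probability that all 10 fields on one document come out right
all_ten_correct = 0.90 ** 10  # roughly one document in three
```

A per-field score that sounds respectable collapses quickly at the document level, which is exactly why the analyst review loop is non-negotiable.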

Overcoming a host of problems, including those listed above, dStill opened the floodgates to structuring data for Oseberg. We were now primed to create endless amounts of structured data from public images at scale and at low cost. $XXMM later, farming in Montana looked even more promising. If an average completion filing had 150 attributes, and an average oil & gas lease had 50 paragraphs carrying commercial terms that materially impacted the value of reserves, there was no way to scalably create structured data from that volume of images without machine learning and human-assisted technology.

With dStill, the economics carried a glimmer of hope for making the unit economics make cents.  

Current prediction models are delivering roughly 90% F-scores on courthouse documents containing difficult-to-extract, high-value content (a few examples):

  • Pugh Clauses
  • Royalty Deduction Clauses
  • Depth Clauses
  • Cessation of Production Clauses
  • Continuation of Drilling Clauses
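For flavor, here is what a naive, regex-only baseline for flagging clauses like these might look like. Oseberg’s production models are supervised ML, so these patterns are purely illustrative, and the clause keywords are our own guesses at typical lease language:

```python
import re

# Hypothetical regex baselines for flagging lease clauses. Real extraction
# uses supervised ML models; these patterns are a first-pass sketch only.
CLAUSE_PATTERNS = {
    "pugh": re.compile(
        r"\bpugh\b|release.{0,40}(?:acreage|lands?)\s+not\s+(?:included|pooled)", re.I),
    "royalty_deduction": re.compile(
        r"royalt(?:y|ies).{0,80}(?:free of|without deduction|less\b.{0,40}costs)", re.I),
    "depth": re.compile(r"\bdepths?\b.{0,60}(?:below|beneath|stratigraphic)", re.I),
    "cessation_of_production": re.compile(r"cessation of production", re.I),
    "continuous_drilling": re.compile(r"continuous(?:ly)?\s+drill", re.I),
}

def flag_clauses(paragraph):
    """Return the clause types whose pattern matches the paragraph text."""
    return [name for name, pattern in CLAUSE_PATTERNS.items()
            if pattern.search(paragraph)]
```

A regex baseline like this is cheap and explainable, but it is exactly the sort of brittle pattern-matching that tops out well below the F-scores a trained model can reach, which is why it stays a baseline.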

These investments have shifted more resources toward quality rather than raw data capture.

Now, we could create structured data at scale, but we needed to start thinking about how to create value from that underlying dataset. 

“Get in, loser.” We’re now going to start talking about how to create value with data in upstream oil and gas or how to systematize workflows.

L,