AI Needs Well-Formatted Data

We need well-formatted data in metrology to enable AI to generate reliable results

A couple of months ago, I presented at NCSLI on AI (Artificial Intelligence) and why AI doesn’t work so well in Metrology. At the heart of the problem is the mess of unstructured data.

Large Language Models (LLMs) operate by predicting the most probable next word in a sequence. Trained on massive datasets of text, they learn complex linguistic patterns to generate coherent and contextually relevant responses. When given a prompt, an LLM does not understand it in a human sense; instead, it statistically determines which words should follow based on its training. This “next-word prediction” mechanism, while simple in concept, enables LLMs to perform sophisticated tasks like writing essays and answering questions.

However, this text-based architecture makes LLMs fundamentally ill-suited for the calculations often required to analyze scientific data. Lacking an inherent understanding of mathematical principles, they don’t perform arithmetic the way a calculator does. Instead, an LLM guesses an answer by recalling how similar calculations appeared in its training data. This probabilistic approach makes them prone to errors and to “hallucinating” numerically unsound results, because their strength lies in linguistic fluency, not the deterministic accuracy that computation requires.

Writing automated calibration procedures is 80% about getting the test points, test limits, and all of the unit-under-test’s settings correct. Looking up and calculating test points from an equipment specification is NOT something LLMs are trained to do, and asking them to try can lead to severe hallucinations. It is like having an untrained calibration technician guess what the test limits should be.

Manufacturers present specifications in columns, a format intuitive for human readers. LLMs, however, process text sequentially, reading left to right. This fundamental mismatch causes them to struggle to associate vertically aligned values, labels, and units, creating a significant bottleneck and leading to errors in automated data extraction and analysis.
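To make the mismatch concrete, here is a sketch (with hypothetical values) of what a columnar spec table can look like once it has been flattened into a left-to-right token stream, next to a structured record that keeps each value attached to its label and unit:

```python
# Sketch of the column problem. Values are hypothetical, for illustration.

# What a column-formatted table can become after naive text extraction:
# the vertical association between each range and its accuracy is gone.
flattened = "Range 100 mV 1 V 10 V Accuracy 0.0040 0.0030 0.0035"

# A structured record removes the ambiguity entirely — nothing is left
# for a model to infer from position on the page.
specs = [
    {"range": "100 mV", "accuracy_pct_reading": 0.0040},
    {"range": "1 V",    "accuracy_pct_reading": 0.0030},
    {"range": "10 V",   "accuracy_pct_reading": 0.0035},
]
for s in specs:
    print(s["range"], "->", s["accuracy_pct_reading"])
```

A model (or any parser) consuming the second form never has to reconstruct which number belongs to which range.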

What everybody writing calibration automation wants an AI to do is write the automated calibration procedure. So I thought I would give it a try on my AI server running Ollama.

After uploading the HP 34401A’s calibration procedure and the specifications for the Fluke 5730A, along with the Fluke MET/CAL® help files, I asked several LLMs to write a MET/CAL procedure to calibrate the 34401A with the 5730A.

Based on my previous experience with AI, I didn’t expect a successful outcome. I also didn’t expect such a resounding failure. Though I was aware of the structural issues with the specification formatting in the manuals, I didn’t realize the potential for errors was multi-layered.

Not only are the specifications in a column format, but those columns also hold different specifications for different time periods. The 34401A has 24 Hour (23 °C ± 1 °C), 90 Day (23 °C ± 5 °C), and 1 Year (23 °C ± 5 °C) specifications, with an additional, confusing column for Temperature Coefficient /°C (0 °C – 18 °C and 28 °C – 55 °C).

In my initial prompt, I didn’t specify a calibration interval or temperature, but don’t worry, updating the prompt didn’t help!

Then I opened up the Fluke 5730A specifications and found even more potential confusion for an LLM. The intervals are similar, with the 5730A adding a 180 Day set of specifications, but the problems start with confidence: the 5730A publishes specifications at two different confidence levels, 99% and 95%. Even if I put a confidence level in the prompt, the 34401A documents don’t state one at all.

Temperature can also lead to confusion, because the 5730A specs are written as “Absolute / ±5 °C from calibration temperature,” with an additional temperature spec of “Relative ±1 °C” for the 24 Hour and 90 Day columns. If the AI were to write this procedure meeting 100% of the quality and uncertainty requirements, it would have to know the temperature at which the 5730A was calibrated, something I didn’t add to the prompt.
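All of these ambiguities — interval, confidence level, calibration temperature — disappear when they are explicit fields instead of column headings. The sketch below shows one possible structured spec record; the field names and values are my own assumptions for illustration, not Fluke’s published schema or numbers:

```python
# Sketch: making the disambiguating metadata explicit so that no tool
# (human, script, or LLM) has to guess which spec column applies.
# Field names and values are illustrative assumptions, not a real schema.

spec = {
    "instrument": "Fluke 5730A",
    "function": "DC Voltage",
    "range": "10 V",
    "interval": "90 Day",
    "confidence_level": 0.99,
    "temperature": {"type": "relative", "window_degC": 1.0,
                    "cal_temp_degC": 23.0},
    "uncertainty_ppm_of_output": 3.5,  # placeholder value
}

def applies(spec, interval, confidence):
    """A spec is selected only when interval AND confidence match exactly."""
    return spec["interval"] == interval and spec["confidence_level"] == confidence

print(applies(spec, "90 Day", 0.99))   # matching columns -> True
print(applies(spec, "1 Year", 0.95))   # wrong columns   -> False
```

With data in this shape, a procedure generator can filter for exactly one unambiguous spec, instead of an LLM silently blending the 99% and 95% columns.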

I know this is a long way of saying “We need well-formatted data in metrology to enable AI to generate reliable results!”