Extracting Honolulu Arrest Records from PDFs with AI
honoluluarrestlogs.com provides the last two weeks of arrest records from the Honolulu Police Department. The site makes charges, locations, and release details searchable (with names and addresses partially redacted), whereas the official source publishes the logs as “flattened” PDFs that render tables of text as static content rather than selectable characters. The automated extraction pipeline relies on Google’s Gemini 3 models and, as of January 2026, has processed nearly 2,000 records from over 100 documents.
How It Works
The system uses Google’s Gemini 3 model family. The Flash model is faster and cheaper, so it handles the bulk of the extraction work. The Pro model has stronger reasoning, so it is used in cases where judgment matters. All models run at temperature 1.0.
The process relies on self-consistency. Because hallucinations are non-deterministic, independent attempts to read the same data will rarely make the exact same mistake, so agreement across attempts is a strong signal of accuracy. The system runs three extractions in parallel and accepts data automatically only if all three match perfectly. If they disagree, a stronger reasoning model resolves the conflict (the full loop is sketched in code after the diagram below).
flowchart TD
PDF[PDF document] --> E1[Extractor 1]
PDF --> E2[Extractor 2]
PDF --> E3[Extractor 3]
E1 --> Compare{100% match?}
E2 --> Compare
E3 --> Compare
Compare -->|Yes| Accept[Accept result]
Compare -->|No| Arbiter[Arbiter reviews]
Arbiter --> Auditor{Auditor approves?}
Auditor -->|Yes| Accept
Auditor -->|No| Retry[Retry from start]
Retry --> PDF
Accept --> DB[(Database)]
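The whole loop fits in a few lines of Python. This is a minimal sketch, assuming `extract`, `arbitrate`, and `audit` are callables that wrap the actual Gemini queries; none of these names come from the real codebase.

```python
def extract_with_consensus(pdf, field, extract, arbitrate, audit, max_rounds=5):
    """Minimal sketch of the consensus loop shown in the diagram."""
    for _ in range(max_rounds):
        # Three independent extraction attempts per round.
        attempts = [extract(pdf, field) for _ in range(3)]

        # Unanimous agreement: accept automatically (the >96% path).
        if attempts[0] == attempts[1] == attempts[2]:
            return attempts[0]

        # Disagreement: the arbiter picks a winner (or corrects all three),
        # then the adversarial auditor reviews that decision.
        decision = arbitrate(pdf, field, attempts)
        if audit(pdf, field, decision):
            return decision
        # Auditor rejected the arbiter: restart the round from scratch.
    raise RuntimeError(f"no consensus reached for field {field!r}")
```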
Extractors
Each arrest record includes 12 details, among them the person’s name, age, race, charge, location, and the arresting officer. Instead of extracting one full arrest entry at a time, the system collects all values for one specific detail across the entire document (i.e., gathering all report numbers first, then all arrest dates, and so on). After collecting these lists, the system stitches them back together by position to rebuild the complete records.
This approach is an application of task decomposition, in which a complex problem is broken into smaller, simpler sub-tasks. Extracting a complete multi-field record while handling page breaks, truncated text, and positional ambiguity (a name that appears once at the top of a page but applies to dozens of subsequent charges) is error-prone. By contrast, focusing on just one data type at a time allows prompting instructions that are specific, simple, and more reliable.
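As a concrete illustration, here is a sketch of the recombination step, assuming an `extract_field` callable that returns one consensus-checked list per field. The field names are an illustrative subset of the twelve, not the production schema.

```python
FIELDS = ["report_number", "arrest_date", "name", "age", "race",
          "charge", "location", "officer"]  # illustrative subset

def extract_records(pdf, extract_field):
    # One list of values per field, each in document order.
    columns = {field: extract_field(pdf, field) for field in FIELDS}

    # Every list must be the same length, or records can't be aligned.
    lengths = {field: len(values) for field, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"field lists disagree on record count: {lengths}")

    # Rebuild row-oriented records from the column-oriented lists.
    return [dict(zip(FIELDS, row))
            for row in zip(*(columns[f] for f in FIELDS))]
```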
For most fields, the system uses the Gemini 3 Flash model. For fields where Flash is more susceptible to error (specifically those containing unusual names and locations), the system uses two Flash models and one Pro model. If only Flash were used, there would be a higher risk that all three attempts make the exact same mistake; adding the more accurate Pro model reduces this risk. When all three results match, the data is accepted automatically, which happens over 96% of the time.
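In code, this routing might look like the following sketch; the model identifiers and field names are assumptions, not the production configuration.

```python
HIGH_RISK_FIELDS = {"name", "officer", "location"}

def ensemble_for(field):
    # High-risk fields get a Pro third opinion, so all three attempts
    # are unlikely to share the same failure mode.
    if field in HIGH_RISK_FIELDS:
        return ["gemini-3-flash", "gemini-3-flash", "gemini-3-pro"]
    return ["gemini-3-flash"] * 3
```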
Arbiter
When the extraction results disagree (less than 4% of the time), an arbiter (Gemini 3 Pro) compares the three options against the original PDF. It selects the correct one or, if all three are wrong, provides a correction. This is the LLM-as-a-judge technique: because assessing text is easier than generating it, a stronger model evaluates the faster models’ work and only writes a new answer if all three attempts failed.
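A sketch of what such an arbiter call might look like with the google-genai Python SDK; the model id, prompt wording, and response handling are all assumptions rather than the site’s actual code.

```python
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

def arbitrate(pdf_file, field, attempts):
    # pdf_file is a File previously uploaded via client.files.upload().
    prompt = (
        f"Three extraction attempts for the field '{field}' disagree:\n"
        + "\n".join(f"{i + 1}. {a}" for i, a in enumerate(attempts))
        + "\nCompare them against the attached PDF. Return only the correct "
        "value, or a corrected value if all three are wrong."
    )
    response = client.models.generate_content(
        model="gemini-3-pro",        # assumed model id
        contents=[pdf_file, prompt],
    )
    return response.text.strip()
```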
Auditor
After the arbiter makes a decision, an auditor (Gemini 3 Pro) reviews it. The auditor’s prompt is intentionally strict and adversarial: “do not give the benefit of the doubt” and “reject if you find ANY error.” This acts as a final safety net against rare mistakes; about 3% of arbiter decisions are rejected. When a result is rejected, the entire extraction restarts from the beginning and repeats until there is full agreement.
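The audit step might reduce to a strict yes/no check, as in this sketch. Only the quoted instructions come from the actual prompt; the model id and the APPROVE/REJECT protocol are assumptions.

```python
from google import genai

client = genai.Client()

def audit(pdf_file, field, decision):
    prompt = (
        f"An arbiter decided that the field '{field}' reads: {decision}\n"
        "Verify this against the attached PDF. Do not give the benefit of "
        "the doubt. Reject if you find ANY error. "
        "Answer with exactly APPROVE or REJECT."
    )
    response = client.models.generate_content(
        model="gemini-3-pro",  # assumed model id
        contents=[pdf_file, prompt],
    )
    return response.text.strip().upper().startswith("APPROVE")
```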
Error Patterns
When discrepancies arise, they usually involve two specific data types: names and streets.
Names
The smaller Flash model sometimes adds, drops, or swaps letters.
- “ISAIAH” → “ISIAIAH” (hallucinated extra “I”)
- “BUMGARDENER” → “BUMGARDNER” (missing “E”)
- “RODERIC” → “RODERICK” (hallucinated “K”)
- “JEFF” hallucinated as “CLEAR”
In one case, both Flash extractors misspelled “KEAWEMAUHILI” as “KEAWEWMAUHILI”, while only the Pro model spelled it correctly.
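One-letter slips like these are exactly what the exact-match comparison catches. A quick standard-library check shows where two attempts diverge:

```python
import difflib

correct, flash = "KEAWEMAUHILI", "KEAWEWMAUHILI"
matcher = difflib.SequenceMatcher(None, correct, flash)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, repr(correct[i1:i2]), "->", repr(flash[j1:j2]))
# insert '' -> 'W'   (the hallucinated extra letter)
```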
Streets
Hawaiian street names present similar challenges, often resulting in dropped letters.
- “KALANIANAOLE HWY” → “KALIANAOLE” (missing “AN”)
- “KAPAHULU AVE” → “KAPAULU” (missing “H”)
- “KULAAUPUNI ST” → “KULAUPUNI” (missing “A”)
The Pro model demonstrates better character-level reasoning for these street names and resolves discrepancies correctly.
Cost Analysis
It costs about $0.06 to process each PDF, based on data from 110 files containing ~2,000 records. With four documents published daily, the total cost is around $0.24 per day. This cost structure depends on context caching, a feature that lets a file be uploaded once and then referenced in later queries at one-tenth the standard input rate.
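With the google-genai Python SDK, the caching flow might look like this sketch; the model id is an assumption, and minimum cacheable sizes and exact pricing vary by model.

```python
from google import genai
from google.genai import types

client = genai.Client()

# Upload the PDF once, cache it, then reference the cache in every
# subsequent query at the reduced token rate.
pdf_file = client.files.upload(file="arrest_log.pdf")
cache = client.caches.create(
    model="gemini-3-flash",  # assumed model id
    config=types.CreateCachedContentConfig(contents=[pdf_file]),
)
response = client.models.generate_content(
    model="gemini-3-flash",
    contents="List every report number in this document, in order.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.text)
```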
Breakdown by Model
- Gemini 3 Flash handles the bulk of the work (~33 queries per document), costing $0.038 per document.
- Gemini 3 Pro is used selectively (~4 queries per document) for resolving conflicts and difficult fields, costing $0.023 per document.
The Math
Here is the average usage data per document (each PDF averages about 5 pages). “Input” refers to the prompt instructions sent with each query. “Cache” refers to the PDF content (~2,700 tokens). “Output” refers to the model’s response.
Bulk Extraction (Flash)
9 fields × 3 queries + 3 fields × 2 queries = 33 queries per document.
- Input: ~12,500 tokens × $0.50/M ≈ $0.0063
- Cache: 33 queries × ~2,700 tokens = ~89,100 tokens × $0.05/M ≈ $0.0045
- Output: ~9,000 tokens × $3.00/M ≈ $0.0270
- Subtotal: ~$0.038 per PDF
High-Reasoning Extraction (Pro)
3 fields × 1 query = 3 queries per document. These are the third extractor for the high-risk fields (Name, Officer, Location); their first two attempts are counted in the Flash total above.
- Input: ~1,600 tokens × $2.00/M ≈ $0.0032
- Cache: 3 queries × ~2,700 tokens = ~8,100 tokens × $0.20/M ≈ $0.0016
- Output: ~800 tokens × $12.00/M ≈ $0.0096
- Subtotal: ~$0.014 per PDF
Conflict Resolution (Pro)
Arbiter and auditor reviews resolve disagreements, which occur for ~4% of fields and average ~1 query per document.
- Input: ~2,200 tokens × $2.00/M ≈ $0.0044
- Cache: ~1 query × ~2,700 tokens = ~2,700 tokens × $0.20/M ≈ $0.0005
- Output: ~300 tokens × $12.00/M ≈ $0.0036
- Subtotal: ~$0.009 per PDF
Grand Total
$0.038 + $0.014 + $0.009 = $0.061 per PDF
(Note: Context caching also incurs a storage fee of $1.00 (Flash) or $4.50 (Pro) per million tokens per hour. For these small PDF documents processed in a few minutes, this storage cost is negligible at <$0.002 per document and is excluded from the total above.)
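The arithmetic above reproduces as a quick script:

```python
def cost(tokens, price_per_million):
    """Cost of a token count at a $/1M-tokens rate."""
    return tokens * price_per_million / 1e6

flash       = cost(12_500, 0.50) + cost(33 * 2_700, 0.05) + cost(9_000, 3.00)
pro_extract = cost(1_600, 2.00) + cost(3 * 2_700, 0.20) + cost(800, 12.00)
pro_arbiter = cost(2_200, 2.00) + cost(2_700, 0.20) + cost(300, 12.00)

print(f"{flash:.4f} + {pro_extract:.4f} + {pro_arbiter:.4f} "
      f"= {flash + pro_extract + pro_arbiter:.4f}")
# 0.0377 + 0.0144 + 0.0085 = 0.0607  ->  ~$0.06 per PDF
```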
Practical Applications
Structured arrest data has a range of potential uses. Criminal defense attorneys, for example, could use it to identify prospective clients by filtering on arrest location, charge type, or other criteria relevant to their practice. Others may want to track areas where arrests are frequent and representation is often needed quickly.
More broadly, this searchable dataset allows analysis that would otherwise require manual extraction from every PDF. It enables tracking policing trends, such as which crimes are most common and whether certain neighborhoods are targeted more than others, and it makes it easier to study demographics and watch for unequal treatment across different groups or areas.
Conclusion
With multiple layers of checks, an error enters the database only if every single review fails in the exact same way, and the chance of that drops with each additional check. The system assumes the AI will make mistakes and is designed to catch them before they spread. The result is reliable data, and because the process is so cheap, adding further safeguards is always practical when higher certainty is needed.