Data Engineer (OCR & Data Pipelines, Contract)

Posted by Intelance

Contract
Part Time
England, United Kingdom
Job Description

Intelance is a specialist architecture and AI consultancy working with clients in regulated, high-trust environments (healthcare, pharma, life sciences, financial services). We are assembling a lean senior team to deliver an AI-assisted clinical report marking tool for a UK-based, UKAS-accredited organisation in human genetic testing.

We are looking for a Data Engineer (OCR & Pipelines) who can turn messy PDFs and documents into clean, reliable, auditable data flows for ML and downstream systems. This is a contract / freelance role (2-3 days/week) working closely with our AI Solution Architect, Lead ML Engineer, and Integration Engineer.

Tasks
  • Design and implement the end-to-end data pipeline for the project:
    • Ingest PDF/Word reports from secure storage
    • Run OCR / text extraction and layout parsing
    • Normalise, structure, and validate the data
    • Store outputs in a form ready for ML and integration.
  • Evaluate and configure OCR / document AI services (e.g. Azure Form Recognizer or similar), and wrap them in robust, retry-safe, cost-aware scripts/services (a minimal retry sketch follows this list).
  • Define and implement data contracts and schemas between ingestion, ML, and integration components (JSON/Parquet/relational as appropriate).
  • Build quality checks and validation rules (field presence, format, range checks, duplicate detection, basic anomaly checks); an illustrative contract-and-validation sketch appears after this list.
  • Implement logging, monitoring, and lineage so every processed document can be traced from source > OCR > structured output > model input (see the lineage sketch below).
  • Work with the ML Engineer to ensure the pipeline exposes exactly the features and metadata needed for training, evaluation, and explainability.
  • Collaborate with the Integration Engineer to deliver clean batch or streaming feeds into the client's assessment system (API, CSV exports, or SFTP drop-zone); a batch-export sketch is shown below.
  • Follow good security and privacy practices in all pipelines: encryption, access control, least privilege, and redaction where needed.
  • Contribute to infrastructure decisions (storage layout, job orchestration, simple CI/CD for data jobs).
  • Document the pipeline clearly: architecture diagrams, table/field definitions, data dictionaries, operational runbooks.
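
To give a flavour of what "retry-safe, cost-aware" means in practice, here is a minimal Python sketch of an OCR wrapper with exponential backoff. The extract_text call and the OcrResult shape are illustrative placeholders, not a specific vendor SDK; the real wrapper would sit around whichever document AI service is chosen.

```python
import logging
import random
import time
from dataclasses import dataclass

logger = logging.getLogger("ocr_wrapper")


@dataclass
class OcrResult:
    document_id: str
    text: str
    page_count: int


class TransientOcrError(Exception):
    """Raised for retryable failures (throttling, timeouts)."""


def extract_text(document_id: str, content: bytes) -> OcrResult:
    """Placeholder for a vendor OCR / document-analysis call."""
    raise NotImplementedError


def ocr_with_retries(document_id: str, content: bytes,
                     max_attempts: int = 4, base_delay: float = 2.0) -> OcrResult:
    """Call the OCR service with exponential backoff and jitter.

    Only transient failures are retried, so permanent errors (bad input,
    auth problems) surface immediately instead of burning paid API calls.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return extract_text(document_id, content)
        except TransientOcrError as exc:
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            logger.warning("OCR attempt %d/%d for %s failed (%s); retrying in %.1fs",
                           attempt, max_attempts, document_id, exc, delay)
            time.sleep(delay)
```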
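The data-contract and quality-check bullets could translate into something like the following stdlib-only sketch: a typed record for one extracted report plus a small set of presence and range checks. The field names here are invented for illustration; the real schema would be agreed with the ML and Integration Engineers.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ExtractedReport:
    """Illustrative data contract for one OCR-processed report."""
    document_id: str
    source_file: str
    report_date: Optional[date]
    marker_values: dict          # e.g. {"analyte_x": 1.23} -- hypothetical field
    ocr_confidence: float        # mean OCR confidence in [0, 1]


def validate(report: ExtractedReport) -> List[str]:
    """Return a list of human-readable validation errors (empty = passes)."""
    errors: List[str] = []
    if not report.document_id:
        errors.append("document_id is missing")
    if report.report_date is None:
        errors.append("report_date could not be parsed")
    if not 0.0 <= report.ocr_confidence <= 1.0:
        errors.append(f"ocr_confidence out of range: {report.ocr_confidence}")
    if not report.marker_values:
        errors.append("no marker values extracted")
    return errors
```

In practice these records would then be serialised to JSON or Parquet per the agreed contract, with duplicate detection keyed on a stable document ID.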
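For the traceability bullet, one lightweight approach (an assumption, not a mandated design) is to append one JSON lineage record per processing stage, keyed by a content-derived document ID, so any model input can be walked back to its source file.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

LINEAGE_LOG = Path("lineage.jsonl")   # illustrative location


def document_id(content: bytes) -> str:
    """Stable ID derived from file content, so re-ingests are detectable."""
    return hashlib.sha256(content).hexdigest()[:16]


def record_stage(doc_id: str, stage: str, detail: dict) -> None:
    """Append one lineage event: which document, which stage, when, and any detail."""
    event = {
        "document_id": doc_id,
        "stage": stage,          # e.g. "ingest", "ocr", "normalise", "model_input"
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "detail": detail,
    }
    with LINEAGE_LOG.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
```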
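Finally, the integration feed might be as simple as writing a validated CSV and dropping it on an SFTP endpoint. The sketch below uses the paramiko library with invented host, path, and column names purely for illustration; the actual transport and format would be driven by the client's assessment system.

```python
import csv
from pathlib import Path

import paramiko  # third-party: pip install paramiko


def write_batch_csv(rows: list[dict], out_path: Path) -> Path:
    """Write validated records to a CSV export with a stable column order."""
    fieldnames = ["document_id", "report_date", "ocr_confidence"]  # illustrative columns
    with out_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return out_path


def push_to_sftp(local_path: Path, host: str, username: str, key_file: str,
                 remote_dir: str = "/dropzone") -> None:
    """Upload the export to the client's SFTP drop-zone (all names are placeholders)."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(hostname=host, username=username, key_filename=key_file)
    try:
        sftp = client.open_sftp()
        sftp.put(str(local_path), f"{remote_dir}/{local_path.name}")
        sftp.close()
    finally:
        client.close()
```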
Requirements

Must-have

  • 3-5+ years of hands-on Data Engineering experience.
  • Strong Python skills, including building and packaging data processing scripts or services.
  • Practical experience with OCR / document processing (e.g. Tesseract, Azure Form Recognizer, AWS Textract, Google Document AI, or equivalent).
  • Solid experience building ETL / ELT pipelines on a major cloud platform (ideally Azure, but AWS/GCP is fine if you're comfortable switching).
  • Good knowledge of data modelling and file formats (JSON, CSV, Parquet, relational schemas).
  • Experience implementing data quality checks, logging, and monitoring for pipelines.
  • Understanding of security and privacy basics: encryption at rest/in transit, access control, secure handling of potentially sensitive data.
  • Comfortable working in a small, senior, remote team; able to take a loosely defined problem and design a clean, maintainable solution.
  • Available for 2-3 days per week on a contract basis, working largely remotely from the UK or nearby European time zones.

Nice-to-have

  • Experience in healthcare, life sciences, diagnostics, or other regulated environments.
  • Familiarity with Azure Data Factory, Azure Functions, Databricks, or similar orchestration/compute tools.
  • Knowledge of basic MLOps concepts (feature stores, model input/output formats).
  • Experience with SFTP-based exchanges and batch integrations with legacy systems.
Benefits
  • Core impact role: you own the pipeline that makes the entire AI solution possible - without you, nothing moves.
  • Meaningful domain: your work supports external quality assessment in human genetic testing for labs worldwide.
  • Lean, senior team: work alongside experienced architects and ML engineers; minimal bureaucracy, direct access to decision-makers.
  • Remote-first, flexible: work from anywhere compatible with UK hours, 2-3 days/week.
  • Contract / freelance: competitive day rate, with potential extension into further phases and additional schemes if the pilot is successful.
  • Opportunity to build reusable data pipeline components that Intelance will deploy across future AI engagements.

We review every application personally. If there's a good match, we'll invite you to a short call to walk through the project, expectations, and next steps.