Pronab Chandra Roy

Big Data Engineer
Open to Healthcare Research Collaborations

Research Profile

By profession, I am a Big Data Engineer at ACI Limited, one of the largest conglomerates in Bangladesh, where I work on reliable data pipelines, KPI datasets, and analytics-ready reporting systems. Before this, I worked in US healthcare RCM at Augmedix / Commure, where I handled claims, denials, reimbursements, billing operations, and EHR-driven patient encounter workflows. I am now building public, reproducible projects around CMS DE-SynPUF, TCGA/GTEx, and MIMIC-IV to prepare for PhD labs working on large healthcare datasets, clinical data engineering, healthcare AI, and biomedical data mining.

Education

B.Tech in Computer Science and Engineering

National Institute of Technology, Silchar

Assam, India · Dec 2020 – June 2024
CGPA: 8.71/10.00 · Top 5% in the Department
Thesis: Advanced Image Denoising through Combined Directional Wavelet based NLM and BM3D Technique.

Career Journey

2020

B.Tech at NIT Silchar

Started Computer Science & Engineering with ICCR Scholarship

2023

Research Intern at SN Bose

Haar Wavelet & RLE research leading to IEEE publication

2023

IEEE Publication

Published at IEEE ICIC 2023, Indonesia

2024

Data Analyst at Augmedix/Commure

US healthcare RCM analytics and reporting

2026

Big Data Engineer at ACI Limited

Production data pipelines, KPI datasets, and analytics reporting

Publications

Combining Haar Transform and Row-Column Directional RLE in Image Compression

A. D. Emon, P. C. Roy, L. D. Singh, and A. K. Saha

TLDR: Combined Haar Transform with Row-Column Directional RLE for image compression; analyzed pixel-level transforms and structural sparsity for lossless encoding.

Work Experience

Big Data Engineer

ACI Limited

Jan 2026 – Present · Dhaka, Bangladesh
6 months · Current
  • Maintain and improve production data pipelines so that business teams receive clean, reliable data on time.
  • Build SQL- and Python-based KPI datasets, reporting tables, and documentation for repeatable analytics.

Data Analyst

Augmedix / Commure

Nov 2024 – Jan 2026 · Dhaka, Bangladesh
1 year 3 months
  • Processed US healthcare RCM data covering claims, reimbursements, denials, billing operations, and payer-side patterns.
  • Built dashboards for billing quality and denial trends, reducing manual QA effort by 70%.
  • Mapped EHR workflows to billing formats across WebPT, eClinicalWorks, Epic, AdvancedMD, and DrChrono.

Research Intern – SN Bose Summer Internship

NIT Silchar

Jun 2023 – Jul 2023 · Silchar, India
2 months
  • Researched image compression techniques and implemented the algorithms in Python/NumPy, leading to an IEEE ICIC 2023 publication.

Selected Healthcare Data Projects

Real-Time Streaming Pipeline for Synthetic Healthcare Analytics

2026
Streaming Data Engineering

Architecture: This diagram details the end-to-end flow from data generation to real-time clinical visualization, highlighting the exact stream-processing mechanics.

  • Data Quality: Handled out-of-order records using event-time watermarks, ensuring temporal accuracy in clinical events.
  • Scalability: Processed 302,332 events at 1,000 events/sec with zero invalid records in the clean-run validation layer.
  • Transferability: Architecture directly maps to real-world FHIR subscription feeds for live hospital integrations.
Apache Flink
Apache Kafka
ClickHouse
Grafana
Python
Docker
Project architecture diagram 1
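
The event-time watermark idea mentioned above can be illustrated with a small, framework-free sketch. This is not the project's Flink job: the class `WatermarkBuffer`, the `max_out_of_orderness` parameter, and the sample timestamps are all hypothetical, chosen only to show how a bounded-out-of-orderness watermark lets a pipeline reorder slightly late events and set aside events that arrive too late.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    event_time: int                              # event timestamp in seconds
    payload: str = field(default="", compare=False)


class WatermarkBuffer:
    """Reorders events by event time using a bounded-out-of-orderness watermark.

    Hypothetical helper for illustration; a real stream processor manages
    watermarks per partition and per operator.
    """

    def __init__(self, max_out_of_orderness: int):
        self.max_out_of_orderness = max_out_of_orderness
        self.max_seen = None      # largest event time observed so far
        self.heap = []            # buffered events, ordered by event time
        self.late = []            # events that arrived behind the watermark

    @property
    def watermark(self):
        if self.max_seen is None:
            return float("-inf")
        return self.max_seen - self.max_out_of_orderness

    def push(self, event):
        """Accept one event and return any events now safe to emit, in order."""
        if event.event_time < self.watermark:
            self.late.append(event)               # too late: route to a side list
            return []
        heapq.heappush(self.heap, event)
        if self.max_seen is None or event.event_time > self.max_seen:
            self.max_seen = event.event_time
        released = []
        while self.heap and self.heap[0].event_time <= self.watermark:
            released.append(heapq.heappop(self.heap))
        return released

    def flush(self):
        """Emit everything still buffered, in event-time order (end of stream)."""
        return [heapq.heappop(self.heap) for _ in range(len(self.heap))]
```

Feeding arrival times 10, 12, 11, 20, 9, 21 with a five-second bound emits 10, 11, 12, 20, 21 in order and sets the event at time 9 aside as late, because the watermark had already advanced to 15 when it arrived.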

MIMIC-IV Clinical NLP Pipeline

2026
Clinical NLP & LLMs

Architecture: This diagram breaks down the NLP pipeline, showing how unstructured clinical text is de-identified, parsed, and mapped to structured clinical phenotypes.

  • Privacy-First: Implemented rigorous de-identification to safely handle Protected Health Information (PHI) in unstructured text.
  • Clinical Context: Leveraged transformer-based NLP to extract complex medical entities from noisy, jargon-heavy discharge summaries.
  • Interoperability: Mapped extracted entities to standard terminologies (ICD-10, SNOMED-CT), making unstructured notes available for SQL-based analytical querying.
Python
Transformers
Clinical NLP
MIMIC-IV
NER
Pandas
Project architecture diagram 2
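
The terminology-mapping step can be pictured with a deliberately simple dictionary lookup. This is only a sketch under stated assumptions: the tables `TERM_TO_ICD10` and `SYNONYMS` and the function `map_entities` are hypothetical, the three ICD-10 codes are a tiny hand-picked sample, and the actual pipeline may well rely on a full terminology service rather than exact string matching.

```python
# Hypothetical lookup tables for illustration; a real mapping would be loaded
# from a maintained terminology release, not hard-coded.
TERM_TO_ICD10 = {
    "essential hypertension": "I10",
    "heart failure": "I50.9",
    "type 2 diabetes mellitus": "E11.9",
}

SYNONYMS = {
    "htn": "essential hypertension",
    "hypertension": "essential hypertension",
    "chf": "heart failure",
    "t2dm": "type 2 diabetes mellitus",
}


def normalize(mention: str) -> str:
    """Lowercase and collapse whitespace so lookups are insensitive to formatting."""
    return " ".join(mention.lower().split())


def map_entities(mentions):
    """Return (original mention, ICD-10 code or None) for each extracted mention."""
    mapped = []
    for mention in mentions:
        term = normalize(mention)
        term = SYNONYMS.get(term, term)           # resolve abbreviations first
        mapped.append((mention, TERM_TO_ICD10.get(term)))
    return mapped
```

Unmapped mentions are kept with a `None` code rather than dropped, so that coverage gaps stay visible when the resulting table is queried in SQL.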

MIMIC-IV Clinical Data Warehouse — ICU EHR Analytics

2026
Clinical Data Warehouse

Architecture: This diagram illustrates the medallion-style data engineering architecture (Bronze/Silver/Gold) used to transform raw ICU tables into ML-ready clinical datasets.

  • Time-Series Aggregation: Engineered complex "first-24-hour" feature extractions from high-frequency ICU vitals (chartevents).
  • Machine Learning Ready: Prepared baseline datasets optimized for in-hospital mortality prediction models.
  • Secure Access: Built a guarded NL-to-SQL interface that translates natural language to SQL queries while preventing patient-level data exposure.
Python
DuckDB
SQL
FastAPI
Validation
Project architecture diagram 3
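
A "first-24-hour" feature extraction can be sketched as a windowed aggregation over charted values. The snippet below is a minimal illustration, not the project's DuckDB SQL: the function `first_24h_features`, the tuple layout, and the `heart_rate` label are assumptions, and the half-open window (admission time inclusive, 24 hours later exclusive) is one reasonable convention among several.

```python
from datetime import datetime, timedelta
from statistics import mean


def first_24h_features(intime, events, window_hours=24):
    """Aggregate charted values recorded in [intime, intime + window_hours).

    events: iterable of (charttime, label, valuenum) tuples (hypothetical layout).
    Returns a flat dict such as {"heart_rate_min": ..., "heart_rate_max": ...}.
    """
    cutoff = intime + timedelta(hours=window_hours)
    by_label = {}
    for charttime, label, value in events:
        if value is None:
            continue                              # skip rows with no numeric value
        if intime <= charttime < cutoff:
            by_label.setdefault(label, []).append(value)

    features = {}
    for label, values in sorted(by_label.items()):
        features[f"{label}_min"] = min(values)
        features[f"{label}_max"] = max(values)
        features[f"{label}_mean"] = mean(values)
    return features
```

Measurements charted at or after the 24-hour mark are excluded, which keeps later observations from leaking into features meant to describe only the first day of the stay.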

CMS DE-SynPUF Medicare Claims Pipeline

2026
Claims Analytics

Architecture: This diagram shows the transformation of standardized CMS claim files into a dimensional model (Star Schema) for healthcare cost analytics.

  • Claims Standardization: Parsed complex Medicare claim formats, linking beneficiary summaries to inpatient/outpatient/prescription events.
  • Clinical Logic Implementation: Mapped ICD-10 diagnosis codes to Chronic Condition Data Warehouse (CCW) logic to identify patient chronic conditions.
  • Predictive Analytics: Created "next-year high-cost" labels, enabling actuaries and researchers to train cost-prediction models.
Python
DuckDB
SQL
Streamlit
FastAPI
Project architecture diagram 4
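
One way to picture the "next-year high-cost" label is as a join of each beneficiary-year with the same beneficiary's cost in the following year. The sketch below assumes a fixed dollar threshold and a plain dictionary keyed by `(bene_id, year)`; both are illustrative choices, and the project may define "high cost" differently (for example, as a top percentile of spending).

```python
def next_year_high_cost_labels(annual_cost, threshold):
    """Label each beneficiary-year by whether next year's cost reaches the threshold.

    annual_cost: {(bene_id, year): total_cost} (hypothetical layout).
    Returns {(bene_id, year): 0 or 1}; rows with no observed following year
    are left unlabeled rather than assumed to be low cost.
    """
    labels = {}
    for bene_id, year in annual_cost:
        next_cost = annual_cost.get((bene_id, year + 1))
        if next_cost is None:
            continue                              # no follow-up year to learn from
        labels[(bene_id, year)] = int(next_cost >= threshold)
    return labels
```

Because the label for year t depends only on spending in year t+1, features built from year t claims can be paired with it without looking into the future.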

CancerOmicsLake — TCGA/GTEx Bioinformatics Data Engineering

2026
Bioinformatics Data Engineering

Architecture: This diagram details the bioinformatics ETL process, ingesting raw genomic files and converting them into analysis-ready Parquet tables and graph databases.

  • Multi-Omics Integration: Unified tumor genomic data (TCGA) with healthy baseline expression data (GTEx) for robust differential analysis.
  • Bioinformatics Parsing: Processed complex Mutation Annotation Format (MAF) files and normalized raw RNA-seq expression counts.
  • Graph Database Ready: Modeled genomic relationships (genes, mutations, patients, pathways) into node-edge CSV formats for Neo4j visualization.
Python
Pandas
Parquet
TCGA
GTEx
Project architecture diagram 5
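
The node-edge export can be sketched with the standard `csv` module. This is an assumption-laden illustration rather than the project's exporter: the function `build_graph_csvs`, the `Gene` and `Sample` labels, and the `HAS_MUTATION_IN` relationship type are hypothetical, while `Hugo_Symbol` and `Tumor_Sample_Barcode` are standard MAF column names and the header fields follow the style used by Neo4j's bulk CSV import.

```python
import csv
import io


def build_graph_csvs(maf_rows):
    """Turn MAF-like rows into (nodes_csv, edges_csv) strings.

    maf_rows: iterable of dicts with "Hugo_Symbol" and "Tumor_Sample_Barcode".
    Duplicate nodes and edges are collapsed; output is sorted for stable diffs.
    """
    nodes = {}        # node id -> label
    edges = set()     # (start id, end id, relationship type)
    for row in maf_rows:
        gene_id = f"gene:{row['Hugo_Symbol']}"
        sample_id = f"sample:{row['Tumor_Sample_Barcode']}"
        nodes[gene_id] = "Gene"
        nodes[sample_id] = "Sample"
        edges.add((sample_id, gene_id, "HAS_MUTATION_IN"))

    node_buf = io.StringIO()
    node_writer = csv.writer(node_buf, lineterminator="\n")
    node_writer.writerow(["id:ID", ":LABEL"])
    for node_id in sorted(nodes):
        node_writer.writerow([node_id, nodes[node_id]])

    edge_buf = io.StringIO()
    edge_writer = csv.writer(edge_buf, lineterminator="\n")
    edge_writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for edge in sorted(edges):
        edge_writer.writerow(edge)

    return node_buf.getvalue(), edge_buf.getvalue()
```

Prefixing identifiers with `gene:` and `sample:` keeps node IDs unique across types, so a gene symbol and a sample barcode can never collide in the shared ID space.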

Skills & Expertise

Research Domain Distribution: Healthcare Data, Clinical NLP, Bioinformatics, EHR/Claims, Data Engineering, Biomedical Mining

Core Skills

Languages: Python, SQL, Bash
Data Engineering: ETL/ELT, Data Warehousing, DuckDB, PostgreSQL, Apache Spark, Apache Flink, Apache Kafka, ClickHouse, Airflow
Healthcare Domains: ICD-10, CPT/HCPCS Coding; CMS-1500 & UB-04 Claims; Medicare/Medicaid Insurance; EHR-to-Claims Data Mapping; Clinical Data Warehousing; Cohort Building
AI/ML & Analytics: Pandas, NumPy, Clinical NLP, Scikit-learn, Feature Engineering, Model Evaluation
Apps & Tools: FastAPI, Streamlit, Git, Linux, AWS (IAM, S3, Redshift, Lambda, EC2), Docker, pytest
Research Interests: Large Healthcare Datasets, Biomedical Informatics, Clinical NLP/LLMs, Reproducible EHR/Claims Research

Achievements & Certifications

Coding Profiles

LeetCode

Knight · 1900+ rating

Knight-ranked competitive programmer with strong DSA fundamentals.

CodeChef

4-Star · 1800+ rating

4× Top 500 Global Rank · 200+ Problems Solved · Global Rank 12 (1 contest)

Certifications

Additional Certificates - Data Engineering & Cloud (9)

Honors & Awards

ICCR Scholarship (2020–2024 Session)

Indian Council for Cultural Relations, fully funded Indian Govt. merit scholarship.

2020

Extra Curriculum

Competitive Programmer

Active on LeetCode and CodeChef; achieved Knight ranking and 4-star rating with strong algorithmic foundations.

Healthcare Data Portfolio

Maintains public, privacy-safe repositories for batch and streaming healthcare data engineering, intended for academic review and PhD outreach.

Cultural Representative, Silchar-Sylhet Festival 2023

06/10/2023 – 08/10/2023

Served as the representative of Bangladesh in this bilateral cultural festival held in India, fostering cross-border dialogue and managing cultural exchanges as a foreign student.

References

Adam Mohiuddin

Senior Manager

Commure (USA)

Dr. Anish Kumar Saha

Assistant Professor

Dept. of CSE, NIT Silchar

Get in Touch

Interested in collaborating on healthcare data engineering, clinical analytics, or biomedical data mining? Feel free to reach out!

Dhaka, Bangladesh