John Xi Qiu
How’s it going? I’m John. I hope you’re having a great day!
I’m a Machine Learning Engineer and Natural Language Processing (NLP) Scientist. I specialize in developing and deploying deep learning models at big data scale.
I’m currently at Thomson Reuters as an Applied Scientist in Natural Language Processing and Information Retrieval where I’m developing AI solutions for Legal Research, powering Westlaw and CoCounsel.
I’ve previously built and deployed AI solutions with web advertising data, customer service dialogues, and medical text records. I’ve published research while affiliated with Capital One, Oak Ridge National Lab, and the University of Tennessee, Knoxville.
I’m based in Arlington, Virginia, in the Washington, DC metro area. For fun I like board games, playing guitar or violin, powerlifting, and especially rock climbing!
Experience
Jan 2024 - Present: Thomson Reuters
Applied Scientist - Natural Language Processing and Information Retrieval
- Lead development of a scalable Retrieval Augmented Generation system for legal document summarization and question answering.
Core Technologies used: Python, OpenAI API, Azure, AWS, Pydantic, LangChain
Sept 2023 - Jan 2024: Digital Harbor
Lead Machine Learning Engineer
- Developed AI Conversational Intelligence Platform using OpenAI’s Whisper voice-to-text transcription and GPT prompt generation for video call transcription and summarization.
Core Technologies used: Python, Hugging Face Transformers, OpenAI API, LangChain
Jan 2022 - Sept 2023: NextRoll Inc.
Senior Machine Learning Engineer - Core Engineering
- Led development and deployment of a novel low-latency ad selection system for upper bid-funnel ad valuation in real-time bidding.
- Designed a low-latency neural model serving system using PyTorch for model training and Go for model serving.
- Developed a daily model refit MLOps pipeline with Apache Airflow.
- Ran A/B tests against a baseline sampling model and found that the neural model improved ad selection, increasing click-through rates by over 15% and reducing cost-per-click by over 20%.
- Followed Agile development practices and led sprint planning, scoping, and grooming of work tickets.
Technologies used: python, pyspark, pytorch, go, apache airflow, aws, s3, ec2, docker, ecr, jira, buildkite, datadog, pagerduty, sql
Feb 2019 - Jan 2022: Capital One Financial Organization
Senior -> Principal Data Scientist
- Served as technical lead developing NLP models on customer service call transcript data with Python, scikit-learn, and PyTorch.
- Deployed tools include supervised call reason detection, complaint detection, Net Promoter Score prediction, fraudster detection, and job title canonicalization. Saved over 20 million dollars yearly in 2022 in agent time costs.
- Experimented with fine-tuned BERT LLM transformer models for unsupervised theme detection in call transcripts, resulting in patent filing US20220383867A1 and first authorship on the corresponding publication.
- Wrote quarterly model monitoring reports covering model performance, stability, and explanatory analysis. Performed data ETL and analysis with SQL, Snowflake, and Databricks.
Technologies used: python, pyspark, pytorch, scikit-learn, databricks, aws, s3, ec2, docker, ecr, jira, jenkins, sql
2016 - 2019: Oak Ridge National Lab
Health Data Science Institute Researcher
- Led development of the National Cancer Institute’s NLP Pilot Program for AI cancer diagnosis using text pathology reports with Python, scikit-learn, PyTorch, TensorFlow, Theano, and Keras.
- Compared statistical NLP models (logistic regression, SVMs, XGBoost) against emerging deep learning architectures (CNNs, RNNs, LSTMs, attention networks, and LLMs including BERT).
- Applied population-scale deep learning on DOE supercomputers using distributed data parallelism.
- Developed novel approaches for class imbalance and label scarcity with multitask modeling, synthetic data (SMOTE, Snorkel), pretrained word embeddings (Gensim, GloVe), subword embeddings (fastText, byte pair encoding), and tagging (spaCy, NLTK).
- Mentored undergraduate and graduate student interns, resulting in publications and co-authorships.
Technologies used: python, pytorch, scikit-learn, tensorflow, theano, keras, gensim, glove, fastText, spacy, nltk
2015: University of Tennessee, Knoxville
Teaching Assistant - Statistics 201
Education
- 2016: M.S. Business Analytics, University of Tennessee, Haslam College of Business
- 2014: B.B.A. Economics with Minor in Math, University of Tennessee, Haslam College of Business