Language Data Scientist II, AWS AI Data | Transcribe

Remote, USA Full-time Posted 2025-11-02
About the position

Responsibilities
• Translate business, modeling, and ethical requirements in Health AI into executable data collection projects.
• Design human-in-the-loop evaluation tasks to measure the performance and usability of models in the medical domain.
• Develop the materials necessary to execute successful data collection efforts, such as guidelines, annotation interfaces, and quality assurance workflows.
• Support the sourcing and/or creation of high-quality language datasets and language artifacts for feature and language expansion.
• Analyze structured and unstructured data to provide actionable recommendations to improve data quality or model performance.
• Iterate and innovate on data collection methodologies to improve data turnaround time and reliability.
• Incorporate LLMs, prompt engineering, and ML techniques to automate repetitive annotation and data creation workflows.

Requirements
• 2+ years of experience as a data scientist.
• 3+ years of experience with data querying languages (e.g., SQL), scripting languages (e.g., Python), or statistical/mathematical software (e.g., R, SAS, Matlab).
• PhD in a language and human behavior related field with a strong quantitative component (e.g., Cognitive Linguistics, Sociolinguistics, Human-Computer Interaction), or a Master's degree with 3+ years of field experience.
• Experience in data mining and cleaning for NLP machine learning model pipelines.
• Experience in language data collection for quantitative analysis, including guidelines and workflow design.
• Experience in research and experimental design involving human participants.
• Experience in statistical measures for data quality assessment and research hypothesis testing.
• Practical knowledge of data labeling tools and techniques (e.g., Amazon SageMaker Ground Truth, brat, ELAN).
• Excellent knowledge of semantics, pragmatics, conversation analysis, and/or discourse analysis.
• Ability to explain complex concepts and solutions in easy-to-understand terms.

Nice-to-haves
• Experience with LLMs and prompt engineering techniques and other programmatic approaches to annotation, including weak supervision and active learning.
• Practical knowledge of version control systems (e.g., Git).
• Experience with spoken data collection, speech analysis, speech transcription (from scratch or ASR-assisted).
• Experience working with clinical or medical data, such as medical transcriptions, clinical notes, or electronic health records (EHRs).
• Knowledge of healthcare terminology and medical ontologies (e.g., SNOMED CT, ICD, RxNorm).

Benefits
• Medical, financial, and/or other benefits including equity and sign-on payments.
• Flexible working culture to support work-life balance.
• Mentorship and career growth resources.
• Employee-led affinity groups fostering a culture of inclusion.
