Job Duties/Essential Functions
- Design, build, and test Databricks workflows and jobs for both batch and streaming workloads
- Design clinical data architecture and develop and implement frameworks, such as common data models (OMOP, CDISC, HL7 FHIR, etc.), and techniques to organize clinical data from disparate sources.
- Design a secure, cloud-based platform for acquiring and aggregating patient data for consumption by analytics workflows.
- Collaborate with clinical data experts to select and implement ontologies (ICD-10, SNOMED CT, RxNorm, etc.) and to translate between disparate ontologies (a crosswalk sketch follows this list).
- Implement repeatable techniques and methods for data transfer, pipeline testing, and platform infrastructure and management.
- Create extraction tools using CDC, APIs, and SDKs to pull data from source systems such as Medrio and other EDCs and hydrate the Lakehouse.
- Create SQL and PySpark notebooks and packages to facilitate the movement, cleaning, and storage of data (see the ingestion sketch after this list).
- Enhance the functionality and scalability of client services through technology innovation.
- Adhere to and promote high standards in testing and integration, including writing unit tests as well as integration and pipeline tests.
- Collaborate with other teams, including biological imaging, cancer biology, clinical data, and engineering.
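As a minimal illustration of the notebook work these duties involve, here is a hedged PySpark sketch of a streaming ingestion job that lands raw EDC extracts in a Delta table. All paths, schemas, and table names are hypothetical placeholders, not references to a real system.

```python
# Minimal sketch of a streaming ingestion notebook, assuming a Spark 3.x
# runtime with Delta Lake available. Paths and table names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Incrementally pick up new JSON extract files as they arrive in the landing zone.
raw = (
    spark.readStream
    .format("json")
    .schema("patient_id STRING, visit_date STRING, icd10_code STRING")
    .load("/mnt/raw/edc_extracts/")  # hypothetical landing path
)

# Light cleaning: normalize types and drop records with no patient identifier.
clean = (
    raw.withColumn("visit_date", F.to_date("visit_date"))
       .filter(F.col("patient_id").isNotNull())
)

# Write to a Delta table with checkpointing so the stream is restartable.
(
    clean.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/edc_clean")  # hypothetical
    .toTable("bronze.edc_visits")  # hypothetical target table
)
```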
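Translating between ontologies, as in the ICD-10/SNOMED duty above, often reduces to joining against a curated crosswalk. A hedged sketch, with hypothetical table and column names; a production mapping would come from a vetted source such as UMLS:

```python
# Sketch of mapping ICD-10 codes to SNOMED CT via a crosswalk table.
# Table and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical crosswalk with columns: icd10_code, snomed_code.
crosswalk = spark.table("reference.icd10_to_snomed")

visits_snomed = (
    spark.table("bronze.edc_visits")               # table from the sketch above
    .join(crosswalk, on="icd10_code", how="left")  # left join keeps unmapped rows
    .withColumn(
        "mapping_status",
        F.when(F.col("snomed_code").isNull(), F.lit("unmapped"))
         .otherwise(F.lit("mapped")),
    )
)
```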
Requirements/Qualifications
- Attention to detail
- Ability to work both independently and as part of a team
- Ability to design and develop robust, scalable cloud-based solutions
- Degree in Computer Science or related field desired
- 5-7 years of relevant experience
- Must have experience with one or more pipeline or orchestration tools (Databricks, Snowflake Snowpipe, Apache Airflow, SQL Server Integration Services, AWS Step Functions, Azure Synapse, etc.); Databricks is preferred.
- Must have experience working with SQL (any dialect) and be able to craft advanced queries.
- Understands REST APIs and how to use them to acquire and send data (a minimal example follows this list).
- Working understanding of Git and source control.
- Must have experience working with and implementing data lakes.
- Must have advanced ability in at least one major language, such as Python, PySpark, SQL, or Scala.
- Must have experience with clinical data, maintaining the clinical data lifecycle, and implementing common data models. Preference is given to candidates who have combined data from multiple disparate sources into a single ontology.
- Prior experience working with the OMOP CDM and FHIR preferred
- Prior experience working in biotech, clinical, healthcare, or life sciences preferred
- Experience with modern mathematical and statistical software desired; experience with AI/ML frameworks preferred
- No certifications or specializations required.
- Databricks or other Lakehouse platform (Fabric, Snowflake, GCP, AWS, etc.) certifications are preferred.
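For the REST API expectation above, the skill in practice is paging through an endpoint and landing the results for downstream processing. A minimal, hedged sketch in Python; the URL, parameters, and authentication scheme are hypothetical placeholders, not a real EDC vendor's API:

```python
# Sketch of acquiring data from a paginated, token-authenticated REST API.
# Endpoint, parameters, and field names are hypothetical placeholders.
import requests

def fetch_all_records(base_url: str, token: str) -> list[dict]:
    """Page through a hypothetical /records endpoint until it is exhausted."""
    records, page = [], 1
    while True:
        resp = requests.get(
            f"{base_url}/records",
            headers={"Authorization": f"Bearer {token}"},
            params={"page": page, "page_size": 500},
            timeout=30,
        )
        resp.raise_for_status()  # surface HTTP errors instead of silently looping
        batch = resp.json()
        if not batch:
            break  # an empty page signals the end of the data
        records.extend(batch)
        page += 1
    return records
```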