1.1 Initiative Summary

Businesses looking to operationalize LLM-supported applications will benefit from cloud-based (private or public) LLM “as a service” (LLMaaS) platforms, which provide governance and scalability. Among their many features, data governance (primarily for unstructured text) will be a critical offering of these platforms, including the one from Blattner Technologies. This initiative will contribute to the development of an extensible end-to-end data governance framework covering external data ingestion, parallelized data preparation and analytics, and versioning.

1.2 Desired Outcomes

- Prototype innovative workflow-based capabilities for preparing unstructured text in a scalable, traceable, and intuitive manner for downstream LLM-related tasks, such as training and fine-tuning.

- Present the approach, challenges, solutions, and significant insights stemming from the effort to the broader company.

1.3 Core Skills Required

- Required skills:

o Fundamental LLM knowledge (e.g., prompt engineering, fine-tuning)

o NLP-based development (e.g., tokenization, embedding generation and operations, textification)

o Python development

o Experience with parallel and distributed systems and/or parallel computation libraries such as Spark, Dask, or RAPIDS

- Optional/preferred skills:

o Kubeflow

o Vector databases

o Experience with NLP libraries such as spaCy and gensim

1.4 Estimated Effort

- Full-time summer internship (40 hours/week)

- Depending on progress, work may continue part-time during the Fall semester (e.g., 10 hours/week)

1.5 Additional Information

This is a remote internship opportunity. You will work with summer mentors and report to the Chief Product Officer of BOSS AI. The group focuses on implementing LLMs “as a service” (LLMaaS), and team members bring a range of skills spanning enterprise software engineering, NLP, ML, and UX. You can expect to gain valuable experience in operationalizing LLMs and addressing critical security needs for language models.