A SambaNova GenAI project on human nutrition.
Inspiration
The creator was inspired by the Tricorder from Star Trek, a multi-function device
able to scan, record, and analyze its immediate surroundings.
What it does
This web client accepts a wide range of inputs, including text and images, which it
uses to gather rich contextual information in the form of knowledge-graph triples
retrieved from a graph database. The client combines this context with the user's
prompt for multi-agent reasoning before sending the result to the SambaNova API for
processing. After receiving the API's output, the client performs additional
post-processing and finally renders the results on screen for the user.
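The flow above can be sketched end to end. All names here (`retrieve_triples`, `call_model`, `postprocess`) are illustrative stubs, not the project's actual functions; in the real client, `call_model` would be a request to the SambaNova API rather than the canned reply below.

```python
def retrieve_triples(query: str) -> list[tuple[str, str, str]]:
    """Stand-in for the graph-database lookup of relevant KG triples."""
    store = [
        ("spinach", "rich_in", "iron"),
        ("spinach", "contains", "oxalates"),
        ("iron", "is_a", "mineral"),
    ]
    return [t for t in store if any(term in t for term in query.lower().split())]

def build_prompt(user_prompt: str, triples: list[tuple[str, str, str]]) -> str:
    """Combine retrieved KG context with the user's question."""
    context = "\n".join(f"({s}, {p}, {o})" for s, p, o in triples)
    return f"Context triples:\n{context}\n\nQuestion: {user_prompt}"

def call_model(prompt: str) -> str:
    """Stub for the remote LLM call; returns a canned answer."""
    return "ANSWER: Spinach is rich in iron."

def postprocess(raw: str) -> str:
    """Strip the model's scaffolding before rendering."""
    return raw.removeprefix("ANSWER:").strip()

def answer(user_prompt: str) -> str:
    triples = retrieve_triples(user_prompt)
    return postprocess(call_model(build_prompt(user_prompt, triples)))
```

The point of the sketch is the ordering: retrieval happens before the model call, and post-processing happens before rendering.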
How we built it
The project began with a research-and-design phase that spanned several days and
ultimately settled on the domain of human nutrition. A review of existing and
emerging technologies identified the most promising candidates for the sprint. To
augment large language models (LLMs) and vision language models (VLMs) with
retrieval-augmented generation, the creator focused on modeling knowledge graph
(KG) data, schema, and queries. This effort resulted in automatically generated KG
triples, which formed the foundation of a global KG supporting graph-based
question answering (QA).
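A minimal in-memory index gives the flavor of how triples back graph-based QA; the project itself used a real graph database, and its triples were generated by an LLM rather than hand-entered as below.

```python
from collections import defaultdict

class TripleStore:
    """Toy subject-indexed triple store; a stand-in for a graph database."""

    def __init__(self) -> None:
        self.by_subject: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.by_subject[subject].append((predicate, obj))

    def neighbors(self, subject: str) -> list[tuple[str, str]]:
        """All (predicate, object) pairs for a subject, for QA context."""
        return self.by_subject.get(subject, [])

store = TripleStore()
for s, p, o in [("vitamin_c", "found_in", "citrus"),
                ("vitamin_c", "supports", "immune_system")]:
    store.add(s, p, o)
```

Answering "What is vitamin C found in?" then reduces to a `neighbors("vitamin_c")` lookup whose results are serialized into the prompt.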
After implementing text-only natural-language QA, the creator extended the client
with vision-capable models for image captioning and visual question answering.
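One way to bolt vision onto an existing text pipeline is to caption first and reuse the text QA path. Both functions below are hypothetical stubs, not the project's real VLM or QA code.

```python
def caption_image(image_bytes: bytes) -> str:
    """Stub for the VLM captioning call."""
    return "a bowl of spinach salad with sliced oranges"

def text_qa(question: str, context: str) -> str:
    """Stub for the existing text-only QA path."""
    return f"The image shows {context}."

def visual_qa(image_bytes: bytes, question: str) -> str:
    """Chain the VLM caption into the text QA pipeline."""
    caption = caption_image(image_bytes)
    return text_qa(question, caption)
```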
Challenges we ran into
- Limited context length on VLMs
- Aggressive rate limits on VLMs
- Modelling knowledge graph data and queries
Accomplishments that we’re proud of
- Getting LLM output to fit within the context-length constraints of VLMs
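One simple tactic for the accomplishment above is greedy packing: keep the highest-ranked context lines until a budget is exhausted. The character budget and ranking here are illustrative, not the project's actual values (real systems usually budget in tokens).

```python
def pack_context(ranked_lines: list[str], budget_chars: int) -> str:
    """Greedily keep ranked context lines within a character budget."""
    kept, used = [], 0
    for line in ranked_lines:
        cost = len(line) + 1  # +1 for the newline separator
        if used + cost > budget_chars:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)
```

Because the input is ranked, truncation drops the least relevant context first instead of cutting mid-stream.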
What we learned
- Integration of both VLMs and LLMs with semantic caching and graph-based retrieval-augmented generation is a non-trivial endeavor.
What’s next for Nutrition Tricorder
- Recognize toxins and hazardous compounds on ingredient lists and food labels
- Extend VLMs to process video input