Imagine having a personal AI assistant powered by Google’s Gemini running on a Raspberry Pi. With the Pi listening to your voice and an advanced LLM in the cloud, you can ask questions or control devices and get spoken answers back. In this tutorial we’ll show you how to set up speech-to-text, send your queries to the Gemini API (a large language model), and speak the responses aloud. By the end, your Pi will answer questions through a speaker using the Gemini LLM. This approach pairs a compact Raspberry Pi with cloud AI, offering more privacy and control than commercial assistants. As one hobbyist noted, “I tried this on a Pi 4 and it worked great. The Gemini integration responded naturally to my voice commands.”
Requirements & Parts List
First, gather the hardware and software you’ll need. Here’s a quick checklist:
- Raspberry Pi board (Pi 4 or Pi 5 with 2-8GB RAM). A Pi 4/5 with 2+ GB RAM and 64-bit OS is recommended (32-bit builds may run into memory limits).
- Power supply and microSD card. A 5V/3A USB-C supply (the Pi 5 officially recommends 5V/5A) and a 16+ GB SD card with Raspberry Pi OS (64-bit).
- USB microphone. A plug-and-play USB mic or a microphone HAT ensures clear speech capture.
- Speaker or headset. A USB speaker or a 3.5mm speaker for audio output (the Pi 4’s headphone jack can also work).
- (Optional) Camera module. A Pi Camera can enable vision-based triggers or image queries. For example, one project used the camera to detect a person to start the conversation.
- (Optional) Push-button or LED. For a wake button or status indicator. (Instead of a wake word, you could press a button to start listening.)
On the software side:
- Raspberry Pi OS (64-bit) or similar Linux. We’ll use Raspberry Pi OS (formerly Raspbian) 64-bit on a Pi 4/5.
- Python 3.10+ and required libraries. We’ll use Python to glue everything together. Key packages include `google-generativeai` (the Gemini SDK), `SpeechRecognition` for speech-to-text, `gTTS` or similar for text-to-speech, and audio libraries (`PyAudio`, `pygame`, etc.). For example, one Gemini assistant project installs them with:
pip3 install google-generativeai speechrecognition gtts pygame gpiozero
- Google Gemini API key. You’ll need a free API key from Google AI Studio. See Using Gemini API keys for details.
- Microphone and speaker drivers. Make sure your Pi recognizes the USB audio devices.
- (Optional) VNC/SSH client. For remote setup, install an SSH or VNC client on your PC.
Here we see a Pi 4 with a camera and USB speaker. This example hardware setup will let us both capture video (if we use it for triggers) and play the assistant’s voice answers. For more on Pi hardware projects, check our Weather Station Project guide (which covers GPIO wiring). You’ll notice the microphone (either USB or HAT) must be plugged in and tested (see Setup below). The speaker should be connected and set as the audio output in Raspberry Pi OS.
Raspberry Pi Setup
Flash the OS and update. Use Raspberry Pi Imager or Balena Etcher to write Raspberry Pi OS (64-bit) to the SD card. Enable SSH and Wi-Fi (if headless) in the advanced options. Boot the Pi with monitor/keyboard or headless SSH. Then run:
sudo apt update && sudo apt upgrade -y
This ensures the latest packages. One guide notes that the Pi 4 should run a 64-bit OS for heavy tasks to avoid memory errors.
Configure interfaces and audio. Run `sudo raspi-config`. Under Interface Options, enable Audio and (if using) the Camera. Reboot if needed. Then check the audio devices:
arecord -l # list recording devices (microphone)
aplay -l # list playback devices (speaker)
If your mic isn’t listed, install the ALSA tools with `sudo apt install -y alsa-utils` and use `alsamixer` to unmute the mic input. Similarly, test the speaker output:
speaker-test -t wav -c 1
You should hear a short spoken test sample announcing the channel (use `-t pink` instead if you prefer test noise). Adjust volumes with `alsamixer`. These steps are similar to other Pi voice projects. (In one Instructable, they use `arecord` to record and `aplay` to play a test file.)
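For a quick end-to-end check of both devices, you can record a short clip and play it back. The card/device numbers below are examples; use the ones reported by `arecord -l`:
arecord -D plughw:1,0 -d 5 -f cd test.wav   # record 5 seconds from card 1, device 0
aplay test.wav                              # play the recording back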
Install Python tools. Install Python and pip if not present:
sudo apt install -y python3 python3-pip python3-venv
Then create a virtual environment for isolation:
python3 -m venv ~/assistant-env
source ~/assistant-env/bin/activate
Now install the required libraries with pip. For example:
pip install google-generativeai speechrecognition gtts pygame
(You may need `python3-dev` or `libasound2-dev` if `PyAudio` is used by `speech_recognition`. If so, run `sudo apt install -y portaudio19-dev`.)
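With `PyAudio` in place, a quick way to confirm Python sees your microphone is to list the input devices it detects. A minimal sketch using `speech_recognition`’s built-in helper:
import speech_recognition as sr

# Print every audio input PyAudio can see, with its index
for index, name in enumerate(sr.Microphone.list_microphone_names()):
    print(index, name)

# If the default device is wrong, pass the index explicitly, e.g.:
# mic = sr.Microphone(device_index=2)  # index 2 is just an example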
Set the Gemini API key. Open your `~/.bashrc` (or `~/.zshrc`) and add:
export GEMINI_API_KEY="<YOUR_API_KEY_HERE>"
Save and run `source ~/.bashrc` to apply. The Gemini SDK will automatically pick up `GEMINI_API_KEY` (or `GOOGLE_API_KEY`) from the environment, so you only need to do this once. If you prefer, you can also pass the key directly in code (see Google’s docs for examples), but an environment variable keeps the key out of your source files. For details, see the official guide to Gemini API keys.
(Optional) Additional setup. If you plan to use wake-word detection or local speech-to-text, you might install extras like `pvporcupine` or `faster-whisper` (a fast reimplementation of OpenAI’s Whisper). Also install `git` if you want to clone examples.
With these steps, your Raspberry Pi should be up-to-date and ready. Your microphone and speaker should be working, and Python has the necessary libraries. Next, we’ll implement the voice assistant logic.
Wake Word Detection and Audio Capture
We want the Pi to listen for a wake word or a button press before sending audio to the LLM. This avoids sending every sound to the cloud. A simple approach is to use a library like Porcupine or Snowboy for offline wake-word spotting. (Note: Snowboy is deprecated, but there are forks and alternatives.) These run on-device to detect a trigger phrase. For example, Picovoice’s free tier lets you define “hey Pi” or similar.
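To give a flavor of on-device wake-word spotting, here is a minimal Porcupine sketch. It assumes you have a free Picovoice access key and uses the built-in “porcupine” keyword; a custom phrase like “hey Pi” would be trained in the Picovoice Console and loaded via `keyword_paths` instead:
import struct
import pyaudio
import pvporcupine

# access_key is an assumption: paste your own key from the Picovoice Console
porcupine = pvporcupine.create(access_key="YOUR_PICOVOICE_KEY",
                               keywords=["porcupine"])  # built-in demo keyword

pa = pyaudio.PyAudio()
stream = pa.open(rate=porcupine.sample_rate, channels=1, format=pyaudio.paInt16,
                 input=True, frames_per_buffer=porcupine.frame_length)

print("Listening for wake word...")
while True:
    pcm = stream.read(porcupine.frame_length, exception_on_overflow=False)
    pcm = struct.unpack_from("h" * porcupine.frame_length, pcm)
    if porcupine.process(pcm) >= 0:  # returns the keyword index, or -1 if none
        print("Wake word detected!")
        break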
In this guide, to keep things simpler, we’ll demonstrate with a push-to-talk button or continuous listening (voice activation). For continuous capture, you can use Python’s `speech_recognition` library:
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say your command...")
    audio = r.listen(source)

command = r.recognize_google(audio)  # uses Google Web STT
print("You said:", command)
This listens until you stop speaking, then uses Google’s speech-to-text service (over the internet) to transcribe. (Alternatively, you could use open-source models like Whisper or Vosk for offline STT; a sketch follows below.) Be sure to test the microphone input first. One Instructable recommends recording with `arecord` as a test; we did that above.
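If you want to stay offline, `speech_recognition` can also run a local Whisper model. A minimal sketch, assuming you have installed the local backend (e.g. `pip install openai-whisper`); on a Pi, expect slow inference, so the “tiny” model is the realistic choice:
import speech_recognition as sr

r = sr.Recognizer()
with sr.Microphone() as source:
    audio = r.listen(source)

# Local Whisper transcription: no internet needed, but CPU-heavy on a Pi
command = r.recognize_whisper(audio, model="tiny")
print("You said:", command)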
The goal: “our assistant remains idle until it hears the wake word or a button press.” You could implement this by proceeding to query Gemini only if "assistant" appears in `command`, or simply always send the latest utterance. The key is capturing the user’s speech as text.
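For the push-to-talk variant, `gpiozero` makes the button trivial. A minimal sketch, assuming the button is wired between GPIO17 and ground:
import speech_recognition as sr
from gpiozero import Button

button = Button(17)  # assumption: button wired to GPIO17 and GND
r = sr.Recognizer()

while True:
    print("Press the button to talk...")
    button.wait_for_press()  # block until the button is pressed
    with sr.Microphone() as source:
        audio = r.listen(source)
    try:
        print("You said:", r.recognize_google(audio))
    except sr.UnknownValueError:
        print("Could not understand audio")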
Using Google Gemini API
With audio captured as text, the next step is to send it to Gemini and get a reply. Google provides the google-generativeai Python SDK. After installing it, use code like:
import os
import google.generativeai as genai

genai.configure(api_key=os.environ.get('GEMINI_API_KEY'))
model = genai.GenerativeModel("gemini-2.5-pro")  # or "gemini-2.0-flash"
response = model.generate_content("What is the capital of France?")
print(response.text)
The `configure` call reads your API key from the environment variable. You can choose models like `gemini-2.0-flash` or the newer `gemini-2.5-pro`. For simple Q&A, `generate_content` works well: we pass a prompt string and it returns an object whose `.text` attribute is the answer. For conversation, you could also use the chat interface (`model.start_chat()` and `chat.send_message()`; see the sketch after the next snippet), but single queries suffice for most voice commands.
query = "Hey Gemini, tell me a joke about computers."
answer = model.generate_content(query)
print("Gemini says:", answer.text)
Behind the scenes, this calls the Gemini API. Google’s official docs show similar code samples. For example, generating text:
import google.generativeai as genai

genai.configure(api_key="YOUR_KEY")
model = genai.GenerativeModel("gemini-2.5-pro")
response = model.generate_content("The opposite of hot is")
print(response.text)  # typically prints "cold"
This is how you integrate the LLM. The response can be a full sentence or paragraph. We then take `response.text` and send it to our TTS engine.
Check out more of our Raspberry Pi Project Ideas and their step-by-step guides.
Coding the Assistant: Speech‑to‑Text, LLM, Text‑to‑Speech
Below is a simplified outline of the Python code that ties everything together. You might put this in `assistant.py` or similar.
import os
import google.generativeai as genai
import speech_recognition as sr
from gtts import gTTS
import pygame

# Initialize Gemini
genai.configure(api_key=os.environ['GEMINI_API_KEY'])
model = genai.GenerativeModel("gemini-2.5-pro")

# Listen and transcribe
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Listening...")
    audio = r.listen(source)  # listen to the mic

try:
    command = r.recognize_google(audio)  # use Google Web STT
    print("You said:", command)
except sr.UnknownValueError:
    print("Could not understand audio")
    command = ""

# Send to Gemini and get response text
if command:
    response = model.generate_content(command)
    answer = response.text
    print("Assistant:", answer)

    # Convert text to speech
    tts = gTTS(answer)
    tts.save("reply.mp3")
    pygame.mixer.init()
    pygame.mixer.music.load("reply.mp3")
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        pass  # wait until done speaking
This script does the following:
- Speech-to-text: Uses `speech_recognition` (with the Google STT engine) to convert the mic audio into a Python string. In practice, `recognize_google()` requires internet, but you can replace it with an offline model if desired.
- Call Gemini: We call `model.generate_content(command)` to get the AI’s reply. No explicit “system prompt” is shown here, but Gemini typically assumes the role of a helpful assistant. You can prepend instructions if needed.
- Text-to-speech: We use `gTTS` (Google Text-to-Speech) to create an MP3 from the reply text, then play it with `pygame`. You could also use local engines like `espeak` or `pyttsx3` (see the offline sketch below). Many Gemini projects use `gtts` because it sounds quite natural.
Notice we initialized `pygame.mixer` to play audio; the `while pygame.mixer.music.get_busy()` loop waits until speaking is finished. You could replace this with any method of playing an audio file on the Pi (e.g. calling `mpg123 reply.mp3`).
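If you’d rather skip the network round-trip of `gTTS` entirely, here is a minimal offline alternative using `pyttsx3`, which drives eSpeak on Raspberry Pi OS (the voice is more robotic, but fully local):
import pyttsx3

engine = pyttsx3.init()          # uses the eSpeak backend on Raspberry Pi OS
engine.setProperty("rate", 160)  # speaking speed in words per minute
engine.say("Hello from your Raspberry Pi assistant.")
engine.runAndWait()              # block until speech finishes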
This code is a basic template. In a real assistant, you’d wrap the listening and response in a loop, add error handling, and maybe a wake-word loop. But the above shows the core idea. For reference, one GitHub example of a Gemini voice assistant on Pi lists similar imports and setup.
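As a rough illustration of that structure, here is a minimal loop sketch. `listen()` and `speak()` are hypothetical helpers wrapping the STT and TTS snippets shown earlier:
# listen() returns the transcribed command (or ""); speak(text) plays the reply.
# Both are hypothetical helpers built from the snippets above.
def run_assistant():
    while True:
        command = listen()  # STT: microphone -> text
        if not command:
            continue        # nothing understood; keep listening
        try:
            response = model.generate_content(command)
            speak(response.text)      # TTS: text -> speaker
        except Exception as err:      # a network/API hiccup shouldn't kill the loop
            print("Gemini request failed:", err)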
Deployment: Auto-start and Enclosures
Once your voice assistant code is working, you’ll want it to run automatically on boot and to package it neatly. Here are some tips:
- Auto-start on boot: Create a `systemd` service or use `crontab`. For example, save your script as `/home/pi/assistant.py` and add a cron job:
crontab -e
@reboot /usr/bin/python3 /home/pi/assistant.py &
This runs the Python script on startup. Alternatively, write a `*.service` file in `/etc/systemd/system` and enable it with `systemctl enable` (see the example unit file after this list).
- Case/Enclosure: Put the Pi and audio gear in a proper case. A simple plastic case with openings for USB and the camera is fine. Make sure the microphone isn’t obstructed and the speaker’s sound can exit. For example, a “RasTech” case or any Pi box works. Optionally mount the button on the case, and add LED indicators (driven via `gpiozero`) to show when the assistant is listening or speaking.
- Stability: Use a good power supply (especially if using the camera and USB audio); unstable power can cause reboots under load. A headless assistant needs little GPU memory, so on a Pi 4/5 you can set the memory split to 16-32 MB in `raspi-config` (under Performance) to leave more RAM for your code.
- Networking: Ensure the Pi is online at boot (Gemini, like any cloud API, needs a connection). If using Wi-Fi, configure `wpa_supplicant.conf`, or prefer Ethernet.
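For the systemd route mentioned above, here is an example unit file; the paths, user, and service name are assumptions, so adjust them to your setup:
# /etc/systemd/system/assistant.service
[Unit]
Description=Gemini voice assistant
After=network-online.target
Wants=network-online.target

[Service]
User=pi
Environment=GEMINI_API_KEY=your_key_here
ExecStart=/home/pi/assistant-env/bin/python /home/pi/assistant.py
Restart=on-failure

[Install]
WantedBy=multi-user.target
Enable it with `sudo systemctl enable --now assistant.service`; systemd will then restart the assistant automatically if it crashes.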
By following these steps, the assistant script will launch on startup and run headless. You can then simply speak to it after power-on. For a polished product, attach the Pi and speaker to a dedicated enclosure and label the button (if used) with “Talk” or the wake phrase.
FAQs
My microphone isn’t detected. What should I check?
Check with `arecord -l`. If no input shows, ensure the mic is plugged in properly, install `alsa-utils`, and unmute it in `alsamixer`. Also verify the power supply is sufficient; an underpowered Pi can disable USB devices.
Gemini returns an authentication error. What’s wrong?
Ensure your API key is active in Google AI Studio. Store it as an environment variable (`export GEMINI_API_KEY=your_key`) and restart the shell or script.
The assistant doesn’t speak. How do I debug the audio output?
Confirm the speaker works via `aplay test.wav`. Check that `pygame.mixer.init()` ran successfully and that `reply.mp3` exists. The volume may need adjustment in `alsamixer`.
Why is speech recognition inaccurate?
Poor mic quality, background noise, or a wrong device index may cause it. Test with `arecord`. Also check your `recognize_google()` quota; replace it with Vosk or Whisper for offline STT if needed.
How do I start the assistant automatically on boot?
Use `crontab -e` and add `@reboot /usr/bin/python3 /home/pi/assistant.py &`. Alternatively, create a `systemd` service with your environment variables for stable background execution.