Alexa-Gemini Bridge

Introduction

We have a couple of Amazon's Echo devices at home, which are useful for a lot of things - turning the lights on and off, setting timers, adding things to our Bring shopping list (which I strongly endorse btw), turning my electric blanket on and off in the winter months, and so on. They're pretty nifty!

Compared to the conversational capabilities of ChatGPT, Claude, Gemini & co, however, Alexa's intelligence in this area leaves a lot to be desired. Alexa Plus is coming soon (and looks pretty cool!), but there's no word on when it will be made available in Germany (where I live), so I didn't want to wait for that. Another option would have been to buy some Google Nest smart speakers, but I already have some Echos, and I don't want two competing smart home systems at home.

So I decided to hack a solution together. This took me about an hour, although I could never have done it without help from Claude and ChatGPT - I'd never created a Node.js app before (though I had played with them a little bit), or an Alexa Skill. But now I've done both!

Overview

This project creates a bridge between Amazon Alexa and Google's Gemini AI model. It provides a custom Alexa skill that captures user queries, sends them to the Gemini AI API, and returns the responses back to the user through Alexa.

Key features:

Child-safe responses (optimized for 7-year-old level understanding)
Handles various question types through a flexible Alexa interaction model
Cleans and formats Gemini responses for optimal speech delivery
Deploys easily with Fly.io

How It Works

User Interaction: The user activates the Alexa skill and asks a question
Alexa Processing: The Alexa skill captures the user's query and sends it to this server
Gemini AI Query: The server forwards the query to Gemini API with appropriate instructions
Response Formatting: The Gemini response is cleaned and formatted for voice delivery
Alexa Response: The formatted answer is sent back to Alexa, which speaks it to the user
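
The steps above can be sketched roughly like this. This is a minimal illustration, not the project's actual server.js: the function names and the fallback wording are my assumptions, while the intent name (AskGeminiIntent) and slot name (Query) come from the skill setup described further down. The Gemini call is passed in as a function so the flow can be tried without network access.

```javascript
// Sketch only: names and fallback text are assumptions, not the real server.js.

// Step 2: pull the spoken query out of an Alexa IntentRequest envelope.
// Returns null for anything that is not an AskGeminiIntent with a Query value.
function extractQuery(alexaRequest) {
  const request = alexaRequest?.request;
  if (request?.type !== 'IntentRequest') return null;
  if (request.intent?.name !== 'AskGeminiIntent') return null;
  return request.intent.slots?.Query?.value || null;
}

// Step 5: wrap plain text in the response envelope Alexa expects.
function buildAlexaResponse(speechText, endSession = true) {
  return {
    version: '1.0',
    response: {
      outputSpeech: { type: 'PlainText', text: speechText },
      shouldEndSession: endSession,
    },
  };
}

// Steps 2 to 5 in one place. `askGemini` is any async function that takes
// the query string and resolves to the (already cleaned) answer text.
async function handleAlexaRequest(body, askGemini) {
  const query = extractQuery(body);
  if (!query) return buildAlexaResponse("Sorry, I didn't catch a question.");
  const answer = await askGemini(query);
  return buildAlexaResponse(answer);
}
```

In an Express app, something like this would sit inside the POST handler for the skill endpoint, with the result passed to res.json().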

Technical Architecture

User → Alexa Skill → Alexa-Gemini Bridge → Gemini AI API → Response

The application is built using:

Node.js
Express
Axios for API calls
dotenv for environment variable management
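
To give an idea of what the Gemini leg of that chain involves, here is a hedged sketch of building the request and reading the reply. The model name, the prompt wording and both function names are assumptions for illustration (check the current Gemini API docs for available models); the endpoint shape and the x-goog-api-key header follow Google's public REST API.

```javascript
// Assumptions: model name, prompt text and function names are illustrative.
const GEMINI_MODEL = 'gemini-2.0-flash';

// Build the URL, headers and JSON body for a generateContent call.
function buildGeminiRequest(query, apiKey, model = GEMINI_MODEL) {
  return {
    url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`,
    headers: { 'Content-Type': 'application/json', 'x-goog-api-key': apiKey },
    data: {
      // System instruction: where child-safe, voice-friendly rules can live.
      system_instruction: {
        parts: [{
          text: 'Answer briefly in plain spoken language that a 7-year-old ' +
                'can understand. Keep everything child-safe. No markdown.',
        }],
      },
      contents: [{ role: 'user', parts: [{ text: query }] }],
    },
  };
}

// Join the text parts of the first candidate; null if there is no text.
function extractGeminiText(responseData) {
  const parts = responseData?.candidates?.[0]?.content?.parts;
  if (!Array.isArray(parts)) return null;
  return parts.map((p) => p.text || '').join('').trim() || null;
}

// Usage with Axios (not executed here):
//   const { url, headers, data } = buildGeminiRequest(q, process.env.GEMINI_API_KEY);
//   const res = await axios.post(url, data, { headers });
//   const answer = extractGeminiText(res.data);
```

Keeping the request builder separate from the HTTP call makes it easy to check the payload without spending any API quota.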

Setup and Installation

Prerequisites

Node.js (v14 or higher)
An Amazon Developer account
A Google Cloud account with Gemini API access (an API key is required; Google is currently very generous with its API, and you can get substantial free usage)
A Fly.io account

Local Installation

Clone the repository:

git clone https://github.com/thecraigd/alexa-gemini-bridge.git
cd alexa-gemini-bridge

Install dependencies:
```
npm install
```
Create a .env file with your Gemini API key:
```
GEMINI_API_KEY=your_gemini_api_key_here
```
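
If the key is missing, it's nicer to find out at startup than on the first Alexa request. A small sketch of such a check follows; requireEnv is a hypothetical helper, not something the project necessarily contains. With dotenv, calling require('dotenv').config() first loads the .env file into process.env.

```javascript
// Hypothetical helper: throw early if a required variable is unset or blank.
function requireEnv(name, env = process.env) {
  const value = env[name];
  if (typeof value !== 'string' || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value.trim();
}

// Usage (after require('dotenv').config()):
//   const apiKey = requireEnv('GEMINI_API_KEY');
```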
Start the server:
```
node server.js
```
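
To poke at the running server without an Echo, you could send it a hand-built request. The sketch below is an assumption-heavy example: it guesses that the server listens on port 3000, it uses the /alexa-gemini path from the endpoint setup below, and it relies on the built-in fetch in Node.js 18 or later. If the server verifies Alexa request signatures, a hand-built request like this will be rejected.

```javascript
// Trimmed-down IntentRequest envelope; a real Alexa request carries more
// fields (session, context, requestId, timestamp, ...).
function buildSampleIntentRequest(query, locale = 'en-US') {
  return {
    version: '1.0',
    request: {
      type: 'IntentRequest',
      locale,
      intent: {
        name: 'AskGeminiIntent',
        slots: { Query: { name: 'Query', value: query } },
      },
    },
  };
}

// Assumption: the local server listens on port 3000. Needs Node.js 18+.
async function postSample(query, baseUrl = 'http://localhost:3000') {
  const res = await fetch(`${baseUrl}/alexa-gemini`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildSampleIntentRequest(query)),
  });
  return res.json();
}

// Usage (not executed here):
//   postSample('why is the sky blue').then(console.log);
```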

Docker Deployment

The project includes a Dockerfile and a fly.toml for easy deployment to platforms like Fly.io:

Install the Fly.io CLI (shown here for macOS with Homebrew):

brew install flyctl

Log in to Fly.io (create an account first; you can sign up with email, GitHub or Google):

flyctl auth login

Create the application on Fly.io without deploying it yet:

flyctl launch --name your-application-name --no-deploy

Set your Gemini API key as a secret in Fly.io:

flyctl secrets set GEMINI_API_KEY=your_gemini_api_key_here

Deploy the application:

flyctl deploy

Setting Up the Alexa Skill

TL;DR version:

Create a new skill in the Alexa Developer Console
Use the sample utterances provided in sample_utterances.txt
Create a custom intent named AskGeminiIntent with a Query slot of type AMAZON.SearchQuery
Configure the endpoint to point to your deployed instance of this server
Test and publish your skill

More detailed version:

Go to the Alexa Developer Console: https://developer.amazon.com/alexa/console/ask
Click "Create Skill".
Enter "Gemini Assistant" as the skill name.
Select "Custom" model.
Choose "Provision your own" for hosting.
Click "Create skill".
Choose "Start from scratch" template.
Click "Continue with template".

Set up the Invocation Name:

In the left sidebar, click on "Invocation".
Set the Skill Invocation Name to: gemini assistant.
Click "Save Model".

Create the Intent:

In the left sidebar, click on "Intents".
Click "Add Intent".
Create a custom intent named: AskGeminiIntent.
Add the sample utterances from sample_utterances.txt (note that each utterance must be added individually to be accepted). Treat these as WIP suggestions: once the skill is up and running, you may want to experiment and add more or change some of them.
Scroll down to "Slots".
Add a slot named Query with slot type AMAZON.SearchQuery.
Click "Save Model".
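
For reference, the invocation name, intent and slot from the steps above correspond to an interaction model along these lines. This is a sketch written as a JavaScript object so it can be printed as JSON; the sample list only contains the few utterances shown in this README, and including the three built-in intents with empty samples is my assumption about a sensible baseline.

```javascript
// Sketch of the interaction model the console steps produce.
const interactionModel = {
  interactionModel: {
    languageModel: {
      invocationName: 'gemini assistant',
      intents: [
        {
          name: 'AskGeminiIntent',
          slots: [{ name: 'Query', type: 'AMAZON.SearchQuery' }],
          // Only a handful of samples; see sample_utterances.txt for more.
          samples: [
            'ask {Query}',
            'tell me {Query}',
            'what is {Query}',
            'how to {Query}',
            'why {Query}',
          ],
        },
        // Assumption: keep the usual built-in intents alongside the custom one.
        { name: 'AMAZON.CancelIntent', samples: [] },
        { name: 'AMAZON.HelpIntent', samples: [] },
        { name: 'AMAZON.StopIntent', samples: [] },
      ],
    },
  },
};

// Print as JSON, e.g. for comparison with the console's JSON Editor view.
console.log(JSON.stringify(interactionModel, null, 2));
```

Every sample has some words in front of {Query}, because an AMAZON.SearchQuery slot cannot make up a whole utterance on its own.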

Set up the Endpoint:

In the left sidebar, click on "Endpoint".
Select "HTTPS" as the Service Endpoint Type.
Under "Default Region", enter your Fly.io URL with the correct path:
```
https://your-application-name.fly.dev/alexa-gemini
```
From the dropdown, select "My development endpoint is a sub-domain of a domain that has a wildcard certificate from a certificate authority".
Click "Save Endpoints".

Build and Test the Model:

Click "Build Model" in the top menu.
Wait for the build to complete (this may take a few minutes).
Once built, click on "Test" in the top menu.
Change the dropdown from "Off" to "Development".
In the Alexa simulator, type or say: open gemini assistant.
Then try asking a question like: tell me about the solar system.

Connect to Your Amazon Echo Device

Make sure your Echo device is set up and connected to the same Amazon account you used to create the skill.
Since your skill is in "Development" mode, it's automatically available on all your Echo devices linked to your Amazon developer account.
You can invoke it by saying: Alexa, open gemini assistant.
Then ask your question: What is the capital of France?

Sample Utterances

The skill supports natural language patterns like:

"ask {Query}"
"tell me {Query}"
"what is {Query}"
"how to {Query}"
"why {Query}"
And many more (see sample_utterances.txt in the GitHub repo)

An important note here is that Alexa Skills require an invocation word, like the ones above ("ask", "tell me", etc). However, if I say e.g. "what is quantum computing?", the query that actually gets sent to Gemini is simply "quantum computing". Gemini is usually smart enough to work out what I want and provide acceptable output, but sometimes this leads to odd behaviour.

In practice, I simply use "please..." before each request, because this feels right to me (I always thank my chatbots) and it means Gemini gets the whole query to process, without the invocation words at the beginning cut off.
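
A toy model makes the effect easier to see. The function below is not how Alexa's language understanding actually works; it is a simplified stand-in (with a made-up name) that assumes each sample utterance ends in {Query} and shows which part of the sentence would land in the slot.

```javascript
// Toy illustration only: mimics which words end up in the Query slot for a
// sample utterance of the form "<leading words> {Query}".
function captureQuery(template, utterance) {
  const carrier = template.replace('{Query}', '').trim().toLowerCase();
  const spoken = utterance.trim().toLowerCase().replace(/[?.!]+$/, '');
  if (!spoken.startsWith(carrier + ' ')) return null; // utterance doesn't fit
  return spoken.slice(carrier.length).trim();
}

// "what is" gets swallowed by the template:
console.log(captureQuery('what is {Query}', 'What is quantum computing?'));
// With "please" in front, the whole question survives:
console.log(captureQuery('please {Query}', 'Please what is quantum computing?'));
```

The first call yields just "quantum computing", while the second keeps "what is quantum computing", which is why the "please" prefix gives Gemini more to work with.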

Security Considerations

The Gemini API key is stored as an environment variable
Child-safe prompting is enforced through system instructions
Response content is cleaned of potentially problematic elements
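
As an illustration of the last point, a cleaning step for speech output might look like the sketch below. The exact rules are my assumptions rather than a copy of what the server does: it drops code blocks and URLs, strips markdown markers, and collapses whitespace so Alexa doesn't try to read symbols aloud.

```javascript
// Sketch of a response cleaner for voice output; the rules are assumptions.
function cleanForSpeech(text) {
  return String(text)
    .replace(/`{3}[\s\S]*?`{3}/g, ' ')          // drop fenced code blocks
    .replace(/\[([^\]]+)\]\([^)]+\)/g, '$1')    // keep link text, drop the URL
    .replace(/https?:\/\/\S+/g, ' ')            // drop bare URLs
    .replace(/^[ \t]{0,3}#{1,6}[ \t]+/gm, '')   // strip heading markers
    .replace(/^[ \t]*[-*•][ \t]+/gm, '')        // strip bullet markers
    .replace(/[*_`~]/g, '')                     // strip emphasis and code marks
    .replace(/\s+/g, ' ')                       // collapse whitespace and newlines
    .trim();
}

// Usage:
//   const speech = cleanForSpeech(geminiAnswer);
```

Cleaning like this only affects formatting. Keeping answers child-safe still depends on the system instructions sent to Gemini.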

License

This project is licensed under the MIT License.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.