Speech To Text (STT)

The STT is pretty simple, as it consists of three steps: activation, acquisition, and translation. Activation can be accomplished via a “key press”, but I would much rather use voice activation. Assuming you live in a normally quiet environment, it is perfectly practical (and easy) to calculate the root mean square (RMS) amplitude of the microphone input and activate once it exceeds a given threshold. You can set the threshold by acquiring a distribution of RMS values and placing it a few standard deviations above the mean, or you can just choose a number. Either way, you can look at typical RMS values for your given mic/environment using the following:

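import audioop
import pyaudio

chunk = 1024
rms = []
# Open the input stream once and reuse it for every reading
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,channels=1,rate=44100,input=True,frames_per_buffer=chunk)
for i in range(0,100):
	data = stream.read(chunk)
	rmsTemp = audioop.rms(data,2)  # 2 = bytes per sample for paInt16
	rms.append(rmsTemp)
	print(rmsTemp)
stream.stop_stream()
stream.close()
p.terminate()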
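
If you want to go the "distribution" route rather than picking a number, here is a minimal sketch of one way to do it. It is not the code I run: the function names are made up for illustration, the little-endian sample layout is an assumption, and the choice of k = 3 standard deviations is arbitrary. The first function computes roughly what audioop.rms(data, 2) gives you (as a float instead of an integer), and the second turns a list of quiet-room RMS readings into a threshold.

```python
import math
import statistics
import struct


def rms_int16(data):
    """RMS of a buffer of signed 16-bit samples (little-endian assumed)."""
    count = len(data) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, data[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / float(count))


def threshold_from_noise(rms_values, k=3.0):
    """Threshold = mean + k standard deviations of the background readings.

    k is a tuning knob (an assumption, not a recommendation): a larger k
    means fewer false activations but you have to speak louder.
    """
    mean = statistics.mean(rms_values)
    spread = statistics.pstdev(rms_values)
    return mean + k * spread


if __name__ == "__main__":
    # Hypothetical background readings, just to show the call
    quiet_room = [90, 110]
    print(threshold_from_noise(quiet_room, k=2.0))
```

Feed it the rms list collected by the loop above and use the result as your threshold.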

I’ve set my threshold to 1050 (an arbitrary value; you should find your own). Now the first major subroutine of the AI can be set up: the listening function. It essentially runs forever, so it’s nice to run it as a thread (which may be needed later). This is the basic code for the activation:

import audioop
import pyaudio
import traceback

def listenToSurroundings(threadName):
	try:
		print("Started listening on thread %s" % threadName)
		chunk = 1024
		volumeThreshold = 1050

		while True:
			print("Starting listening stream")
			rmsTemp = 0
			p = pyaudio.PyAudio()
			stream = p.open(format=pyaudio.paInt16,channels=1,rate=16000,input=True,frames_per_buffer=chunk)

			# Keep reading chunks until one is louder than the threshold
			while rmsTemp < volumeThreshold:
				data = stream.read(chunk)
				rmsTemp = audioop.rms(data,2)

			# Release the microphone so arecord can open it
			stream.stop_stream()
			stream.close()
			p.terminate()
			# getUsersVoice is defined further down
			output = getUsersVoice(5)
	except Exception:
		print(traceback.format_exc())

The try/except block catches any error and prints the traceback, which is especially useful during the debug stage.
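
Since I mentioned running the listener as a thread, here is a minimal sketch of how that could look with the standard threading module. Everything in it is hypothetical: listen_stub stands in for listenToSurroundings so the example runs without a microphone, and the stop event is an extra I added so the loop can be shut down cleanly (the real function only takes threadName).

```python
import threading
import time


def listen_stub(thread_name, stop_event, log):
    """Hypothetical stand-in for listenToSurroundings: loops until told to stop."""
    log.append("Started listening on thread %s" % thread_name)
    while not stop_event.is_set():
        time.sleep(0.01)  # the real function would block on stream.read() here
    log.append("Stopped")


def start_listener(target, thread_name):
    """Start target in a background thread and return handles to control it."""
    stop_event = threading.Event()
    log = []
    worker = threading.Thread(target=target, args=(thread_name, stop_event, log))
    worker.daemon = True  # do not keep the program alive just for the listener
    worker.start()
    return worker, stop_event, log


if __name__ == "__main__":
    worker, stop_event, log = start_listener(listen_stub, "listener-1")
    time.sleep(0.05)
    stop_event.set()
    worker.join(2)
    print(log)
```

Marking the thread as a daemon means the program can exit even if the listener is still waiting for sound, which seems like the right default for an endless loop.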

The acquisition and translation stages are handled in another subroutine, getUsersVoice. The code is pretty simple: it first beeps to signal that acquisition has begun, then uses arecord to record audio for a given amount of time, and beeps again when finished. It then sends the recorded audio to the Google Speech API. For this last step I use a separate bash file, parseVoiceText.sh, just because there are so many quotation marks. Here is the code:

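import os

def getUsersVoice(speakingTime):
	# Beep to signal that recording has started
	os.system("mpg123 -a hw:YOURALSAPLAYBACK YOURBEEPSOUND.mp3 > /dev/null 2>&1 ")
	# Record for speakingTime seconds and encode the audio to FLAC at 16 kHz
	os.system("arecord -D plughw:YOURALSARECORDING -f cd -t wav -d %d -r 16000 | flac - -f --best --sample-rate 16000 -o out.flac > /dev/null 2>&1 " % speakingTime)
	# Beep again to signal that recording has finished
	os.system("mpg123 -a hw:YOURALSAPLAYBACK YOURBEEPSOUND.mp3 > /dev/null 2>&1 ")
	# Send out.flac to the speech API; the recognized text ends up in txt.out
	os.system("./parseVoiceText.sh ")
	output = ""
	with open('txt.out','r') as f:
		output = f.readline()
	# Drop the leading quote and the trailing quote plus newline
	theOutput = output[1:-2]
	print("output:")
	print(theOutput)
	return theOutput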
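
As an aside, the recording step can also be written with subprocess and argument lists instead of one long os.system string, which avoids quoting headaches. This is only a sketch under assumptions: the function names are mine, the flags are copied unchanged from the command above, and "plughw:1,0" in the usage line is a made-up device name that you would replace with your own.

```python
import os
import subprocess


def build_record_pipeline(device, seconds, out_path="out.flac"):
    """Return the arecord and flac argument lists, same flags as the shell version."""
    arecord_cmd = ["arecord", "-D", device, "-f", "cd", "-t", "wav",
                   "-d", str(int(seconds)), "-r", "16000"]
    flac_cmd = ["flac", "-", "-f", "--best", "--sample-rate", "16000",
                "-o", out_path]
    return arecord_cmd, flac_cmd


def record(device, seconds, out_path="out.flac"):
    """Pipe arecord into flac and return flac's exit code."""
    arecord_cmd, flac_cmd = build_record_pipeline(device, seconds, out_path)
    with open(os.devnull, "w") as devnull:
        rec = subprocess.Popen(arecord_cmd, stdout=subprocess.PIPE, stderr=devnull)
        enc = subprocess.Popen(flac_cmd, stdin=rec.stdout,
                               stdout=devnull, stderr=devnull)
        rec.stdout.close()  # so arecord gets SIGPIPE if flac exits early
        enc.wait()
        rec.wait()
        return enc.returncode


if __name__ == "__main__":
    # Only prints the commands; call record(...) on a machine with ALSA and flac
    print(build_record_pipeline("plughw:1,0", 5))
```

A nonzero return value from record would be a hint that the device name is wrong or that flac is not installed.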

And the bash file:

#!/bin/bash
# parseVoiceText.sh
wget -O - -o /dev/null --post-file out.flac --header="Content-Type: audio/x-flac; rate=16000" "http://www.google.com/speech-api/v1/recognize?lang=en" | sed -e 's/[{}]//g' | awk -v k="text" '{n=split($0,a,","); for (i=1; i<=n; i++) print a[i]; exit }' | awk -F: 'NR==3 { print $3; exit }' > txt.out
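
To make the sed/awk chain less cryptic, here is a rough Python sketch of what it does to the response, step by step. The function names are made up, and the sample response in the usage lines is invented purely for illustration (I am not claiming it is the exact reply format); the point is only to show which piece of the line survives.

```python
import re


def extract_text(response_line):
    """Mimic the pipeline: strip braces, split on commas, take the third
    piece, then take the third colon-separated field of that piece."""
    no_braces = re.sub(r"[{}]", "", response_line)   # sed 's/[{}]//g'
    pieces = no_braces.split(",")                    # first awk: split on ","
    if len(pieces) < 3:
        return ""
    fields = pieces[2].split(":")                    # second awk: NR==3, -F:
    if len(fields) < 3:
        return ""
    return fields[2]                                 # still wrapped in quotes


def strip_quotes(line_from_file):
    """Same slice getUsersVoice applies to the line read from txt.out."""
    return line_from_file[1:-2]


if __name__ == "__main__":
    # Invented sample line, for illustration only
    sample = '{"a":0,"b":"x","c":[{"d":"hello world","e":0.9}]}'
    quoted = extract_text(sample)
    print(strip_quotes(quoted + "\n"))
```

One consequence of splitting on commas and colons is that recognized text containing either character would get cut short, so a proper JSON parser would be the sturdier choice if that ever becomes a problem.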

One thing I haven’t covered yet is processInput(). That’s going to be the main function for handling events. I am currently fleshing it out and will post back when I have more on it.
