It’s a well-known complaint of parents the world over these days: kids spend too much time on screens and devices, and not enough time engaged with the active creativity of their own minds. The evil entity called “Technology” has been largely blamed for this phenomenon. Is it possible for technology to be the answer to this problem rather than the culprit?
The answer may lie in Story Squad, an innovative new app for 8–12 year olds being developed by entrepreneurs Graig Peterson and Darwin Johnson. In this “gamified” creative writing and illustration app, users begin a weekly cycle of play by reading an excerpt from a book or longer story. They are encouraged to write - by hand, with a real pen and paper! - their own “side quest,” an imagined offshoot of the main story, born of their own creative little minds. They are also cued to create a drawing by hand, take photos of both pieces, and upload them to the app. These submissions are screened and analyzed behind the scenes (human review being required by law for this age group). Users are then grouped into teams of two, and pairs of teams, called squads, are matched to compete against one another based upon the assessed skill levels of their writing and drawing submissions. Each player and team strategically allocates a fixed budget of points among their submissions, wagering on which pieces they judge most likely to win user votes against the opposing team. Users then vote between pairs of writings and illustrations drawn from face-offs outside their own squad for the week. Once voting is completed, the results are revealed: whether each team won its weekly face-off, and which player(s) received points for the victory of their individual submission(s).
The beauty and genius of this concept is that while it is a tech-based activity that occurs through an app, the child users become motivated to really work their creative muscles in the way that Story Squad’s founders know is best for their young brains — offline. Interactions with the app itself are periodic and brief throughout the week, but the meat of the users’ time and effort takes place the old-fashioned way, integrating neuromuscular activity with abstract thinking. At the same time, the game-based, strategic-planning aspect of the activity solves a problem for parents: motivating these kids to spend their recommended time reading while also reducing screen time.
Of course, accomplishing this great vision is no small technological feat. For the past four weeks, I have been part of a team at Lambda School working to advance the app’s development, bringing it a few steps closer to its founders’ vision. As part of the data science team, the primary tasks that we took on during this short period were:
1. Provide possible solutions for the optical character recognition (OCR) problem: taking the uploaded image of a user’s text submission and accurately transcribing it for analysis, so that a writing score can be assigned according to the complexity of the writing. This OCR task is currently performed by the Google Vision API. Other options bear researching, for the sake of both accuracy and cost optimization: using pytesseract — the Python wrapper for the open-source Tesseract OCR engine, whose development has long been sponsored by Google — or training a new neural network model.
2. Develop a classification and scoring system for the users’ illustration submissions.
3. Develop an integrated “squad score,” which the app will use behind the scenes to pair up users into teams and teams into squads in a fair way based upon skill levels.
I will focus here on point #1 above, as this was where my specific contribution lay. This is a particularly difficult problem, even as OCR problems go, due to the need to recognize and correctly transcribe children’s handwriting, as well as to deal with the varying quality of photos that may be uploaded.
Addressing The Need for “Realistic” Training Data
For any OCR model that is to be used for transcription of our writing samples, a set of images is needed. These images will be used to train a new model, or to test a model that already exists. Testing becomes particularly important because, as we will see, these images require pre-processing for optimal transcription results. Choosing and configuring the best types of preprocessing for the particular use case turns out to be a key part of making the best use of an already existing tool such as Tesseract.
All of this work was performed in a Google Colab notebook. Since I knew I would be using pytesseract, I needed to get this set up to work in Google Colab. Pytesseract needs to be installed, as well as the underlying tesseract-ocr engine that it wraps. I initially hit a snag: I was not able to call pytesseract even after the installation and import. After some investigation, I found that after installing pytesseract and importing it into Colab, the path to the underlying tesseract binary has to be assigned to pytesseract.pytesseract.tesseract_cmd before the library can be called.
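The wiring-up step looks roughly like the sketch below. The apt/pip commands and the /usr/bin/tesseract path are typical for Debian-based Colab runtimes, but are assumptions that should be verified on your own runtime:

```python
import shutil

# Install steps (run in Colab cells before this):
#   !pip install pytesseract
#   !apt install tesseract-ocr
#
# Locate the installed tesseract binary; fall back to the usual Colab path.
tesseract_path = shutil.which('tesseract') or '/usr/bin/tesseract'

# Then point the wrapper at the binary so pytesseract calls succeed:
#   import pytesseract
#   pytesseract.pytesseract.tesseract_cmd = tesseract_path
```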
Finally pytesseract was ready to use — but first, we need something to feed it in order to test its transcription capabilities on handwriting.
To approach the problem of generating data that would be “realistic” relative to handwriting samples in terms of variations in image quality and lighting, I decided that I would need to generate images from a known message in a given font, and then also create several different versions of each using different shadings, rotations, and zoom levels. To create the initial images, I began by searching the Google Fonts GitHub repository for fonts that seemed to me to mimic adolescent handwriting as closely as possible. I downloaded .ttf files for the five best candidates in my estimation. They were nicely varied, looked like handwriting of different styles, and had fun names like Indie Flower, Waiting for the Sunrise, and Just Me Again Down Here.
Next I wanted to write a function that would serve to create as many different images as desired from different messages, in different fonts, and save each image as well as a desired number of “augmentations,” or alterations of the image. I knew that I wanted to save these images in a persistent way instead of in Colab’s temporary memory, so I mounted my Google Drive to the notebook:
The image-generating function, called image_from_font, will take a number of default arguments, including a list of the desired fonts, and the path at which to save the generated images. These variables are set here:
Here is the function:
    def image_from_font(msg, fontfiles=fonts, outfile_type='.jpg',
                        img_size=(600, 300), bg_color='lightgrey',
                        font_size=14, pos=(50, 50), text_color='rgb(0,0,0)',
                        aug_iters=10, img_out_path=IMG_OUT_PATH):
        """
        Produces a set of image files with a given message, in each of
        a list of specified fonts, and with some image augmentation
        methods - rotation, zoom, and brightness variations.
        Each image is output to a file named with the start of the
        message, the font name, and the augmentation iteration number.
        The final image produced is also displayed as a sample.
        These images can be used to generate data for testing and/or
        training of any OCR model.

        Inputs: msg = the message to be drawn on the image
                fontfiles = a list of font files in .ttf format
                outfile_type = the extension to add to the image file,
                    which will determine the file type: .jpg, .png,
                    .gif, .bmp, or .tiff
                img_size = a tuple specifying the size of the image
                bg_color = background color of the image
                font_size = size of the font
                pos = a tuple specifying the starting position of the
                    message, where (0,0) is the top left corner
                text_color = color of the font text
                aug_iters = number of augmented variations of the
                    image to produce per font
                img_out_path = directory in which to save the images
        """
        for style in fontfiles:
            # draw a simple one-color image as our background
            img = Image.new('RGB', img_size, color=bg_color)
            # initialize the drawing context with
            # the image object as background
            draw = ImageDraw.Draw(img)
            # create font object with the font file and specify
            # desired size
            font_obj = ImageFont.truetype(style, size=font_size)
            # draw the message on the background
            draw.text(pos, msg, fill=text_color, font=font_obj)
            # format file names as a string,
            # incorporating font name and msg
            font_name = style.replace('.ttf', '')
            img.save(img_out_path + "/" + msg[0:10].replace(" ", "_") +
                     "_" + font_name + "_raw" + outfile_type)
            # convert to numpy array and expand
            # dimensions to a one-sample batch
            data = img_to_array(img)
            samples = expand_dims(data, 0)
            # create image data augmentation generator
            # ---> these ranges may require some adjustment
            # ---> to prevent cutting off some of the text
            datagen = ImageDataGenerator(rotation_range=15,
                                         zoom_range=[0.9, 1.1],
                                         brightness_range=[0.8, 1.2])
            # prepare iterator
            it = datagen.flow(samples, batch_size=1)
            # generate and save the augmented samples
            for i in range(aug_iters):
                # generate a batch of one image
                batch = it.next()
                # convert to unsigned integers for viewing
                image = batch[0].astype('uint8')
                # change augmented numpy array back to an image
                img_aug = Image.fromarray(image)
                # save augmented image to file
                img_aug.save(img_out_path + "/" +
                             msg[0:10].replace(" ", "_") +
                             "_" + font_name + "_aug" +
                             str(i) + outfile_type)
        # show the final image, as an example
        display(img_aug)
Here is an example of an image output from this function, using the Indie Flower font. The text is the first paragraph of The Call of the Wild, by Jack London. (Another side benefit of this project was that I was reminded what a great read this book is!)
We can run a first quick test by passing this image into tesseract as is, with no modifications. The code to do so (for the path where I saved my generated images) and its output, look like this:
Clearly not acceptable — in addition to all the mistakes, for some reason tesseract didn’t even see the first three lines of the text!
In an effort to optimize what tesseract can do, first I tried playing with the configuration of the page segmentation mode. Setting this psm to an integer between 0 and 13 is an optional argument within the call to pytesseract.image_to_string. The default is 3, which tells the OCR engine to segment the page automatically as it sees fit. I decided to try changing this to 6, as shown above, which means to assume we have a single uniform block of text. This seemed like the situation at hand, but the results were exactly the same. I guess page segmentation was not the problem here! Other options for the psm setting caused it to perform even worse on each of the fonts.
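For reference, the psm value is passed to pytesseract as part of a config string. A small helper makes the options explicit (tesseract_config is my own illustrative name, not part of the pytesseract API):

```python
def tesseract_config(psm=3, oem=3):
    """Build a config string for pytesseract.image_to_string.
    psm 3 = fully automatic page segmentation (the default);
    psm 6 = assume a single uniform block of text."""
    return f'--oem {oem} --psm {psm}'

# Hypothetical usage, with `img` a loaded image:
#   text = pytesseract.image_to_string(img, config=tesseract_config(psm=6))
```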
The next important thing to do to help the performance of our OCR tool was to pre-process the images in various ways to see what helps the most. I collected and prepared a few pre-processing functions to see which ones would be applicable in this case:
    # get grayscale of image
    def get_grayscale(image):
        return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # thresholding (Otsu's binarization)
    def thresholding(image):
        return cv2.threshold(image, 0, 255,
                             cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

    # dilation
    def dilate(image):
        kernel = np.ones((5, 5), np.uint8)
        return cv2.dilate(image, kernel, iterations=1)

    # erosion
    def erode(image):
        kernel = np.ones((5, 5), np.uint8)
        return cv2.erode(image, kernel, iterations=1)

    # canny edge detection
    def canny(image):
        return cv2.Canny(image, 100, 200)

    # skew correction
    def deskew(image):
        coords = np.column_stack(np.where(image > 0))
        angle = cv2.minAreaRect(coords)[-1]
        if angle < -45:
            angle = -(90 + angle)
        else:
            angle = -angle
        (h, w) = image.shape[:2]
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, angle, 1.0)
        rotated = cv2.warpAffine(image, M, (w, h), flags=cv2.INTER_CUBIC,
                                 borderMode=cv2.BORDER_REPLICATE)
        return rotated

    # template matching
    def match_template(image, template):
        return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
The most relevant ones turned out to be thresholding, noise removal, and deskewing. I applied these, starting with grayscaling for good measure. If an image has no color to begin with, grayscaling makes no difference; when it does, converting to a single channel is what later steps like Otsu thresholding expect to operate on.
This result is improved with pre-processing. This font, “Indie Flower,” produced moderate results like this. Pytesseract did much better with the font called “Chilanka”, and much worse with “Waiting for the Sunrise”. Still, this result is clearly not acceptable for use in the Story Squad application.
Regardless of the results in these test cases, genuine handwritten samples are more relevant to our use case. No matter how “handwriting-like” a font may be, there are a few characteristics of fonts that make them easier for an ML algorithm to work with than real handwriting: 1. With a font, each letter looks exactly the same every time it occurs; 2. There is uniform whitespace between letters and between words; and 3. Individual letters do not touch each other, unless that is part of the font’s design. None of these things is true of handwriting. This means that for any ML algorithm to be worth its salt at transcribing handwriting, it had best be trained on the particular style of handwriting on which you want it to perform.
To illustrate what a difficult time an off-the-shelf tool like pytesseract will have with handwriting, I obtained just a few samples to play with.
And pytesseract’s transcription:
Here’s a poignant statement from an 8-year-old girl:
… and pytesseract’s unfortunate attempt at transcribing this:
Even if you ignore all of the line-break characters thrown in, pytesseract clearly couldn’t make heads or tails of this writing. The first problem becomes apparent when you realize that in order to guess which letter it is looking at, the model first has to identify where the letters are. It does this by applying bounding boxes. These boxes can be visualized, and doing so is very telling of the difficulty here:
    # Plot character-level boxes on image using pytesseract.image_to_boxes()
    image = cv2.imread('/content/drive/MyDrive/Lambda/Labs/real_handwriting/girl_8.jpg')
    h, w, c = image.shape
    boxes = pytesseract.image_to_boxes(image)
    for b in boxes.splitlines():
        # each line is: character, x1, y1, x2, y2 (origin at bottom left)
        b = b.split(' ')
        image = cv2.rectangle(image, (int(b[1]), h - int(b[2])),
                              (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
    # reorder channels from OpenCV's BGR to RGB for matplotlib
    b, g, r = cv2.split(image)
    rgb_img = cv2.merge([r, g, b])
    plt.title("Girl's Handwriting age 8 WITH CHARACTER LEVEL BOXES")
    plt.imshow(rgb_img)
The lined paper was part of what caused a problem, and some image processing could be done to remove those lines. But for now, the more important point is the one that pertains to the better solution to this problem going forward.
Future Direction for Progress
Clearly some new training is required for real handwriting, on a character-level basis. Obtaining the training data for this is still an obstacle, but one stop-gap measure would be making use of the EMNIST dataset, which is publicly available. Its letters split contains roughly 145,000 labeled images of individual characters, drawn by people as they filled out census forms, entering each character into the neat little boxes provided. While these writers were adults rather than children, it would be a place to start in order to train a basic neural network model, even a simple multilayer perceptron to begin with. The advantage of training a model on real handwriting, with all of its variations, compared with typeset fonts would be enormous. The next technical hurdle is that in order to make predictions, such a model would need to take in individual characters one at a time. Automating the determination of bounding boxes to extract those individual letters from an image like the 8-year-old girl’s above is a difficult problem in and of itself.
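To make the multilayer perceptron idea concrete, here is a minimal numpy sketch of the forward pass such a model would compute over flattened 28x28 EMNIST letter images. The 128-unit hidden layer, the random weights, and all shapes are illustrative assumptions; a real model would be trained on the actual dataset:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer perceptron: ReLU hidden layer, then a
    softmax over 26 letter classes."""
    h = np.maximum(0, x @ W1 + b1)              # ReLU hidden activations
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # class probabilities

# Shapes for a batch of two flattened 28x28 images (untrained weights):
rng = np.random.default_rng(0)
x = rng.random((2, 784))
W1, b1 = rng.random((784, 128)) * 0.01, np.zeros(128)
W2, b2 = rng.random((128, 26)) * 0.01, np.zeros(26)
probs = mlp_forward(x, W1, b1, W2, b2)
```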
The “wild west” domain of OCR is, for good reason, a hot topic of research. There are endless meaningful applications to stimulate our imagination and improve the quality of our lives. The mission of Story Squad is an ambitious and highly worthy pursuit. I am honored to have been able to contribute in my own small way to making the vision of its founders into a reality. I can’t wait to see the impact it has on the lives of so many kids (and their parents!) in the near future.
The code referred to in this post can be found here.