
A Simple Stylometry Analyzer


 
(If you do not like your results, please read the summary at the bottom of the page before cursing the developers. Many well-known writers also have a Glue Index of over 40 and overuse common words.)

Number of glue words:          
Glue index is:  
Sample length:

Recommendation:


Overused Words: Some stylometry researchers believe books with too many repetitions sell less well.

-ly adverbs:

was/were:

feel/felt/feels/feeling:

initial -ing:

could:

would:

maybe:

smell/taste:

generic descriptions (watch/notice/observe/very):

had:

just/then:

know/knew:

there/it:

think/believe:

hear/heard:

look:

went:

Pronoun Check (experimental feature - results not guaranteed)


Your text after processing:

Glue words are underlined, and overused words are marked in different colours. This output is meant for statistical analysis, so formatting is simplified.



Frequently Asked Questions

* Google AdSense ads are running on this site. Google can therefore analyse the contents of this page, including your text, but this is ostensibly only for the benign purpose of serving ads that might interest you. For example, if your manuscript is about woodworking, you may see ads for carpentry tools or timber flooring. Advertisers are unlikely to need a copy of your entire manuscript. It is much more likely that they would grab a few keywords so that their software knows which ads to serve, be it for timber flooring or PVC lingerie.

The app itself does not use cookies, does not store anything, and does not send your text or data anywhere. If you do not want Google to analyse your text, disconnect from the Internet after loading this page. The app will work offline. Alternatively, there are ad blockers. The developers of this site do not approve of ad blockers, because without ads there would be no Internet.

Notes

Many of us are curious why some books become bestsellers, while equally well-written works achieve only modest success or languish in obscurity. In 2014 three computer scientists at Stony Brook University decided to find the answer to this burning question. Instead of the intuition, hunches and guesswork used by publishing agents hoping to find the next big earners, the scientists employed stylometry and machine learning to crunch several hundred titles from Project Gutenberg and Amazon.

The results were astonishing. Yes, there are statistically quantifiable differences. Apparently better-selling books contain a higher proportion of nouns and adjectives, while the authors of non-bestsellers are more fond of verbs and adverbs.

One year later, in 2015, authors' forums on the internet were abuzz with discussions of editing software that counted adjectives and pointed out overused words and sticky sentences. One of the things analysed by editing software is the glue index: the percentage of words in a text that are glue words. Apparently the glue index of most bestsellers is no greater than 40%. Words such as "are", "be", "could", "is", "it", "like", "many", "some", "very", "was", "would" and "the" are considered sticky. Many of them are required to hold parts of the sentence together, hence the term "glue". Others are on the list of the 200 most common words in English.

So why is having too many glue words detrimental? Who knows, but perhaps, to a discerning reader's eye, sticky words are less exciting and vivid. Could that be the reason why in his 1840 novel, The Old Curiosity Shop, Charles Dickens refers to Kit's mum alternately as "mother" and "Kit's parent"? It so happens that "mother" is on the list of the 200 most common words after all, although not all proofreading software includes it. (This one doesn't.)

Authors and publishers of less successful books may not like this (it took me several frustrating rewrites to reduce the glue index of the above passage to 40%), but Big Data does not lie. Paste your test sample into the text area above and run the code. It will calculate the glue index and underline glue words. Every care has been taken to include the better-known glue words. The glue index of this text is 40.1%. Please note that the above passage includes a small list of glue words. Once that list is removed, the glue index drops to 36.3%.
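For the curious, here is a minimal Python sketch of the calculation described above. It is not the app's own code. It assumes the glue index is simply the number of glue words divided by the sample length, expressed as a percentage, which agrees with the figures quoted in the Summary (for example, 910 glue words in 1855 words gives 49.06). The word list is only the twelve examples quoted on this page, the function names are made up for illustration, and underscores stand in for underlining.

```python
import re

# Illustrative subset only: the twelve glue words quoted on this page.
# The app's real list is longer and is not reproduced here.
GLUE_WORDS = {"are", "be", "could", "is", "it", "like",
              "many", "some", "very", "was", "would", "the"}

# A word is a run of letters, optionally with one apostrophe inside (e.g. "Kit's").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Split text into lowercase words."""
    return [w.lower() for w in WORD_RE.findall(text)]


def glue_index(glue_count, sample_length):
    """Glue words as a percentage of all words, rounded to two decimals."""
    if sample_length == 0:
        return 0.0
    return round(100.0 * glue_count / sample_length, 2)


def analyse(text, glue_words=GLUE_WORDS):
    """Return (number of glue words, sample length, glue index)."""
    words = tokenize(text)
    glue_count = sum(1 for w in words if w in glue_words)
    return glue_count, len(words), glue_index(glue_count, len(words))


def underline_glue(text, glue_words=GLUE_WORDS):
    """Wrap glue words in underscores, a plain-text stand-in for underlining."""
    def mark(match):
        word = match.group(0)
        return f"_{word}_" if word.lower() in glue_words else word
    return WORD_RE.sub(mark, text)
```

With this short list, analyse("It was the best of times, it was the worst of times.") returns (6, 12, 50.0), and underline_glue("It was late.") returns "_It_ _was_ late.". A longer word list would of course give different numbers.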

Several commonly overused words (for instance, adverbs ending with -ly) are also counted and marked in the text, and a recommendation is made. These words are typically analysed by other proofreading software, and we make similar recommendations regarding the optimal proportion of those words.

But before you become dispirited over the sheer inability to bring the glue index under control, please consider that even Stony Brook's study was only 84% accurate in identifying bestsellers. The above application looks at only a few limited aspects of the data and is therefore bound to be even less precise. Replacing every occurrence of the word "mother" with "female parent" may reduce the glue index of a story about a parent-child relationship, but it would certainly not improve sales.
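As a rough illustration of how such a recommendation could be derived, here is a small Python sketch for -ly adverbs. It is an assumption, not the app's actual code. It treats the adverb index as -ly words per 100 words and uses the maximum recommended index of 1.87 quoted in the Summary. The rounding rule (drop the fraction of the excess) is inferred because it reproduces the two figures reported there: 20 out of 441 in 22483 words, and 26 out of 111 in 4525 words. Counting every word that ends in -ly is deliberately naive, so "family" and "only" would be counted too.

```python
import math
import re

# Maximum recommended -ly adverb index, as quoted in the Summary on this page.
MAX_LY_INDEX = 1.87

# Naive matcher: any word of three or more letters ending in "ly".
LY_RE = re.compile(r"\b[A-Za-z]+ly\b", re.IGNORECASE)


def count_ly_words(text):
    """Count words ending in -ly (adverbs and false positives alike)."""
    return len(LY_RE.findall(text))


def ly_index(ly_count, sample_length):
    """-ly words per 100 words of sample."""
    if sample_length == 0:
        return 0.0
    return 100.0 * ly_count / sample_length


def suggested_removals(ly_count, sample_length, max_index=MAX_LY_INDEX):
    """How many -ly words to cut to get back under the maximum index.

    The fraction of the excess is dropped; this rule is an inference
    from the figures on this page, not a documented behaviour.
    """
    allowed = max_index * sample_length / 100.0
    return max(0, math.floor(ly_count - allowed))
```

For the Christie sample, round(ly_index(441, 22483), 2) gives 1.96 and suggested_removals(441, 22483) gives 20. A sample that is already under the threshold gets a recommendation of 0.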



Summary

Don't feel too bad if your Glue Index is over 40%, or if you are slightly overusing a particular word. We tested a 4640-word excerpt of David Copperfield: Chapter I by Charles Dickens. It has a Glue Index of 49.42%. The software also recommends removal of 21 out of 105 "was/were" instances, and of 3 out of 17 generic descriptions (watch/notice/observe/very).

It may well be that this vaunted index of 40 is set for generic essays, or even business writing. I tested several of my own business letters and emails, and there the index hovers below 40%. Now, I've been praised for my pithy style and grammar at work, so much so that colleagues ask me to edit their letters, so my business writing must be fine. However, in novels, where there is dialogue and character thought, there is bound to be a higher proportion of the 200 most common words in English than in business correspondence.

Testing an 8012-word excerpt of Bleak House: Chapter III brought even bleaker results. The glue index peaked at 50.24, and the app registered a significant overuse of "was/were", "feel/felt/feels/feeling" and generic descriptions.

Hard Times: Chapter II, Murdering the Innocents. An excerpt from the novel by Charles Dickens produced these results: Glue index is: 48.16. Sample length: 2064. No overuse of "was/were". The software recommends removal of 2 out of 8 generic descriptions. "smell/taste": consider removing 1 out of 2.

Great Expectations: Chapter I. An excerpt from the novel by Charles Dickens. Number of glue words: 910. Glue index is: 49.06. Sample length: 1855. Consider removing 2 out of 35 "was/were" appearances. Consider removing 1 out of 7 "could" words.

All the Dickens samples had an adverb index under the recommended threshold.

The first 6 chapters of Lord Jim by Joseph Conrad are 23710 words long and have a glue index of 49.8. This software found no overused words here. It looks like the app likes Conrad a lot.

The Secret Adversary by Agatha Christie has a glue index of 47.3 in the first 22483 words. The app also recommends removal of 20 out of 441 adverbs ending with -ly. Her adverb index is 1.96.

Obviously, using statistical stylometry to analyse modern well-selling novels is more fun. Readers' expectations and tastes change over the centuries. What sold well in 1840, 1900 and 1930 may not tickle the fancy of a mass-market reader in 2016. So we tested an excerpt of a book by an established bestselling US male author of horror fiction and obtained the following results:

Number of glue words: 2288. Glue index is: 46.94. Sample length: 4874. The app recommends removing 14 out of 102 "was/were" instances. Otherwise there were no overused words and the -ly adverb index is very low at 0.903 (maximum recommended is 1.87). David Copperfield by Dickens has it at 1.595 and Christie scored 1.96.

A 6091-word sample from an emerging bestselling US male author with over 22 thousand mainly 5-star reviews on Amazon produced the following results: Glue index: 47.12, low -ly adverb index of 1.1. Significant overuse of sentences beginning with -ing. (The app recommends removing 6 out of 10.)

A 4525-word excerpt of a novel by a bestselling UK female author produced the following results: Glue index is: 42.14, and some overused words were found. The adverb index is very high at 2.453. The app recommends removing 26 out of 111 -ly adverbs. From this small sample it appears that female writers tend to use more adverbs ending with -ly than male writers. However, this last author achieved phenomenal commercial success. Test whether your writing is male or female with this tool: http://www.hackerfactor.com/GenderGuesser.php#Analyze

So, we can see that even bestselling authors have a glue index of over 40 and overuse some words. Less successful authors overuse words on a massive scale, though. Bear this in mind when editing. Bend the rules, but don't break them. However, three of the writers listed above are among my favourites, and I enjoy reading them, regardless of their Glue Index.

Let us not throw common sense to the wind, just because the computer said so. And pass the rosy wine.



Support this site by buying my books

Roula: The Girl in the Machine     Artificial Intelligence takes over the world by stealth. Suitable for children.
House of Cain: A War Novel     Russia versus NATO war. Contains violence and sex. Not suitable for children.


Legal Disclaimer. This software is provided as open source, but it is not "free software". This software may not be used for commercial purposes, may not be redistributed on another web site, and may not be reproduced in a different medium in whole or in part without the explicit written permission of the author.
THIS SOFTWARE AVAILABLE ON THE SITE IS PROVIDED "AS IS" AND ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE DEVELOPERS OR ANY OF THEIR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

© 2016 HappySnailMedia