Congratulations!

[Valid RSS] This is a valid RSS feed.

Recommendations

This feed is valid, but interoperability with the widest range of feed readers could be improved by implementing the following recommendations.

Source: https://www.peterbe.com/rss.xml

  1. <?xml version="1.0" encoding="utf-8"?>
  2. <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Peterbe.com</title><link>https://www.peterbe.com/rss.xml</link><description>Stuff in Peter's head</description><atom:link href="https://www.peterbe.com/rss.xml" rel="self"></atom:link><language>en-us</language><lastBuildDate>Thu, 14 Nov 2019 17:19:25 +0000</lastBuildDate><item><title>MDN Documents Size Tree Map</title><link>https://www.peterbe.com/plog/mdn-documents-size-tree-map</link><description>&lt;p&gt;Recently I've been playing with the content of &lt;a href="https://developer.mozilla.org"&gt;MDN&lt;/a&gt; as a whole. MDN has ~140k documents in its Wiki. About ~70k of them are redirects which is the result of many years of switching tech and switching information architecture and at the same time being good Internet citizens and avoiding 404s. So, out of the ~70k documents, how do they spread? To answer that I wrote a Python script that evaluates size as a matter of the sum of all the files in sub-trees including pictures.&lt;/p&gt;
  3. &lt;p&gt;Here are the screenshots:&lt;/p&gt;
  4. &lt;p&gt;&lt;em&gt;All locales&lt;/em&gt;&lt;/p&gt;
  5. &lt;p&gt;&lt;a href="/cache/1a/44/1a4437fed41435eba2811565a7f7e4ff.png"&gt;&lt;img src="/cache/4f/32/4f3212ca14b3d6e07369f3904c72605f.png" alt="All locales"&gt;&lt;/a&gt;&lt;/p&gt;
  6. &lt;p&gt;&lt;em&gt;Specifically en-US&lt;/em&gt;&lt;/p&gt;
  7. &lt;p&gt;&lt;a href="/cache/51/8c/518c71e9c0be0cb04de96cd911fbb5e5.png"&gt;&lt;img src="/cache/98/01/98012ab8d0b62e774dfdd2cb21f64d65.png" alt="Specifically en-US"&gt;&lt;/a&gt;&lt;/p&gt;
  8. &lt;p&gt;The code that puts this together uses &lt;a href="https://ui.toast.com/tui-chart/"&gt;Toast UI&lt;/a&gt; which seems cool but I didn't spend much time worrying about how to use it.&lt;/p&gt;
  9. &lt;p&gt;Be warned! Opening this link will make your browser sweat: &lt;a href="https://8mw9v.csb.app/"&gt;https://8mw9v.csb.app/&lt;/a&gt;&lt;/p&gt;
  10. &lt;p&gt;You can fork it here: &lt;a href="https://codesandbox.io/s/zen-swirles-8mw9v"&gt;https://codesandbox.io/s/zen-swirles-8mw9v&lt;/a&gt;&lt;/p&gt;</description><pubDate>Thu, 14 Nov 2019 17:19:25 +0000</pubDate><guid>https://www.peterbe.com/plog/mdn-documents-size-tree-map</guid></item><item><title>Avoid async  when all you have is (SSD) disk I/O in NodeJS</title><link>https://www.peterbe.com/plog/avoid-async-disk-io-in-nodejs</link><description>tl;dr; If you know that the only I/O you have is disk and the disk is SSD, then synchronous is probably more convenient, faster, and more memory lean.</description><pubDate>Thu, 24 Oct 2019 20:43:42 +0000</pubDate><guid>https://www.peterbe.com/plog/avoid-async-disk-io-in-nodejs</guid></item><item><title>Update to speed comparison for Redis vs PostgreSQL storing blobs of JSON</title><link>https://www.peterbe.com/plog/update-to-speed-comparison-for-redis-vs-postgresql-storing-blobs-of-json</link><description>&lt;p&gt;Last week, I blogged about &lt;a href="/plog/redis-vs-postgres-blob-of-json"&gt;"How much faster is Redis at storing a blob of JSON compared to PostgreSQL?"&lt;/a&gt;. Judging from a lot of comments, people misinterpreted this. (By the way, Redis &lt;em&gt;is&lt;/em&gt; persistent). It's no surprise that Redis is faster.&lt;/p&gt;
  11. &lt;p&gt;However, it's a fact that I do have a lot of blobs stored and need to present them via the web API as fast as possible. It's rare that I want to do relational or batch operations on the data. But Redis isn't a slam dunk for simple retrieval because I don't know if I trust its integrity with the 3GB worth of data that I both don't want to lose and don't want to load all into RAM.&lt;/p&gt;
  12. &lt;p&gt;&lt;strong&gt;But is it entirely wrong to look at WHICH database to get the best speed?&lt;/strong&gt;&lt;/p&gt;
  13. &lt;p&gt;Reviewing this corner of &lt;a href="https://songsear.ch"&gt;Song Search&lt;/a&gt; helped me rethink this. PostgreSQL is, in my view, a better database for storing stuff. Redis is faster for individual lookups. But you know what's even faster? &lt;strong&gt;Nginx&lt;/strong&gt;&lt;/p&gt;
  14. &lt;h3&gt;Nginx??&lt;/h3&gt;
  15. &lt;p&gt;The way the application works is that a React web app is requesting the Amazon product data for the sake of presenting an appropriate affiliate link. This is done by the browser essentially doing:&lt;/p&gt;
  16. &lt;div class="highlight"&gt;
  17.  
  18. &lt;pre&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;https://songsear.ch/api/song/5246889/amazon&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  19. &lt;/pre&gt;&lt;/div&gt;
  20.  
  21. &lt;p&gt;Internally, the app looks this up, by ID, on the &lt;code&gt;AmazonAffiliateLookup&lt;/code&gt; ORM model. If it isn't in PostgreSQL yet, it uses the Amazon Affiliate Product Details API to look it up, and when the results come in it stores a copy in PostgreSQL so we can re-use this URL without hitting rate limits on the Product Details API. Lastly, in a piece of Django view code, it carefully scrubs and repackages this result so that only the fields used by the React rendering code are shipped between the server and the browser. That "scrubbed" piece of data is actually much smaller, partly because it limits the results to the first/best match and partly because it deletes a bunch of things that are never needed such as &lt;code&gt;ProductTypeName&lt;/code&gt;, &lt;code&gt;Studio&lt;/code&gt;, &lt;code&gt;TrackSequence&lt;/code&gt; etc. The proportion is roughly 23x, i.e. of the 3GB of JSON blobs stored in PostgreSQL only 130MB is ever transported from the server to the users.&lt;/p&gt;
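&lt;p&gt;The scrubbing step can be sketched roughly like this in Python. It is only an illustration: the shape of the stored blob (a list of flat dicts, best match first) and the &lt;code&gt;scrub&lt;/code&gt; function name are assumptions, and only the three field names come from the text above.&lt;/p&gt;

```python
import copy

# Fields named in the post as never needed by the front end.
UNUSED_FIELDS = ("ProductTypeName", "Studio", "TrackSequence")


def scrub(results):
    """Keep only the first (best) match and drop fields the UI never reads.

    Assumes `results` is a list of flat dicts, best match first (hypothetical shape).
    """
    if not results:
        return None
    best = copy.deepcopy(results[0])  # leave the stored blob untouched
    for field in UNUSED_FIELDS:
        best.pop(field, None)  # ignore fields that are already absent
    return best
```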
  22. &lt;h3&gt;Again, Nginx?&lt;/h3&gt;
  23. &lt;p&gt;Nginx has a built-in &lt;a href="https://www.nginx.com/blog/nginx-caching-guide/"&gt;reverse HTTP proxy cache&lt;/a&gt; which is easy to set up but a bit hard to do purges on. The biggest flaw, in my view, is that it's hard to get a handle on how much RAM it's eating up. Well, if the &lt;em&gt;total&lt;/em&gt; possible amount of data within the server is 130MB, then that is something I'm perfectly comfortable letting Nginx cache in RAM.&lt;/p&gt;
  24. &lt;p&gt;Good HTTP performance benchmarking is hard to do but here's a teaser from my local laptop version of Nginx:&lt;/p&gt;
  25. &lt;pre&gt;▶ hey -n 10000 -c 10 https://songsearch.local/api/song/1810960/affiliate/amazon-itunes
  26.  
  27. Summary:
  28.  Total:    0.9882 secs
  29.  Slowest:  0.0279 secs
  30.  Fastest:  0.0001 secs
  31.  Average:  0.0010 secs
  32.  Requests/sec: 10119.8265
  33.  
  34.  
  35. Response time histogram:
  36.  0.000 [1] |
  37.  0.003 [9752]  |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  38.  0.006 [108]   |
  39.  0.008 [70]    |
  40.  0.011 [32]    |
  41.  0.014 [8] |
  42.  0.017 [12]    |
  43.  0.020 [11]    |
  44.  0.022 [1] |
  45.  0.025 [4] |
  46.  0.028 [1] |
  47.  
  48.  
  49. Latency distribution:
  50.  10% in 0.0003 secs
  51.  25% in 0.0006 secs
  52.  50% in 0.0008 secs
  53.  75% in 0.0010 secs
  54.  90% in 0.0013 secs
  55.  95% in 0.0016 secs
  56.  99% in 0.0068 secs
  57.  
  58. Details (average, fastest, slowest):
  59.  DNS+dialup:   0.0000 secs, 0.0001 secs, 0.0279 secs
  60.  DNS-lookup:   0.0000 secs, 0.0000 secs, 0.0026 secs
  61.  req write:    0.0000 secs, 0.0000 secs, 0.0011 secs
  62.  resp wait:    0.0008 secs, 0.0001 secs, 0.0206 secs
  63.  resp read:    0.0001 secs, 0.0000 secs, 0.0013 secs
  64.  
  65. Status code distribution:
  66.  [200] 10000 responses&lt;/pre&gt;
  67.  
  68. &lt;p&gt;10,000 requests across 10 clients at roughly 10,000 requests per second. That includes doing all the HTTP parsing, WSGI stuff, forming of a SQL or Redis query, the deserialization, the Django JSON HTTP response serialization etc. The cache TTL is controlled by simply setting a &lt;code&gt;Cache-Control&lt;/code&gt; HTTP header with something like &lt;code&gt;max-age=86400&lt;/code&gt;.&lt;/p&gt;
  69. &lt;p&gt;Now, repeated fetches for this are cached at the Nginx level and it means it doesn't even matter how slow/fast the database is. As long as it's not taking seconds, with a long &lt;code&gt;Cache-Control&lt;/code&gt;, Nginx can hold on to this in RAM for days or until the whole server is restarted (which is rare).&lt;/p&gt;
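&lt;p&gt;As a minimal sketch of the header side of this, here is a bare WSGI callable that sets a one-day &lt;code&gt;Cache-Control&lt;/code&gt;. It stands in for the Django view, which is not shown here; the &lt;code&gt;app&lt;/code&gt; name and the payload fields are made up for illustration, and the Nginx cache configuration itself is not covered.&lt;/p&gt;

```python
import json

ONE_DAY_SECONDS = 86400  # same value as the max-age=86400 example


def app(environ, start_response):
    """Tiny WSGI app that returns JSON with a long-lived Cache-Control header."""
    # Hypothetical scrubbed payload; the real fields are not shown in the post.
    payload = {"title": "Example song", "url": "https://example.com/affiliate"}
    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
        # A caching reverse proxy can use this to decide how long to keep the response.
        ("Cache-Control", "max-age={}".format(ONE_DAY_SECONDS)),
    ]
    start_response("200 OK", headers)
    return [body]
```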
  70. &lt;h3&gt;Conclusion&lt;/h3&gt;
  71. &lt;p&gt;If the total amount of data that can and will be cached is controlled, putting it in an HTTP reverse proxy cache is probably an order of magnitude faster than messing with choosing which database to use.&lt;/p&gt;</description><pubDate>Mon, 30 Sep 2019 03:06:41 +0000</pubDate><guid>https://www.peterbe.com/plog/update-to-speed-comparison-for-redis-vs-postgresql-storing-blobs-of-json</guid></item><item><title>How much faster is Redis at storing a blob of JSON compared to PostgreSQL?</title><link>https://www.peterbe.com/plog/redis-vs-postgres-blob-of-json</link><description>tl;dr; Redis is 16 times faster at reading these JSON blobs.*</description><pubDate>Sat, 28 Sep 2019 15:50:47 +0000</pubDate><guid>https://www.peterbe.com/plog/redis-vs-postgres-blob-of-json</guid></item><item><title>uwsgi weirdness with --http</title><link>https://www.peterbe.com/plog/uwsgi-weirdness-with---http</link><description>&lt;p&gt;Instead of upgrading everything on my server, I'm just starting from scratch. From Ubuntu 16.04 to Ubuntu 19.04 and I also upgraded everything else in sight. One of them was &lt;code&gt;uwsgi&lt;/code&gt;. I copied various user config files but for &lt;code&gt;uwsgi&lt;/code&gt; things didn't go very well. On the old server I had &lt;code&gt;uwsgi&lt;/code&gt; version &lt;code&gt;2.0.12-debian&lt;/code&gt; and on the new one &lt;code&gt;2.0.18-debian&lt;/code&gt;. The &lt;a href="https://uwsgi-docs.readthedocs.io/en/latest/index.html#stable-releases"&gt;uWSGI changelog&lt;/a&gt; is pretty hard to read but I sure don't see any mention of this.&lt;/p&gt;
  72. &lt;p&gt;You see, on &lt;a href="https://songsear.ch"&gt;SongSearch&lt;/a&gt; I have it so that Nginx talks to Django via a uWSGI socket. But the NodeJS server talks to Django via &lt;code&gt;127.0.0.1:PORT&lt;/code&gt;. So I need my uWSGI config to start both. Here was the old config:&lt;/p&gt;
  73. &lt;pre&gt;[uwsgi]
  74. plugins = python35
  75. virtualenv = /var/lib/django/songsearch/venv
  76. pythonpath = /var/lib/django/songsearch
  77. user = django
  78. uid = django
  79. master = true
  80. processes = 3
  81. enable-threads = true
  82. touch-reload = /var/lib/django/songsearch/uwsgi-reload.touch
  83. http = 127.0.0.1:9090
  84. module = songsearch.wsgi:application
  85. env = LANG=en_US.utf8
  86. env = LC_ALL=en_US.UTF-8
  87. env = LC_LANG=en_US.UTF-8&lt;/pre&gt;
  88.  
  89. &lt;p&gt;(The only difference on the new server was the &lt;code&gt;python37&lt;/code&gt; plugin instead)&lt;/p&gt;
  90. &lt;p&gt;I start it and everything looks fine. No errors in the log files. And &lt;code&gt;netstat&lt;/code&gt; looks like this:&lt;/p&gt;
  91. &lt;pre&gt;# netstat -ntpl | grep 9090
  92. tcp        0      0 127.0.0.1:9090          0.0.0.0:*               LISTEN      1855/uwsgi&lt;/pre&gt;
  93.  
  94. &lt;p&gt;But every time I tried to &lt;code&gt;curl localhost:9090&lt;/code&gt; I kept getting &lt;code&gt;curl: (52) Empty reply from server&lt;/code&gt;. Nothing in the log files! It seemed no matter what I tried I just couldn't talk to it over HTTP. No, I'm not a sysadmin. I'm just a hobbyist trying to stand up my little server with the tools and limited techniques I know but I was stumped.&lt;/p&gt;
  95. &lt;h3&gt;The solution&lt;/h3&gt;
  96. &lt;p&gt;After endless Googling for a resolution and trying all sorts of &lt;code&gt;uwsgi&lt;/code&gt; commands directly, I somehow stumbled on the solution.&lt;/p&gt;
  97. &lt;div class="highlight"&gt;
  98.  
  99. &lt;pre&gt;[uwsgi]
  100. plugins = python35
  101. virtualenv = /var/lib/django/songsearch/venv
  102. pythonpath = /var/lib/django/songsearch
  103. user = django
  104. uid = django
  105. master = true
  106. processes = 3
  107. enable-threads = true
  108. touch-reload = /var/lib/django/songsearch/uwsgi-reload.touch
  109. &lt;span class="gd"&gt;-http = 127.0.0.1:9090&lt;/span&gt;
  110. &lt;span class="gi"&gt;+http-socket = 127.0.0.1:9090&lt;/span&gt;
  111. module = songsearch.wsgi:application
  112. env = LANG=en_US.utf8
  113. env = LC_ALL=en_US.UTF-8
  114. env = LC_LANG=en_US.UTF-8
  115. &lt;/pre&gt;&lt;/div&gt;
  116.  
  117. &lt;p&gt;With this one subtle change, I can now &lt;code&gt;curl localhost:9090&lt;/code&gt; &lt;em&gt;and&lt;/em&gt; I still have the &lt;code&gt;/var/run/uwsgi/app/songsearch/socket&lt;/code&gt; socket.  So, yay!&lt;/p&gt;
  118. &lt;p&gt;I'm blogging about this in case someone else ever gets stuck in the same nasty surprise as me.&lt;/p&gt;
  119. &lt;p&gt;Also, I have to admit, I was fuming with rage from this frustration. It's really inspired me to revive the quest for an alternative to &lt;code&gt;uwsgi&lt;/code&gt; because I'm not sure it's that great anymore. There are new alternatives such as &lt;code&gt;gunicorn&lt;/code&gt;, &lt;code&gt;gunicorn&lt;/code&gt; with &lt;code&gt;Meinheld&lt;/code&gt;, &lt;code&gt;bjoern&lt;/code&gt; etc.&lt;/p&gt;</description><pubDate>Thu, 19 Sep 2019 13:20:30 +0000</pubDate><guid>https://www.peterbe.com/plog/uwsgi-weirdness-with---http</guid></item><item><title>Fastest Python function to slugify a string</title><link>https://www.peterbe.com/plog/fastest-python-function-to-slugify-a-string</link><description>&lt;p&gt;In MDN I noticed &lt;a href="https://github.com/mozilla/kuma/blob/d7381720c8057ae42eb37738d9109715f8b6ce97/kuma/wiki/content.py#L577-L592"&gt;a function that turns a piece of text (Python 2 unicode) into a slug&lt;/a&gt;. It looks like this:&lt;/p&gt;
  120. &lt;div class="highlight"&gt;
  121.  
  122. &lt;pre&gt;    &lt;span class="n"&gt;non_url_safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;quot;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;#&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;$&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;%&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;amp;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  123.                    &lt;span class="s1"&gt;&amp;#39;,&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;/&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;:&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;=&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;?&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  124.                    &lt;span class="s1"&gt;&amp;#39;@&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;[&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;]&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;^&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;`&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  125.                    &lt;span class="s1"&gt;&amp;#39;{&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;|&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;}&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;~&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;&amp;#39;&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  126.  
  127.    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;slugify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  128.        &lt;span class="sd"&gt;&amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
  129. &lt;span class="sd"&gt;        Turn the text content of a header into a slug for use in an ID&lt;/span&gt;
  130. &lt;span class="sd"&gt;        &amp;quot;&amp;quot;&amp;quot;&lt;/span&gt;
  131.        &lt;span class="n"&gt;non_safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;non_url_safe&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  132.        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;non_safe&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  133.            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;non_safe&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  134.                &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  135.        &lt;span class="c1"&gt;# Strip leading, trailing and multiple whitespace, convert remaining whitespace to _&lt;/span&gt;
  136.        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  137.        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
  138. &lt;/pre&gt;&lt;/div&gt;
  139.  
  140. &lt;p&gt;The code is 7-8 years old and relates to a migration when MDN was created as a Python fork from an existing PHP solution.&lt;/p&gt;
  141. &lt;p&gt;I couldn't help but react to the fact that it's a list and it's looped over every single time. Twice, in a sense. Python has built-in tools for this kinda stuff. Let's see if I can make it faster.&lt;/p&gt;
  142. &lt;h3&gt;The candidates&lt;/h3&gt;
  143. &lt;div class="highlight"&gt;
  144.  
  145. &lt;pre&gt;&lt;span class="n"&gt;translate_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;ord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;char&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;char&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;non_url_safe&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
  146. &lt;span class="n"&gt;non_url_safe_regex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  147.    &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;[{}]&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;non_url_safe&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
  148.  
  149.  
  150. &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_slugify1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  151.    &lt;span class="n"&gt;non_safe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;non_url_safe&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  152.    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;non_safe&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  153.        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;non_safe&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  154.            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  155.    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  156.    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
  157.  
  158. &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_slugify2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  159.    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;translate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;translate_table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  160.    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  161.    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
  162.  
  163. &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_slugify3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  164.    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;non_url_safe_regex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  165.    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;u&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;_&amp;#39;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;\s+&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  166.    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
  167. &lt;/pre&gt;&lt;/div&gt;
  168.  
  169. &lt;p&gt;I wrote a thing that would call each one of the candidates, assert that their outputs always match and store how long each one took.&lt;/p&gt;
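&lt;p&gt;The comparison script itself isn't included, but a minimal sketch of such a harness might look like the following. Assumptions: the candidates are rewritten as plain module-level functions (no &lt;code&gt;self&lt;/code&gt;), the character list is a shortened, illustrative copy of &lt;code&gt;non_url_safe&lt;/code&gt;, and the &lt;code&gt;benchmark&lt;/code&gt; helper and sample strings are invented for this example.&lt;/p&gt;

```python
import re
import time

# Shortened, illustrative copy of the post's non_url_safe list.
NON_URL_SAFE = ['"', '#', '$', '%', '+', ',', '/', ':', ';', '=', '?', '@']
TRANSLATE_TABLE = {ord(char): '' for char in NON_URL_SAFE}
NON_URL_SAFE_REGEX = re.compile(
    '[{}]'.format(''.join(re.escape(x) for x in NON_URL_SAFE)))


def slugify1(text):
    # Original approach: find the offending characters, then replace each one.
    for c in [c for c in text if c in NON_URL_SAFE]:
        text = text.replace(c, '')
    return '_'.join(text.split())


def slugify2(text):
    # Translate table: delete all offending characters in one pass.
    return '_'.join(text.translate(TRANSLATE_TABLE).split())


def slugify3(text):
    # Regex: strip the characters, then split on runs of whitespace.
    text = NON_URL_SAFE_REGEX.sub('', text).strip()
    return '_'.join(re.split(r'\s+', text))


def benchmark(candidates, texts, repeat=100):
    """Check that all candidates agree, then return total seconds per function name."""
    for text in texts:
        expected = candidates[0](text)
        for func in candidates:
            assert func(text) == expected, (func.__name__, text)
    timings = {}
    for func in candidates:
        start = time.perf_counter()
        for _ in range(repeat):
            for text in texts:
                func(text)
        timings[func.__name__] = time.perf_counter() - start
    return timings


if __name__ == '__main__':
    samples = ['  Hello, world?  ', 'What is 50% of $10', 'a/b : c;d']
    for name, seconds in benchmark([slugify1, slugify2, slugify3], samples).items():
        print('{} {:.3f}ms'.format(name, seconds * 1000))
```

&lt;p&gt;Absolute timings from a sketch like this depend on the machine and on the input strings, so they are not expected to match the numbers reported in the results.&lt;/p&gt;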
  170. &lt;h3&gt;The results&lt;/h3&gt;
  171. &lt;p&gt;&lt;strong&gt;The slowest is fast enough.&lt;/strong&gt; But if you're still reading, here are the results:&lt;/p&gt;
  172. &lt;pre&gt;_slugify1 0.101ms
  173. _slugify2 0.019ms
  174. _slugify3 0.033ms&lt;/pre&gt;
  175.  
  176. &lt;p&gt;So &lt;strong&gt;using a translate table is 5 times faster. And a regex 3 times faster&lt;/strong&gt;.  But they're all sufficiently fast.&lt;/p&gt;
  177. &lt;h3&gt;Conclusion&lt;/h3&gt;
  178. &lt;p&gt;This is the least of your problems in a world of real I/O such as databases and other genuinely CPU intense stuff. Well, it was a fun little side-trip.&lt;/p&gt;
  179. &lt;p&gt;Also, aren't there better solutions that just blacklist &lt;em&gt;all&lt;/em&gt; control characters?&lt;/p&gt;</description><pubDate>Thu, 12 Sep 2019 20:20:32 +0000</pubDate><guid>https://www.peterbe.com/plog/fastest-python-function-to-slugify-a-string</guid></item><item><title>NodeJS fs walk() or glob or fast-glob</title><link>https://www.peterbe.com/plog/nodejs-fs-walk-or-glob-or-fast-glob</link><description>&lt;p&gt;It started with this:&lt;/p&gt;
  180. &lt;div class="highlight"&gt;
  181.  
  182. &lt;pre&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nx"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;filepaths&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  183.    &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;readdirSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  184.    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  185.        &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filepath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;directory&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  186.        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;statSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;isDirectory&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  187.            &lt;span class="nx"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;filepaths&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  188.        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;extname&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filename&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="s1"&gt;&amp;#39;.md&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  189.            &lt;span class="nx"&gt;filepaths&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  190.        &lt;span class="p"&gt;}&lt;/span&gt;
  191.    &lt;span class="p"&gt;}&lt;/span&gt;
  192.    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;filepaths&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  193. &lt;span class="p"&gt;}&lt;/span&gt;
  194. &lt;/pre&gt;&lt;/div&gt;
  195.  
  196. &lt;p&gt;And you use it like this:&lt;/p&gt;
  197. &lt;div class="highlight"&gt;
  198.  
  199. &lt;pre&gt;&lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;foundFiles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;someDirectoryOfMine&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  200. &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;foundFiles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  201. &lt;/pre&gt;&lt;/div&gt;
  202.  
  203. &lt;p&gt;I thought, perhaps it's faster or better to use &lt;a href="https://www.npmjs.com/package/glob"&gt;&lt;code&gt;glob&lt;/code&gt;&lt;/a&gt;. So I installed that.&lt;br /&gt;
  204. Then I found &lt;a href="https://www.npmjs.com/package/fast-glob"&gt;&lt;code&gt;fast-glob&lt;/code&gt;&lt;/a&gt;, which sounds faster. You use both in a synchronous way.&lt;/p&gt;
  205. &lt;p&gt;I have a directory with about 450 files, of which 320 are &lt;code&gt;.md&lt;/code&gt; files. Let's compare:&lt;/p&gt;
  206. &lt;pre&gt;walk: 10.212ms
  207. glob: 37.492ms
  208. fg: 14.200ms&lt;/pre&gt;
  209.  
  210. &lt;p&gt;I measured it using &lt;code&gt;console.time&lt;/code&gt; like this:&lt;/p&gt;
  211. &lt;div class="highlight"&gt;
  212.  
  213. &lt;pre&gt;&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;walk&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  214. &lt;span class="kr"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;foundFiles&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;walk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;someDirectoryOfMine&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  215. &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;walk&amp;#39;&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  216. &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;foundFiles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  217. &lt;/pre&gt;&lt;/div&gt;
  218.  
  219. &lt;p&gt;I suppose those packages have other fancier features, but I guess this just goes to show: keep it simple.&lt;/p&gt;</description><pubDate>Sat, 31 Aug 2019 02:25:53 +0000</pubDate><guid>https://www.peterbe.com/plog/nodejs-fs-walk-or-glob-or-fast-glob</guid></item><item><title>Train your own spell corrector with TextBlob</title><link>https://www.peterbe.com/plog/train-your-own-spell-corrector-with-textblob</link><description>&lt;p&gt;&lt;a href="https://textblob.readthedocs.io/en/dev/quickstart.html#spelling-correction"&gt;TextBlob&lt;/a&gt; is a wonderful Python library. It wraps &lt;a href="https://pypi.org/project/nltk/"&gt;&lt;code&gt;nltk&lt;/code&gt;&lt;/a&gt; with a really pleasant API. Out of the box, you get a spell-corrector. From the tutorial:&lt;/p&gt;
  220. &lt;div class="highlight"&gt;
  221.  
  222. &lt;pre&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;textblob&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextBlob&lt;/span&gt;
  223. &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TextBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot;I havv goood speling!&amp;quot;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  224. &lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  225. &lt;span class="s1"&gt;&amp;#39;I have good spelling!&amp;#39;&lt;/span&gt;
  226. &lt;/pre&gt;&lt;/div&gt;
  227.  
  228. &lt;p&gt;The way it works is that the library ships with this text file: &lt;a href="https://github.com/sloria/TextBlob/blob/dev/textblob/en/en-spelling.txt"&gt;en-spelling.txt&lt;/a&gt;. It's about 30,000 lines long and looks like this:&lt;/p&gt;
  229. &lt;pre&gt;;;;   Based on several public domain books from Project Gutenberg
  230. ;;;   and frequency lists from Wiktionary and the British National Corpus.
  231. ;;;   http://norvig.com/big.txt
  232. ;;;  
  233. a 21155
  234. aah 1
  235. aaron 5
  236. ab 2
  237. aback 3
  238. abacus 1
  239. abandon 32
  240. abandoned 72
  241. abandoning 27&lt;/pre&gt;
  242.  
  243. &lt;p&gt;That gave me an idea! How about I use the &lt;code&gt;TextBlob&lt;/code&gt; API but bring my own text as the training model. It doesn't have to be all that complicated.&lt;/p&gt;
  244. &lt;h3&gt;The challenge&lt;/h3&gt;
  245. &lt;p&gt;(Note: All the code I used for this demo is available here: &lt;a href="https://github.com/peterbe/spellthese"&gt;github.com/peterbe/spellthese&lt;/a&gt;)&lt;/p&gt;
  246. &lt;p&gt;I found &lt;a href="https://www.verywellfamily.com/top-1000-baby-boy-names-2757618"&gt;this site&lt;/a&gt; that lists "Top 1,000 Baby Boy Names". From that list, randomly pick a couple and mess with their spelling. Like, remove letters, add letters, and swap letters.&lt;/p&gt;
  247. &lt;p&gt;So, 5 random names now look like this:&lt;/p&gt;
  248. &lt;pre&gt;▶ python challenge.py
  249. RIGHT: jameson  TYPOED: jamesone
  250. RIGHT: abel     TYPOED: aabel
  251. RIGHT: wesley   TYPOED: welsey
  252. RIGHT: thomas   TYPOED: thhomas
  253. RIGHT: bryson   TYPOED: brysn&lt;/pre&gt;
  254.  
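A minimal sketch of the typo generation described above, assuming exactly one random edit per name (remove a letter, add a letter, or swap two adjacent letters). The function name make_typo is hypothetical, and this is not the code from the linked challenge.py:

```python
import random
import string


def make_typo(name, rng):
    """Return `name` with one random edit applied.

    Hypothetical helper for illustration; assumes `name` has at least
    two letters so that a swap is always possible.
    """
    kind = rng.choice(["remove", "add", "swap"])
    if kind == "remove":
        i = rng.randrange(len(name))
        return name[:i] + name[i + 1:]
    if kind == "add":
        i = rng.randrange(len(name) + 1)
        return name[:i] + rng.choice(string.ascii_lowercase) + name[i:]
    # Swap two adjacent letters.
    i = rng.randrange(len(name) - 1)
    return name[:i] + name[i + 1] + name[i] + name[i + 2:]


rng = random.Random(0)  # seeded only so repeated runs give the same typos
for name in ["jameson", "abel", "wesley"]:
    print("RIGHT:", name, " TYPOED:", make_typo(name, rng))
```

A swap of two identical neighbouring letters leaves the name unchanged, so a stricter version might retry until the result differs from the input.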
  255. &lt;p&gt;Imagine some application, where fat-fingered users typo those names on the right-hand side, and your job is to map that back to the correct spelling.&lt;/p&gt;
  256. &lt;p&gt;First, let's use the built-in &lt;code&gt;TextBlob.correct&lt;/code&gt;. A bit simplified but it looks like this:&lt;/p&gt;
  257. &lt;div class="highlight"&gt;
  258.  
  259. &lt;pre&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;textblob&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextBlob&lt;/span&gt;
  260.  
  261.  
  262. &lt;span class="n"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;typo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_random_name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  263. &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;TextBlob&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  264. &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
  265. &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
  266. &lt;span class="o"&gt;...&lt;/span&gt;
  267. &lt;/pre&gt;&lt;/div&gt;
  268.  
  269. &lt;p&gt;And the results:&lt;/p&gt;
  270. &lt;pre&gt;▶ python test.py
  271. ORIGIN         TYPO           RESULT         WORKED?
  272. jesus          jess           less           Fail
  273. austin         ausin          austin         Yes!
  274. julian         juluian        julian         Yes!
  275. carter         crarter        charter        Fail
  276. emmett         emett          met            Fail
  277. daniel         daiel          daniel         Yes!
  278. luca           lua            la             Fail
  279. anthony        anthonyh       anthony        Yes!
  280. damian         daiman         cabman         Fail
  281. kevin          keevin         keeping        Fail
  282. Right 40.0% of the time&lt;/pre&gt;
  283.  
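The "Right 40.0% of the time" line is just the share of typos that were mapped back to their origin. A small sketch of that scoring step, assuming a list of (origin, typo) pairs and any corrector function. The names score and known_fixes are hypothetical, and the stand-in corrector is not TextBlob:

```python
def score(pairs, corrector):
    """Return the percentage of (origin, typo) pairs the corrector fixes."""
    hits = sum(1 for origin, typo in pairs if corrector(typo) == origin)
    return 100 * hits / len(pairs)


# Stand-in corrector for illustration only; a real run would call
# str(TextBlob(typo).correct()) here instead.
known_fixes = {"ausin": "austin", "daiel": "daniel"}
pairs = [
    ("austin", "ausin"),
    ("daniel", "daiel"),
    ("luca", "lua"),
    ("jesus", "jess"),
]
pct = score(pairs, lambda typo: known_fixes.get(typo, typo))
print("Right {:.1f}% of the time".format(pct))
```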
  284. &lt;p&gt;Buuh! Not very impressive. So what went wrong there? Well, the word &lt;code&gt;met&lt;/code&gt; is much more common than &lt;code&gt;emmett&lt;/code&gt; and the same goes for words like &lt;code&gt;less&lt;/code&gt;, &lt;code&gt;charter&lt;/code&gt;, &lt;code&gt;keeping&lt;/code&gt; etc. You know, because English.&lt;/p&gt;
  285. &lt;h3&gt;The solution&lt;/h3&gt;
  286. &lt;p&gt;The solution is actually really simple. You just crack open the classes out of &lt;code&gt;textblob&lt;/code&gt; like this:&lt;/p&gt;
  287. &lt;div class="highlight"&gt;
  288.  
  289. &lt;pre&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;textblob&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TextBlob&lt;/span&gt;
  290. &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;textblob.en&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Spelling&lt;/span&gt;
  291.  
  292. &lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;spelling-model.txt&amp;quot;&lt;/span&gt;
  293. &lt;span class="n"&gt;spelling&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Spelling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  294. &lt;span class="c1"&gt;# Here, &amp;#39;names&amp;#39; is a list of all the 1,000 correctly spelled names.&lt;/span&gt;
  295. &lt;span class="c1"&gt;# e.g. [&amp;#39;Liam&amp;#39;, &amp;#39;Noah&amp;#39;, &amp;#39;William&amp;#39;, &amp;#39;James&amp;#39;, ...&lt;/span&gt;
  296. &lt;span class="n"&gt;spelling&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;&amp;quot; &amp;quot;&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  297. &lt;/pre&gt;&lt;/div&gt;
  298.  
  299. &lt;p&gt;Now, instead of &lt;code&gt;result = str(TextBlob(typo).correct())&lt;/code&gt; we do &lt;code&gt;result = spelling.suggest(typo)[0][0]&lt;/code&gt; as demonstrated here:&lt;/p&gt;
  300. &lt;div class="highlight"&gt;
  301.  
  302. &lt;pre&gt;&lt;span class="n"&gt;correct&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;typo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_random_name&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  303. &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spelling&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;suggest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;typo&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  304. &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  305. &lt;span class="n"&gt;right&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;correct&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
  306. &lt;span class="o"&gt;...&lt;/span&gt;
  307. &lt;/pre&gt;&lt;/div&gt;
  308.  
  309. &lt;p&gt;So, let's compare the two "side by side" and see how this works out. Here's the output of running with 20 randomly selected names:&lt;/p&gt;
  310. &lt;pre&gt;▶ python test.py
  311. UNTRAINED...
  312. ORIGIN         TYPO           RESULT         WORKED?
  313. juan           jaun           juan           Yes!
  314. ethan          etha           the            Fail
  315. bryson         brysn          bryan          Fail
  316. hudson         hudsn          hudson         Yes!
  317. oliver         roliver        oliver         Yes!
  318. ryan           rnyan          ran            Fail
  319. cameron        caeron         carron         Fail
  320. christopher    hristopher     christopher    Yes!
  321. elias          leias          elias          Yes!
  322. xavier         xvaier         xvaier         Fail
  323. justin         justi          just           Fail
  324. leo            lo             lo             Fail
  325. adrian         adian          adrian         Yes!
  326. jonah          ojnah          noah           Fail
  327. calvin         cavlin         calvin         Yes!
  328. jose           joe            joe            Fail
  329. carter         arter          after          Fail
  330. braxton        brxton         brixton        Fail
  331. owen           wen            wen            Fail
  332. thomas         thoms          thomas         Yes!
  333. Right 40.0% of the time
  334.  
  335. TRAINED...
  336. ORIGIN         TYPO           RESULT         WORKED?
  337. landon         landlon        landon         Yes
  338. sebastian      sebstian       sebastian      Yes
  339. evan           ean            ian            Fail
  340. isaac          isaca          isaac          Yes
  341. matthew        matthtew       matthew        Yes
  342. waylon         ywaylon        waylon         Yes
  343. sebastian      sebastina      sebastian      Yes
  344. adrian         darian         damian         Fail
  345. david          dvaid          david          Yes
  346. calvin         calivn         calvin         Yes
  347. jose           ojse           jose           Yes
  348. carlos         arlos          carlos         Yes
  349. wyatt          wyatta         wyatt          Yes
  350. joshua         jsohua         joshua         Yes
  351. anthony        antohny        anthony        Yes
  352. christian      chrisian       christian      Yes
  353. tristan        tristain       tristan        Yes
  354. theodore       therodore      theodore       Yes
  355. christopher    christophr     christopher    Yes
  356. joshua         oshua          joshua         Yes
  357. Right 90.0% of the time&lt;/pre&gt;
  358.  
  359. &lt;p&gt;See, with very little effort you can go from 40% correct to 90% correct.&lt;/p&gt;
  360. &lt;p&gt;Note, that the output of something like &lt;code&gt;spelling.suggest('darian')&lt;/code&gt; is actually a list like this: &lt;code&gt;[('damian', 0.5), ('adrian', 0.5)]&lt;/code&gt; and you can use that in your application. For example:&lt;/p&gt;
  361. &lt;pre&gt;&amp;lt;li&amp;gt;&amp;lt;a href=&amp;quot;?name=damian&amp;quot;&amp;gt;Did you mean &amp;lt;b&amp;gt;damian&amp;lt;/b&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;
  362. &amp;lt;li&amp;gt;&amp;lt;a href=&amp;quot;?name=adrian&amp;quot;&amp;gt;Did you mean &amp;lt;b&amp;gt;adrian&amp;lt;/b&amp;gt;&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt;&lt;/pre&gt;
  363.  
  364. &lt;h3&gt;Bonus and conclusion&lt;/h3&gt;
  365. &lt;p&gt;Ultimately, what &lt;code&gt;TextBlob&lt;/code&gt; does is a re-implementation of &lt;a href="http://www.norvig.com/spell-correct.html"&gt;Peter Norvig's original implementation from 2007&lt;/a&gt;. I, too, &lt;a href="/plog/spellcorrector"&gt;wrote my own implementation in 2007&lt;/a&gt;. Depending on your needs, you can figure out the licensing of that source code, lift it out, and adapt it in your own custom ways. But &lt;code&gt;TextBlob&lt;/code&gt; wraps it up nicely for you.&lt;/p&gt;
  366. &lt;p&gt;When you use the &lt;code&gt;textblob.en.Spelling&lt;/code&gt; class you have some choices. First, like I did in my demo:&lt;/p&gt;
  367. &lt;div class="highlight"&gt;
  368.  
  369. &lt;pre&gt;&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;&amp;quot;spelling-model.txt&amp;quot;&lt;/span&gt;
  370. &lt;span class="n"&gt;spelling&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Spelling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  371. &lt;span class="n"&gt;spelling&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_space_separated_text_blob&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  372. &lt;/pre&gt;&lt;/div&gt;
  373.  
  374. &lt;p&gt;What that does is &lt;em&gt;create&lt;/em&gt; a file &lt;code&gt;spelling-model.txt&lt;/code&gt; that wasn't there before. It looks like this (in my demo):&lt;/p&gt;
  375. &lt;pre&gt;▶ head spelling-model.txt
  376. aaron 1
  377. abel 1
  378. adam 1
  379. adrian 1
  380. aiden 1
  381. alexander 1
  382. andrew 1
  383. angel 1
  384. anthony 1
  385. asher 1&lt;/pre&gt;
  386.  
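A file in this shape can also be produced with the standard library alone. The sketch below assumes the format is one lowercase "word count" pair per line, sorted alphabetically, as in the head output above. It is not TextBlob's own code, and model_lines is a hypothetical name:

```python
import re
from collections import Counter


def model_lines(text):
    """Count lowercase words in `text` and format them as 'word count' lines."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return ["{} {}".format(word, count) for word, count in sorted(counts.items())]


# A made-up sample; the demo in the post feeds in all 1,000 names.
names = ["Liam", "Noah", "William", "James", "Liam"]
print("\n".join(model_lines(" ".join(names))))
```

Writing those lines to a file by hand is one way to control the counts yourself instead of deriving them from a text.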
  387. &lt;p&gt;The number (on the right) there is the "frequency" of the word. But what if you have a "scoring" number of your own? Perhaps, in your application, you just know that &lt;code&gt;adrian&lt;/code&gt; is more right than &lt;code&gt;damian&lt;/code&gt;. Then, you can make your own file:&lt;/p&gt;
  388. &lt;p&gt;Suppose the text file ("spelling-model-weighted.txt") contains lines like this:&lt;/p&gt;
  389. &lt;pre&gt;...
  390. adrian 8
  391. damian 3
  392. ...&lt;/pre&gt;
  393.  
  394. &lt;p&gt;Now, the output becomes:&lt;/p&gt;
  395. &lt;pre&gt;&amp;gt;&amp;gt;&amp;gt; import os
  396. &amp;gt;&amp;gt;&amp;gt; from textblob.en import Spelling
  397. &amp;gt;&amp;gt;&amp;gt; import os
  398. &amp;gt;&amp;gt;&amp;gt; path = &amp;quot;spelling-model-weighted.txt&amp;quot;
  399. &amp;gt;&amp;gt;&amp;gt; assert os.path.isfile(path)
  400. &amp;gt;&amp;gt;&amp;gt; spelling = Spelling(path=path)
  401. &amp;gt;&amp;gt;&amp;gt; spelling.suggest(&amp;#x27;darian&amp;#x27;)
  402. [(&amp;#x27;adrian&amp;#x27;, 0.7272727272727273), (&amp;#x27;damian&amp;#x27;, 0.2727272727272727)]&lt;/pre&gt;
  403.  
  404. &lt;p&gt;Based on the weighting, these numbers add up. I.e. 3 / (3 + 8) == 0.2727272727272727&lt;/p&gt;
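The arithmetic can be checked in a few lines. This is a sketch of the normalization only, assuming each candidate's score is its weight divided by the sum of all candidate weights. The function name normalize is hypothetical, and this is not TextBlob's implementation:

```python
def normalize(weights):
    """Turn raw candidate weights into scores that sum to 1, highest first."""
    total = sum(weights.values())
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return [(word, weight / total) for word, weight in ranked]


# Weights taken from the spelling-model-weighted.txt example above.
print(normalize({"adrian": 8, "damian": 3}))
```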
  405. &lt;p&gt;I hope it inspires you to write your own spelling application using &lt;code&gt;TextBlob&lt;/code&gt;.&lt;/p&gt;
  406. &lt;p&gt;For example, you can feed it the names of your products on an e-commerce site. The &lt;code&gt;.txt&lt;/code&gt; file might bloat if you have too much, but note that the 30K-line &lt;code&gt;en-spelling.txt&lt;/code&gt; is only 314KB and it loads in...:&lt;/p&gt;
  407. &lt;pre&gt;&amp;gt;&amp;gt;&amp;gt; from textblob import TextBlob
  408. &amp;gt;&amp;gt;&amp;gt; from time import perf_counter
  409. &amp;gt;&amp;gt;&amp;gt; b = TextBlob(&amp;quot;I havv goood speling!&amp;quot;)
  410. &amp;gt;&amp;gt;&amp;gt; t0 = perf_counter(); right = b.correct() ; t1 = perf_counter()
  411. &amp;gt;&amp;gt;&amp;gt; t1 - t0
  412. 0.07055813199999861&lt;/pre&gt;
  413.  
  414. &lt;p&gt;...70ms for 30,000 words.&lt;/p&gt;</description><pubDate>Fri, 23 Aug 2019 14:52:47 +0000</pubDate><guid>https://www.peterbe.com/plog/train-your-own-spell-corrector-with-textblob</guid></item><item><title>function expandFiles(directoriesPatternsOrFiles)</title><link>https://www.peterbe.com/plog/function-expandfiles</link><description>expandFiles is a function that is useful for clis that finds files</description><pubDate>Thu, 15 Aug 2019 14:10:59 +0000</pubDate><guid>https://www.peterbe.com/plog/function-expandfiles</guid></item><item><title>A React vs. Preact case study for a widget</title><link>https://www.peterbe.com/plog/react-vs-preact-case-study-for-a-widget</link><description>tl;dr; The previous (React) total JavaScript bundle size was: 36.2K Brotli compressed. The new (Preact) JavaScript bundle size was: 5.9K. I.e. 6 times smaller. Also, it appears to load faster in WebPageTest.</description><pubDate>Wed, 24 Jul 2019 15:44:59 +0000</pubDate><guid>https://www.peterbe.com/plog/react-vs-preact-case-study-for-a-widget</guid></item></channel></rss>

Copyright © 2002-9 Sam Ruby, Mark Pilgrim, Joseph Walton, and Phil Ringnalda