A Search Engine for Japanese Retail
Miquel Puig examines the challenges of building a search engine for non-western languages with insights from a recent Japanese project.
As part of our partnership with Aeon in Japan, we are providing an online shopping solution. While for some teams a new retail partner might only require a change in configuration or deployment pipelines, for the search teams, it's a huge task if it involves an entirely new language, let alone one as different from the ones we already support as Japanese.
At first glance, this might not seem like a big issue. It should be just a matter of using that language in the search engine, right? Well, there's actually much more to it than that. In this post, I'll try to explain the different pieces we had to put together to make our search solution work in Japanese.
The Japanese language
To understand the different problems we had to solve, we need to take a quick dive into the Japanese language, particularly two characteristics that make it very different from the many European languages we've dealt with so far.
Japanese is written in four different scripts: Hiragana (ひらがな), Katakana (カタカナ), Kanji (漢字), and Rōmaji. Hiragana and Katakana are syllabaries: each symbol represents a syllable (typically a consonant + vowel combination) and carries no meaning on its own. Hiragana is used to write some native words and grammatical functions, and to help with reading lesser-known Kanji or in learners' texts. Katakana is used to write some native words, to transliterate foreign words (or Japanese words of foreign origin), and for emphasis. Kanji, originally borrowed from Chinese, is a vast collection of ideograms in which each symbol carries meaning and can be read in different ways depending on the context. Finally, Rōmaji is not a script of its own but simply the practice of writing Japanese in the Latin script, which is common in some contexts.
Beyond having four different scripts, Japanese has important usage considerations. A word can potentially be written in any of the four scripts, usually with one form being more common than the others, but all still being equally valid.
Take for instance the word for “can” (as in the container for food or beverages). In Japanese the word is “kan”, and it would usually be written as 缶 (in kanji), but it could also be written as かん / カン (in hiragana / katakana), or even “kan” (in rōmaji) in some contexts.
Moreover, a typical Japanese sentence will contain a mixture of the different scripts, as can be seen in the following example of a product name:
In this example, the part in blue is written in katakana and reads “toppu baryu besuto puraisu” and is a brand name, an adaptation of the English “top value best price”, the part in red is a place name (Furano, a town in the island of Hokkaido) and is written in kanji, the part in purple is a combination of kanji and katakana “nama bīru” (draft beer) and finally the part in yellow is just Latin script and Arabic numerals.
We wanted to be able to reuse (at least parts of) our current solution, so our design idea from the beginning was to create a module that applied a set of steps to our inputs so that they could be fed to our existing search solution with minimal changes to it.
Let’s see the different steps that we had to implement in order for our search solution to work in Japanese.
Single/Double byte characters
In the context of Japanese writing, single-byte and double-byte refer to the number of bytes that were needed to represent a character in legacy computer encodings. Katakana, Latin letters, numbers and some symbols have narrow “half-width” forms that could historically be stored in a single byte, whereas Kanji consists of thousands of characters and has always required more than one byte per character.
Have a look at the following two strings:
They might look like the same text at different font sizes, but they are actually the same font size and represent the same text; the difference is which version of each character is used: half-width or full-width.
This might not seem like a huge problem, but search engines are based on text matching, which does not work across characters of different widths: even if they look similar, they are represented by different values. In the following table you can see each string and its translation to Unicode code points, which are clearly different.
In practice, Kanji and hiragana only have full-width versions, but the other scripts (Latin letters, numbers and symbols, as well as katakana) have both versions. Users can potentially type using either width, and the two should be recognised as equal.
In order to fix this, our solution was to introduce a filter that normalises each character into its “preferred” representation. Our decision for the preferred representation was full-width (double-byte) for the “native” Japanese scripts (Kanji, hiragana and katakana) and half-width (single-byte) for the rest.
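As it happens, this choice of preferred widths matches Unicode's NFKC compatibility normalisation, so a minimal stand-in for such a filter (a sketch, not our production code) can be written in a few lines of Python:

```python
import unicodedata

def normalise_width(text: str) -> str:
    """Map each character to its preferred width: full-width for
    katakana, half-width for Latin letters, digits and symbols.
    Kanji and hiragana only exist in full-width form, so they
    pass through unchanged."""
    return unicodedata.normalize("NFKC", text)

# Full-width Latin letters and digits become half-width...
print(normalise_width("ＴＯＰ１２３"))  # -> TOP123
# ...while half-width katakana becomes full-width.
print(normalise_width("ﾋﾞｰﾙ"))        # -> ビール
```

NFKC also recombines half-width katakana voicing marks (ﾋ + ﾞ) into the single full-width character (ビ), which is exactly the behaviour we want before text matching.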
Multiple scripts
As we saw in a previous example, a single word may be written in any of the scripts. These differently written words need to be treated as equal, even if their internal representation as strings is different, just as in the single/double-byte case.
In the image we can see a potential example of this case: the catalogue information provided by the retailer might contain a “milk” product written in its kanji form (牛乳), but the user might choose to search for it in hiragana (ぎゅうにゅう). These are two completely different strings as far as the search engine is concerned, but we need a solution to make them match.
In order to better understand our solution to this problem, let’s first look at how the different scripts relate to each other:
Both kana syllabaries (hiragana and katakana) have a direct 1:1 mapping with each other (eg. hiragana む and katakana ム are equivalent and both represent the syllable “mu”).
Each kana symbol also has a 1:1 mapping to a Latin representation, provided the same transliteration rules are applied in both directions (eg. as before, む / ム and “mu”).
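The 1:1 mapping between the two kana syllabaries is even visible in Unicode itself: the hiragana and katakana blocks are laid out in parallel, a fixed code-point offset apart. A minimal sketch of the conversion (standard kana only, plain Python):

```python
# The katakana block starts a fixed offset after the hiragana block.
KANA_OFFSET = 0x30A1 - 0x3041

def hiragana_to_katakana(text: str) -> str:
    """Shift each hiragana character (U+3041..U+3096) to its
    katakana equivalent; leave everything else untouched."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

print(hiragana_to_katakana("ぎゅうにゅう"))  # -> ギュウニュウ
print(hiragana_to_katakana("むMU"))          # -> ムMU (non-kana untouched)
```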
Kanji can be translated to one or more kana symbols that represent how that kanji is read in that context. For example, the kanji for “fish” is 魚 and is read as “sakana”. It can be written in hiragana as さかな, also read as “sakana”.
The most significant problem arises when trying to transliterate from kana to kanji, as Japanese is a highly homophonic language and a given sequence of kana could represent different kanji and mean different things. For example, the kana さ (sa) and け (ke) form さけ (sake) which in Japanese can either mean “alcohol” (and be written 酒) or “salmon” (and be written 鮭). Which meaning is intended by the user depends on the context, which is not available to the search engine.
Our solution to this problem is to convert all strings (both the ones provided by the retailer in the catalogue and the user queries) to a single representation before performing the search. Based on the relationships outlined above, our decision was to reduce everything to katakana, as all other scripts have a 1:1 translation to it (assuming rōmaji is correctly formed). To achieve this, we use a widely adopted library called Kuromoji, which provides the katakana form of an input string.
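Kuromoji is a Java library, so purely to illustrate the reduction step, here is a toy stand-in in Python with a hand-made reading dictionary of a few words (the real dictionary, shipped with Kuromoji, covers the whole language):

```python
KANA_OFFSET = 0x30A1 - 0x3041  # hiragana -> katakana shift

# Toy reading dictionary: kanji word -> katakana reading.
# In production, this lookup is what Kuromoji provides.
READINGS = {"牛乳": "ギュウニュウ", "魚": "サカナ", "缶": "カン"}

def reduce_to_katakana(text: str) -> str:
    """Reduce a string to a single katakana representation:
    known kanji words via the reading dictionary, hiragana
    via the fixed code-point shift, katakana unchanged."""
    if text in READINGS:
        return READINGS[text]
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

# Kanji, hiragana and katakana forms all reduce to the same string.
print(reduce_to_katakana("牛乳"))          # -> ギュウニュウ
print(reduce_to_katakana("ぎゅうにゅう"))  # -> ギュウニュウ
print(reduce_to_katakana("ギュウニュウ"))  # -> ギュウニュウ
```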
Tokenization
When a user enters a search query, text matching is usually not performed on the whole string. Instead, the string is broken down into smaller pieces called tokens. In languages that use a variation of the Latin script, like English, we can simply separate tokens by whitespace. However, Japanese does not use whitespace to separate words, making it more difficult to break a string down into tokens.
To better understand this problem, imagine if English had no whitespace between words. We could have a query like “buttercupcake” that could be interpreted as either “buttercup cake” or “butter cupcake”. Humans can usually determine which one is more likely based on context, but a computer lacks this context. A more realistic example in Japanese is the string 東京都 (read tō-kyō-to), which can be split as either 東京・都 (“tōkyō to”, meaning the Tokyo metropolitan area), or 東・京都 (“tō kyōto”, meaning east Kyoto).
To solve this problem, we need to introduce a dictionary that provides all possible words in the language and the likelihood that any two given tokens appear together. The diagram below illustrates this using the two examples mentioned.
In this image we can see all the possible paths that can be taken in order to split both strings. The important thing to notice here is that the paths between potential tokens have weights, corresponding to the likelihood of these two tokens being one after the other. In the end, the path with the highest accumulated value would be the chosen one, “butter cupcake” in the first example and 東京・都 in the second. The library that we use for transliteration (Kuromoji) also provides us with an implementation of this algorithm.
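The best-path idea can be sketched as a small dynamic program over a weighted dictionary. The words and weights below are made up for illustration, and the scoring is simplified to per-token weights (Kuromoji builds a full lattice with pairwise connection costs):

```python
# Hypothetical dictionary: token -> likelihood weight (made-up values).
WEIGHTS = {"butter": 5.0, "cupcake": 6.0, "buttercup": 4.0,
           "cup": 2.0, "cake": 3.0}

def segment(text, weights):
    """Return the highest-scoring split of `text` into dictionary
    tokens, or None if no complete split exists."""
    # best[i] = (score, tokens) for the best split of text[:i]
    best = [None] * (len(text) + 1)
    best[0] = (0.0, [])
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if word in weights and best[start] is not None:
                score = best[start][0] + weights[word]
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[-1][1] if best[-1] else None

print(segment("buttercupcake", WEIGHTS))  # -> ['butter', 'cupcake']
```

With these weights, the path “butter” + “cupcake” accumulates 11.0, beating both “buttercup” + “cake” (7.0) and “butter” + “cup” + “cake” (10.0), so it is the split the tokenizer returns.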
Synonyms
The last piece of the puzzle for us was making synonyms work in Japanese. Synonyms are a feature we already provided in our search solution, with which a retailer can set up pairs of strings that have a similar meaning but would otherwise not be matched (in English, think for instance of “soda” and “soft drink”, or “cookies” and “biscuits”). With a synonym, the original search term is enriched with the synonym's tokens, thus broadening the search. In Japanese we have a particularly important use case for this feature: concepts that have both a traditional Japanese word and an imported, adapted word (usually from English, and written in katakana).
In the next diagram we can see an example of this. The traditional word for “milk” is, as we have seen before, 牛乳; but there is also the word ミルク (miruku), an adaptation of the English “milk”. If a retailer creates a synonym between 牛乳 and ミルク, it should still work for any of the other possible representations of either word. So if a user searches for ミルク (in katakana), we should still be able to match it with the reduced version of the original name from the catalogue, which is 牛乳 in katakana (ギュウニュウ), as this is what we reduce to (see above).
To achieve this, we only need to reduce the synonym before using it as we did with the search query and the catalogue information. This enables us to maintain the synonym in its original form as introduced by the retailer, allowing them to input synonyms in any representation. Additionally, it allows us to use the full synonym capabilities of our platform while fulfilling the requirement of cross-script matching for Japanese.
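Putting the pieces together, here is a toy sketch of synonym expansion on reduced forms, reusing the hand-made reading dictionary idea from above (the function names are ours for illustration, not the platform's API):

```python
KANA_OFFSET = 0x30A1 - 0x3041
READINGS = {"牛乳": "ギュウニュウ"}  # toy reading dictionary

def reduce_to_katakana(text):
    """Toy reduction: dictionary lookup for kanji words,
    code-point shift for hiragana, katakana unchanged."""
    if text in READINGS:
        return READINGS[text]
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

def expand_query(query, synonym_pairs):
    """Reduce the query and every synonym to katakana, and
    return the full set of reduced terms to search for."""
    terms = {reduce_to_katakana(query)}
    for a, b in synonym_pairs:
        ra, rb = reduce_to_katakana(a), reduce_to_katakana(b)
        if ra in terms:
            terms.add(rb)
        if rb in terms:
            terms.add(ra)
    return terms

# A katakana query for "miruku" now also matches the catalogue's
# kanji form 牛乳, because both sides were reduced first.
synonyms = [("牛乳", "ミルク")]
print(expand_query("ミルク", synonyms))  # -> {'ミルク', 'ギュウニュウ'}
```

Because the synonym pair is reduced at match time, the retailer can enter either side of the pair in any script and the cross-script match still works.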
The main lesson we take away from this development work is that search is very language-dependent, and that non-western languages may work in vastly different ways from what we are used to. We should be prepared for that and not underestimate the effort needed. Having a person with knowledge of the language, even if limited, proved to be a huge advantage for the whole team. We also think we are now in a much better position for further languages, even very different ones, because our platform architecture has been improved with the aim of reusing as much as possible with minimal additional work.
Change your world with us
Across Ocado Technology, we have a diverse, rich mix of teams and expertise working to solve complex problems. Learn more about our full range of opportunities here.