Character Frequency Analysis of Bosnian + Source + Report
Posted by Ajith Srikukan in CSE, for everyone, Projects on November 9, 2010
This post analyzes character frequency in the Bosnian language.
[Attachments : PDF report + Source code of Bosnian Language Frequency Analyzer]
About Bosnian
Bosnian is an official language of Bosnia and Herzegovina, and more than 2.2 million people around the world speak it.
Like English, it has 5 vowel characters. Together with 25 consonants, that makes 30 characters in total.
Vowels: i , e , a , o , u
Consonants: b , c , č , ć , d , dž , đ , f , g , h , j , k , l , lj , m , n , nj , p , r , s , š , t , v , z , ž
Some characters in this language are written by combining 2 characters. For example, ‘dž’ combines ‘d’ and ‘ž’: ‘d’ and ‘ž’ are characters in their own right, and ‘dž’ is also a character of its own, like any other character in the alphabet.
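Because of these two-character letters, a frequency counter has to decide how to split text into characters before it counts anything. The post does not say how the attached program does this, so the following is only a minimal sketch of one possible approach: a greedy scan that tries the three digraphs first and falls back to single letters. The class and method names (`BosnianTokenizer`, `tokenize`) are hypothetical, the text is assumed to use precomposed characters, and the non-ASCII letters are written as Unicode escapes so the file compiles regardless of source encoding.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not the program attached to the post.
final class BosnianTokenizer {
    // The three digraphs: d + z-caron, lj, nj. Each counts as one character.
    private static final String[] DIGRAPHS = {"d\u017e", "lj", "nj"};

    // The 27 single-character letters, including c-caron (\u010d), c-acute (\u0107),
    // d-stroke (\u0111), s-caron (\u0161) and z-caron (\u017e).
    private static final String SINGLE_LETTERS =
            "abc\u010d\u0107d\u0111efghijklmnoprs\u0161tuvz\u017e";

    // Splits text into Bosnian characters, skipping anything that is not a letter
    // of the alphabet (spaces, digits, punctuation, foreign letters such as q or w).
    static List<String> tokenize(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        List<String> letters = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            String digraph = matchDigraph(s, i);
            if (digraph != null) {
                letters.add(digraph);
                i += 2; // a digraph consumes two chars of input
            } else {
                char c = s.charAt(i);
                if (SINGLE_LETTERS.indexOf(c) >= 0) {
                    letters.add(String.valueOf(c));
                }
                i++;
            }
        }
        return letters;
    }

    // Returns the digraph starting at position i, or null if there is none.
    private static String matchDigraph(String s, int i) {
        for (String d : DIGRAPHS) {
            if (s.startsWith(d, i)) {
                return d;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // "Ljubav" begins with the digraph lj, so it splits into 5 characters, not 6.
        System.out.println(tokenize("Ljubav"));
    }
}
```

A greedy match like this treats every ‘d’ followed by ‘ž’ as the single character ‘dž’. It cannot recognize the occasional word in which the two letters merely stand next to each other, so counts for the digraphs may be slightly overstated.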
Character Frequency Analysis
| Character (alphabetical) | Count | % | Character (sorted by count) | Count | % |
| a | 3928 | 11.57064 | a | 3928 | 11.57064 |
| b | 453 | 1.334394 | i | 3684 | 10.85189 |
| c | 335 | 0.986803 | o | 3068 | 9.037352 |
| č | 376 | 1.107576 | e | 2904 | 8.554259 |
| ć | 209 | 0.615647 | n | 1960 | 5.773536 |
| d | 977 | 2.877931 | j | 1933 | 5.694003 |
| dž | 25 | 0.073642 | s | 1856 | 5.467185 |
| đ | 36 | 0.106045 | k | 1651 | 4.86332 |
| e | 2904 | 8.554259 | r | 1571 | 4.627666 |
| f | 90 | 0.265111 | t | 1356 | 3.994344 |
| g | 494 | 1.455167 | u | 1337 | 3.938376 |
| h | 247 | 0.727583 | v | 1239 | 3.649699 |
| i | 3684 | 10.85189 | m | 1020 | 3.004595 |
| j | 1933 | 5.694003 | d | 977 | 2.877931 |
| k | 1651 | 4.86332 | l | 878 | 2.586309 |
| l | 878 | 2.586309 | p | 736 | 2.168022 |
| lj | 148 | 0.435961 | z | 686 | 2.020738 |
| m | 1020 | 3.004595 | g | 494 | 1.455167 |
| n | 1960 | 5.773536 | b | 453 | 1.334394 |
| nj | 267 | 0.786497 | č | 376 | 1.107576 |
| o | 3068 | 9.037352 | c | 335 | 0.986803 |
| p | 736 | 2.168022 | š | 325 | 0.957347 |
| r | 1571 | 4.627666 | nj | 267 | 0.786497 |
| s | 1856 | 5.467185 | h | 247 | 0.727583 |
| š | 325 | 0.957347 | ć | 209 | 0.615647 |
| t | 1356 | 3.994344 | ž | 159 | 0.468363 |
| u | 1337 | 3.938376 | lj | 148 | 0.435961 |
| v | 1239 | 3.649699 | f | 90 | 0.265111 |
| z | 686 | 2.020738 | đ | 36 | 0.106045 |
| ž | 159 | 0.468363 | dž | 25 | 0.073642 |
Table 1 (left three columns): Character frequency output. Table 2 (right three columns): Character frequency sorted output.
The first table shows the character frequency analysis of the Bosnian language as produced by the program, in alphabetical order. The second table shows the same results sorted from the highest to the lowest frequency.
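The two tables boil down to three steps: count each character, convert counts to percentages of the total, and sort by count in descending order. The following is a hedged sketch of those steps, not the attached program itself. The class name `FrequencyTable` and its methods are hypothetical, and the input is assumed to be a list of already separated characters, with digraphs such as "lj" as single entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper illustrating the count / percentage / sort steps.
final class FrequencyTable {
    // Counts how often each character occurs. A TreeMap keeps keys in code-point
    // order, which is close to, but not exactly, Bosnian alphabetical order.
    static Map<String, Integer> count(List<String> letters) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String letter : letters) {
            counts.merge(letter, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the rows sorted from highest to lowest count. List.sort is stable,
    // so characters with equal counts keep their original relative order.
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return rows;
    }

    // Share of one character in the whole text, in percent.
    static double percent(int count, int total) {
        return 100.0 * count / total;
    }

    public static void main(String[] args) {
        // Tiny made-up sample; a real run would use the characters of a Bosnian text.
        List<String> letters = List.of("a", "b", "a", "n", "a", "n");
        Map<String, Integer> counts = count(letters);
        for (Map.Entry<String, Integer> row : sortedByCount(counts)) {
            System.out.printf(Locale.ROOT, "%s | %d | %.2f%n",
                    row.getKey(), row.getValue(), percent(row.getValue(), letters.size()));
        }
    }
}
```

As a check against Table 1, the counts in the table add up to 33948 characters, so `percent(3928, 33948)` reproduces the 11.57064 reported for ‘a’.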
Figure 1 Character frequency analysis of Bosnian – bar chart – count vs. characters
If we analyze the second table, we can see that the most used characters out of the 30 are a, i, o and e, which are all vowels. This shows that words in this language are built largely from vowels. The fifth vowel, ‘u’, is used less often than some consonants: ‘n’, ‘j’, ‘s’, ‘k’, ‘r’ and ‘t’. Like most other languages, including English, this language uses vowel characters more frequently than most consonants.
If we consider the first 19 places, every character in them is also a character of the English alphabet, while most of the characters specific to Bosnian rank lower.
Figure 2 Character frequency analysis of Bosnian – pie chart
The pie chart above shows the overall character distribution of the Bosnian language. In this chart, all characters are arranged in ascending order of percentage frequency.
If we add up the percentages, we can conclude that about 44% of the characters used in this language are vowels and the remaining 56% are consonants. In other words, 17% of the character set (the 5 vowels out of 30 characters) takes up 44% of the space when we construct words and sentences in this language.
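The 44% and 17% figures can be reproduced directly from the counts in Table 1. The sketch below does that arithmetic; the class name `VowelShare` and its method are hypothetical, and the total of 33948 is simply the sum of all 30 counts in the table.

```java
import java.util.Locale;

// Hypothetical helper that re-derives the vowel share from the Table 1 counts.
final class VowelShare {
    // Share of the given counts in the total, in percent.
    static double sharePercent(int[] counts, int total) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return 100.0 * sum / total;
    }

    public static void main(String[] args) {
        // Counts for a, e, i, o, u taken from Table 1.
        int[] vowelCounts = {3928, 2904, 3684, 3068, 1337};
        int totalCharacters = 33948; // sum of all 30 counts in Table 1

        double vowelsInText = sharePercent(vowelCounts, totalCharacters);
        double vowelsInAlphabet = 100.0 * 5 / 30; // 5 vowels out of 30 characters

        System.out.printf(Locale.ROOT, "Vowels in text: %.2f%%%n", vowelsInText);
        System.out.printf(Locale.ROOT, "Vowels in alphabet: %.2f%%%n", vowelsInAlphabet);
    }
}
```

The vowel counts add up to 14921, so the vowel share of the text is 14921 / 33948, or roughly 43.95%, which rounds to the 44% quoted above. The share of vowels in the alphabet is 5 / 30, or roughly 16.67%.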
About the source code: I developed the program in Java using NetBeans. The package also contains some sample Bosnian text files that were used to analyze the character frequency of Bosnian with this program.







