Developing the Corpus of Minangkabau Language: Insights, Challenges, and Future Directions

Handoko Handoko https://orcid.org/0000-0003-2474-3821

Keywords

Minangkabau corpus, language documentation, language preservation, corpus methodology, digital resources

Abstract

This paper discusses the design of a Minangkabau language corpus, with particular attention to the opportunities and challenges involved. Developing such a corpus is a crucial project for documenting, preserving, and revitalizing the cultural heritage embedded in the language. A Minangkabau corpus can open opportunities for more intensive research on the language using modern, data-driven approaches, and it can encourage the development of corpus-based teaching materials. The corpus is assembled manually from a variety of sources through comprehensive data collection, annotation, and curation pipelines. These sources may include written texts, such as manuscripts, books, and newspapers, as well as spontaneous speech, such as interviews or public speeches. Multimedia resources, such as television and radio broadcasts, audio-video recordings, and social media content, add to the diversity of the data. Accessible digital sources, such as online videos, online radio programs, and e-books, can make data collection easier. However, several challenges may arise in developing the corpus, including limited access to technology, dialect variation, and the need for highly skilled human resources. This paper outlines opportunities for developing the Minangkabau language corpus and for strengthening its role in revitalizing and documenting the language. Furthermore, the corpus can serve as a starting point for developing language technology, such as speech recognition, text-to-speech, and natural language processing.
