What is corpus linguistics?
Just a few years ago, linguists could only dream of using computers in their research. The work was done by hand, drew on large numbers of students, was highly prone to "human factor" errors, and, most importantly, took a very long time.
With the development of computer technology it became possible to carry out research an order of magnitude faster, and today one of the most promising directions in the study of language is corpus linguistics. Its main feature is the use of a large amount of textual information gathered into a single database and marked up in a special way. Such an annotated collection is called a corpus.
To date, many corpora have been created for different purposes and from different language material, ranging in size from millions to tens of millions of lexical units. The approach is recognized as promising and shows tangible progress toward both applied and research goals. Specialists who work with natural language in one way or another are advised to get to know text corpora at least at a basic level.
History of corpus linguistics
The emergence of the field is associated with the creation of the Brown Corpus in the United States in the early 1960s. That collection contained 1 million words, and today a corpus of this size would be entirely insufficient. This is largely due to the pace at which computer technology has developed, as well as the growing demand for new research resources.
In the 1990s corpus linguistics took shape as a full-fledged, independent discipline, and text collections were compiled and annotated for many languages. During this period, for example, the British National Corpus, with 100 million words, was created.
As the field develops, text volumes keep growing (reaching billions of lexical units), and annotation becomes more varied. Today the Internet offers corpora of written and spoken language, multilingual corpora, corpora of fiction or academic literature, and many other kinds.
Types of corpora
Corpora can be classified on several grounds. Intuitively, the basis for classification may be the language of the texts (Russian, German), the mode of access (open source, closed, commercial), or the genre of the source material (fiction, documentary, academic, journalistic).
Spoken-language material is produced in an interesting way. Deliberately recording such speech creates an artificial setting for the respondents, and the resulting material can hardly be called "spontaneous", so modern corpus linguistics takes a different route. A volunteer is fitted with a microphone, and all the conversations they take part in during the day are recorded. The people around them, of course, may not suspect that their everyday conversation is contributing to the development of science.
The recordings are then stored in a database and accompanied by a typed transcript. This makes it possible to annotate them and build a corpus of everyday spoken language.
Applications
Wherever language is used, text corpora can be used as well. The purposes of using corpora in linguistics include:
- Building sentiment-analysis programs, used in politics and business to track positive and negative feedback from voters and customers, respectively.
- Integrating corpus data into dictionaries and translation systems to improve their performance.
- Various research tasks that contribute to understanding the structure of a language, the history of its development, and forecasts of its changes in the near future.
- Developing information retrieval systems based on morphological, syntactic, semantic, and other features.
- Optimizing various linguistic systems, and so on.
Using corpora
The interface of such a resource resembles an ordinary search engine: it prompts the user to enter a word or phrase to look up in the database. Besides an exact-form query, the user can run an advanced search, which makes it possible to find linguistic information by almost any criterion.
The search can be based on:
- membership in a particular part of speech;
- grammatical features;
- semantics;
- stylistic and emotional coloring.
Search criteria can also be combined over a sequence of words. For example, you can find all occurrences of a verb in the present tense, first person singular, followed by the preposition "in" and a noun in the accusative case. Solving such a task takes the user a few seconds and only a few mouse clicks in the corresponding fields.
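The combined query described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular corpus engine works: the `Token` class, the tag names ("VERB", "pres", "acc" and so on), and the `find_sequences` helper are all hypothetical, and the Russian preposition "в" ("in") stands in for the preposition in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str                       # surface form as it appears in the text
    pos: str                        # part-of-speech label (hypothetical tagset)
    feats: frozenset = frozenset()  # grammatical features, e.g. {"pres", "1", "sg"}


def matches(token, pattern):
    """Check one token against one pattern (a dict of optional constraints)."""
    if "form" in pattern and token.form.lower() != pattern["form"]:
        return False
    if "pos" in pattern and token.pos != pattern["pos"]:
        return False
    # Every feature the pattern asks for must be present on the token.
    return set(pattern.get("feats", ())) <= token.feats


def find_sequences(tokens, patterns):
    """Return start positions where consecutive tokens satisfy all patterns."""
    width = len(patterns)
    return [
        start
        for start in range(len(tokens) - width + 1)
        if all(matches(tokens[start + k], patterns[k]) for k in range(width))
    ]


# A tiny hand-annotated sentence: "я иду в школу" ("I am going to school").
tokens = [
    Token("я", "PRON", frozenset({"1", "sg", "nom"})),
    Token("иду", "VERB", frozenset({"pres", "1", "sg"})),
    Token("в", "ADP"),
    Token("школу", "NOUN", frozenset({"acc", "sg"})),
]

# Verb (present, 1st person singular) + preposition "в" + noun in the accusative.
query = [
    {"pos": "VERB", "feats": {"pres", "1", "sg"}},
    {"form": "в", "pos": "ADP"},
    {"pos": "NOUN", "feats": {"acc"}},
]

hits = find_sequences(tokens, query)
print(hits)
```

A real corpus holds far too many tokens for a linear scan like this, so production systems rely on indexes. The sketch only shows how constraints on part of speech, grammatical features, and word form combine over neighboring words.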
The search itself can run over all subcorpora or over a single, specifically selected one, depending on what the task requires.
The creation process
A corpus is built in several steps:
- The first step is to decide which texts will form the basis of the corpus. For applied purposes, journalistic texts, news stories, and Internet comments are used most often. Research projects draw on a wide variety of sources, but the texts must be selected according to some common criterion.
- The collected texts undergo preprocessing: errors, if any, are corrected, and a bibliographic and extralinguistic description of each text is prepared.
- All non-textual information is removed: drawings, images, and tables are deleted.
- The text is tokenized, that is, split into units (usually words) for further processing.
- Finally, the resulting set of tokens receives morphological, syntactic, and other annotation.
The result of all these operations is a syntactic structure with a set of elements distributed over it, each of which has an assigned part of speech, grammatical features, and in some cases semantic features.
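The cleaning and tokenization steps above can be sketched as follows. This is a minimal sketch under several assumptions: non-textual material is taken to survive only as bracketed placeholders such as "[Figure 1]", the tokenizer is a plain regular expression, and the function names and the shape of the resulting record are hypothetical.

```python
import re
import unicodedata


def clean_text(raw):
    """Normalize the text and drop placeholders left by figures and tables."""
    text = unicodedata.normalize("NFC", raw)
    # Assumption: removed images and tables left markers like "[Figure 1]".
    text = re.sub(r"\[(?:figure|table|image)[^\]]*\]", " ", text, flags=re.IGNORECASE)
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_record(raw, metadata):
    """Bundle the extralinguistic description with the tokenized text."""
    return {"meta": dict(metadata), "tokens": tokenize(clean_text(raw))}


record = build_record(
    "Hello,  world! [Figure 1]",
    {"genre": "news", "year": 2017},  # illustrative metadata only
)
print(record["tokens"])
```

Morphological and syntactic annotation would then be attached to each token in `record["tokens"]`. That step is left out here because it depends on the tagger being used.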
Difficulties in creating corpora
It is important to understand that a set of words or sentences alone is not enough to make a corpus. On the one hand, the collection of texts must be balanced, that is, it must represent different types of texts in certain proportions. On the other hand, its contents must be annotated in a special way.
The first problem is solved by convention: for example, a collection might comprise 60% fiction and 20% documentary texts, with some share allotted to transcribed spoken language, legal texts, scientific works, and so on. No ideal recipe for a balanced corpus exists today.
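Checking a collection against agreed proportions is simple arithmetic, sketched below. The target shares are the 60% and 20% from the example above. The document sizes, the genre labels, and the function names are made up for illustration.

```python
from collections import Counter


def genre_shares(documents):
    """Compute each genre's share of the corpus, measured in tokens.

    `documents` is an iterable of (genre, token_count) pairs.
    """
    totals = Counter()
    for genre, n_tokens in documents:
        totals[genre] += n_tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {genre: count / grand_total for genre, count in totals.items()}


def deviations(shares, targets):
    """Difference between actual and target share for every target genre."""
    return {genre: shares.get(genre, 0.0) - target for genre, target in targets.items()}


# Hypothetical collection: (genre, number of tokens in the document).
documents = [
    ("fiction", 400),
    ("fiction", 200),
    ("documentary", 250),
    ("spoken", 150),
]
targets = {"fiction": 0.60, "documentary": 0.20}

shares = genre_shares(documents)
for genre, delta in deviations(shares, targets).items():
    print(f"{genre}: {shares.get(genre, 0.0):.0%} of corpus, {delta:+.0%} vs target")
```

Shares are measured in tokens rather than in documents, because a handful of long novels can outweigh thousands of short news items.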
The second question, concerning the annotation of content, is harder to solve. There are special programs and algorithms for automatic text tagging, but they do not give a perfect result, can introduce errors, and require manual correction. The opportunities and challenges of this problem are described in detail in V. P. Zakharov's work on corpus linguistics.
Text annotation is carried out at several levels, which we list below.
Morphological tagging
From school we remember that Russian has different parts of speech, each with its own features. For example, a verb has the categories of mood and tense, which a noun lacks. A native speaker declines nouns and conjugates verbs without hesitation, but annotating a corpus of 100 million words by hand is not feasible. A computer can carry out all the necessary operations, but it first has to be taught how.
For morphological tagging, the computer must "understand" each word as a particular part of speech with particular grammatical features. Since Russian (like any other language) follows a number of regular rules, it is possible to build an automatic procedure for morphological analysis by giving the machine a set of algorithms. However, there are exceptions to the rules, as well as various complicating factors. As a result, purely automatic analysis is still far from ideal: even a 4% error rate yields 4 million words in a corpus of 100 million units that require manual correction.
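The arithmetic behind that figure is worth spelling out, since it shows why a tagger with a small error rate still leaves a large manual workload. The function name below is hypothetical. The corpus size and error rate are the ones quoted above.

```python
def expected_corrections(corpus_size, error_rate):
    """Estimate how many units an automatic tagger labels incorrectly."""
    return round(corpus_size * error_rate)


manual_fixes = expected_corrections(100_000_000, 0.04)
print(f"{manual_fixes:,} units need manual review")
```

The same function shows how the workload scales: halving the error rate halves the number of units to review, and a corpus ten times larger needs ten times as many corrections at the same rate.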
The problem is described in detail in V. P. Zakharov's book "Corpus Linguistics".
Syntactic tagging
Syntactic annotation, or parsing, is the procedure that determines the relations between words in a sentence. With a set of algorithms it is possible to identify the subject, predicate, objects, and various modifiers in a text. Knowing which words in a sequence are heads and which are dependents, we can extract information from text effectively and teach the machine to return, in response to a search query, only the information we are interested in.
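One common way to store head-dependent relations is to record, for every word, the index of its head and the name of the relation. The sketch below assumes such a representation. The sentence is annotated by hand, and the relation labels and helper functions are hypothetical rather than taken from any particular parser.

```python
# Each token: (index, form, head index, relation); head index 0 marks the root.
sentence = [
    (1, "The", 2, "det"),
    (2, "apple", 3, "subject"),
    (3, "contains", 0, "root"),
    (4, "many", 5, "modifier"),
    (5, "calories", 3, "object"),
]


def find_root(tokens):
    """Return the token that has no head (the main predicate)."""
    return next(tok for tok in tokens if tok[2] == 0)


def dependents_of(tokens, head_index, relation=None):
    """Return tokens attached to the given head, optionally filtered by relation."""
    return [
        tok
        for tok in tokens
        if tok[2] == head_index and (relation is None or tok[3] == relation)
    ]


def core_triple(tokens):
    """Extract (predicate, subject, object) word forms; missing parts are None."""
    root = find_root(tokens)
    subjects = dependents_of(tokens, root[0], "subject")
    objects = dependents_of(tokens, root[0], "object")
    return (
        root[1],
        subjects[0][1] if subjects else None,
        objects[0][1] if objects else None,
    )


print(core_triple(sentence))
```

Once the predicate, subject, and object are known, a system can discard the rest of the sentence and keep only the facts that answer a query.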
This is how modern search engines manage to return specific figures instead of long texts in response to queries such as "how many calories are in an apple" or "distance from Moscow to St. Petersburg". To understand the basics of this process, however, you will need to consult "Introduction to Corpus Linguistics" or another introductory textbook.
Semantic tagging
The semantics of a word is, put simply, its meaning. A widely used approach to semantic analysis is to assign each word tags that reflect its membership in a set of semantic categories and subcategories. Such information is valuable for improving algorithms for sentiment analysis, automatic summarization, and other tasks that rely on corpus linguistics.
There are a number of tree "roots", which are words with very broad semantics. As the tree branches, it forms nodes containing more and more specific lexical items. For example, the word "being" is linked to nodes such as "human" and "animal". The first goes on to branch into various professions, kinship terms, and nationalities, and the second into classes and species of animals.
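A small version of such a tree can be stored as a mapping from each node to its parent. The sketch below follows the "being" example. The specific leaf nodes ("teacher", "wolf" and so on) and the function names are made up for illustration.

```python
# Child -> parent links of a tiny semantic tree rooted at "being".
PARENT = {
    "human": "being",
    "animal": "being",
    "profession": "human",
    "kinship term": "human",
    "nationality": "human",
    "mammal": "animal",
    "bird": "animal",
    "teacher": "profession",
    "wolf": "mammal",
}


def semantic_path(word):
    """Return the chain of categories from the root down to the word."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path[::-1]


def shares_category(first, second, category):
    """True if both words fall under the given semantic category."""
    return category in semantic_path(first) and category in semantic_path(second)


print(semantic_path("wolf"))
print(shares_category("teacher", "wolf", "being"))
print(shares_category("teacher", "wolf", "animal"))
```

The path from the root gives the full list of semantic tags for a word, from the broadest to the most specific, so a search for "animal" can match "wolf" without the query naming it.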
Use in information retrieval systems
The uses of corpus linguistics extend to a wide range of activities. Corpora are used to compile and revise dictionaries, to build machine translation systems, and for summarization, fact extraction, sentiment detection, and other text processing tasks.
In addition, such resources are widely used in studying the world's languages and how language works in general. Access to large volumes of pre-prepared information makes it possible to study quickly and in depth the trends in language development, the formation of neologisms and set expressions, changes in the meanings of lexical units, and so on.
Since working with such large amounts of data requires automation, there is now close cooperation between computational and corpus linguistics.
Russian National Corpus
This corpus (abbreviated NKRYA) includes a number of subcorpora, which allows it to serve as a resource for a wide variety of tasks.
The materials in the NKRYA database are divided into:
- media publications from the 1990s and 2000s, both domestic and foreign;
- recordings of spoken language;
- accentologically annotated texts (that is, texts with stress marks);
- dialect speech;
- poetry;
- materials with syntactic and other annotation.
The system also includes subcorpora with parallel translations of works from Russian into English, German, French, and many other languages (and vice versa).
The database also has a historical section representing written Russian at different periods of its development. There is an educational corpus as well, which can be useful to foreigners learning Russian.
The Russian National Corpus contains 400 million lexical units and in many respects is ahead of a significant portion of the corpora of European languages.
Prospects
The existence of corpus linguistics laboratories at both Russian and foreign universities speaks in favor of regarding this field as promising. Applying and studying these information retrieval resources drives the development of certain areas of high technology, such as the question-answering systems discussed above.
Further development of corpus linguistics is expected at every level. On the technical side, this means introducing new algorithms, optimizing the search for and processing of information, and increasing computing power and RAM. On the everyday side, users are finding more and more ways to use this type of resource in their daily lives and work.
Conclusion
In the middle of the last century, 2017 seemed a distant future in which spaceships would roam the universe and robots would do all the work for people. In reality, science is still full of "blank spots" and is striving to answer questions that have troubled humanity for centuries. Questions about how language works hold a place of honor here, and corpus and computational linguistics can help us answer them.
Working with large data sets makes it possible to detect patterns that were previously inaccessible, to predict the development of particular language features, and to track the formation of words in almost real time.
At the practical level, corpora can be seen, for example, as a potential tool for gauging public sentiment: the Internet is a constantly updated base of varied user-generated texts, including comments, reviews, articles, and many other forms of speech.
In addition, work with corpora contributes to the development of the very technologies involved in information retrieval, familiar to us from Google and Yandex, as well as machine translation and electronic dictionaries.
We can say with confidence that corpus linguistics is only taking its first steps and that it will flourish in the future.