Topic: Audio recording into an array

Hello everyone! I have the following problem and do not know where to begin. There is an audio recording of speech. I need to separate out the words (the loudest sounds) and somehow tie each one to the time at which it occurs. Is it possible to do this without any libraries? It looks difficult, and I do not know where to start or which direction to dig in. One more question: which file format is the most sensible to use? The overall task is this: I need to recognize the words in video files and convert them to text (through asr.yandex.net, etc.). Separating the words from each other is needed for step-by-step highlighting, i.e. as in karaoke.
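To make the "loudest sounds with timestamps" idea concrete, here is a rough sketch of one possible starting point using only the Python standard library. It assumes the audio track has already been extracted from the video as uncompressed 16-bit PCM WAV, which is about the simplest format to read into an array by hand. The function names, the 20 ms frame size, the RMS threshold of 500, and the 200 ms minimum pause are all illustrative assumptions, not tuned values. Note also that a loudness threshold finds pauses, not true word boundaries: in fluent speech words often run together without any silence between them.

```python
import array
import math
import sys
import wave


def read_mono_16bit(path):
    """Read a 16-bit PCM WAV file; return (samples of first channel, sample rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate = wf.getframerate()
        channels = wf.getnchannels()
        raw = wf.readframes(wf.getnframes())
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    # Samples are interleaved by channel; keep only the first channel.
    return list(samples[::channels]), rate


def find_loud_segments(samples, rate, frame_ms=20, threshold=500.0, min_gap_ms=200):
    """Return (start_sec, end_sec) pairs for stretches whose frame RMS >= threshold.

    Quiet gaps shorter than min_gap_ms do not split a segment.
    All default values are assumptions chosen for illustration.
    """
    frame_len = max(1, rate * frame_ms // 1000)  # samples per analysis frame
    min_gap = rate * min_gap_ms // 1000          # pause length in samples
    segments = []
    start = end = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            if start is None:
                start = i            # a loud stretch begins here
            end = i + len(frame)     # extend the stretch to this frame's end
        elif start is not None and i - end >= min_gap:
            segments.append((start / rate, end / rate))
            start = None
    if start is not None:            # the recording ended during a loud stretch
        segments.append((start / rate, end / rate))
    return segments
```

With a hypothetical file name, usage would be `samples, rate = read_mono_16bit("speech.wav")` followed by `find_loud_segments(samples, rate)`. The returned time ranges could then be used both to cut the audio into pieces for the recognition service and to time the highlighting.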