1

Topic: Binary search in the big text file

There is a file in which lines are arranged. Lines of different length (about 50 characters). It is necessary to find all the line long with the specified prefix. If to load in storage, banal binary search. If in storage not to load, as a whole all is clear how to do, but a heap of trifles. We halve a file byte by byte, we move to one side, we search for the beginning of a line, we move to other side, we search for a line end. Still with UTF-8 it is desirable to work, so it is possible and to get to the character middle initially. Adequate buffering, kilobyte on 4 is necessary. In general it is a lot of nuances. Who can saw in the nature ready good implementation of such task?

2

Re: Binary search in the big text file

Hello, vsb, you wrote: vsb> who Can saw in the nature ready good implementation of such task? It is possible mapdb to try. There file-backed collections, are BTreeMap c Prefix submaps, still, how much I remember,  the size of used storage.

3

Re: Binary search in the big text file

Hello, StanislavK, you wrote: vsb>> who Can saw in the nature ready good implementation of such task? SK> it is possible mapdb to try. There file-backed collections, are BTreeMap c Prefix submaps, still, how much I remember,  the size of used storage. As far as I understand, there the format. I should search in a file which comes from the outside, I cannot influence its format.

4

Re: Binary search in the big text file

Hello, vsb, you wrote: SK>> It is possible mapdb to try. There file-backed collections, are BTreeMap c Prefix submaps, still, how much I remember,  the size of used storage. vsb> as far as I understand, there the format. I should search in a file which comes from the outside, I cannot influence its format. I so understood what to sort in storage not  as a file can be big so there is a variant was most to implement sorting in a file, what not that that is difficult, but also is not trivial. BTreeMap it is one more variant - it is possible to flood simply there all file and then to search on it.

5

Re: Binary search in the big text file

Hello, vsb, you wrote: vsb> There is a file in which lines are arranged. Lines of different length (about 50 characters). It is necessary to find all the line long with the specified prefix. If to load in storage, banal binary search. If in storage not to load, as a whole all is clear how to do, but a heap of trifles. We halve a file byte by byte, we move to one side, we search for the beginning of a line, we move to other side, we search for a line end. Still with UTF-8 it is desirable to work, so it is possible and to get to the character middle initially. Adequate buffering, kilobyte on 4 is necessary. In general it is a lot of nuances. Who can saw in the nature ready good implementation of such task? Here or manual binary search in lines, or the single-valued scanning of all file if to load it in most optimal of structures.

6

Re: Binary search in the big text file

Hello, vsb, you wrote: vsb> It is necessary to find all the line long with the specified prefix... We move to other side, we search for a line end. What for a line end? vsb> Still with UTF-8 it is desirable to work, so it is possible and to get to the character middle initially. It is not necessary, line end search does not influence in any way parsing of characters in the found position. vsb> in general it is a lot of nuances. Who can saw in the nature ready good implementation of such task? And in my opinion all is primitive. I would do a bicycle.

7

Re: Binary search in the big text file

Hello, Blazkowicz, you wrote: vsb>> in general it is a lot of nuances. Who can saw in the nature ready good implementation of such task? B> and in my opinion all is primitive. I would do a bicycle. Well, all so tell https://research.googleblog.com/2006/06 … early.html https://www.quora.com/There-was-a-bug-i … as-the-bug etc

8

Re: Binary search in the big text file

Hello, vsb, you wrote: vsb> There is a file in which lines are arranged. Lines of different length (about 50 characters). It is necessary to find all the line long with the specified prefix. If to load in storage, banal binary search. If in storage not to load, as a whole all is clear how to do, but a heap of trifles. We halve a file byte by byte, we move to one side, we search for the beginning of a line, we move to other side, we search for a line end. Still with UTF-8 it is desirable to work, so it is possible and to get to the character middle initially. Adequate buffering, kilobyte on 4 is necessary. In general it is a lot of nuances. Who can saw in the nature ready good implementation of such task? To make an index - a file in which will be stored  the beginnings of lines. The index is loaded completely in storage. If the index turns out too big to make "the discharged" index, i.e. to index not all the line long successively, and everyone N th.

9

Re: Binary search in the big text file

Refused this idea as a result. Though search and log2N, but it all the same the order of 20 read operations even for one million records. On each operation of 20 times to read a file - as a result long quits. Stupidly I overtake the source file in h2 basis with an index, as a result normally all works with a minimum of readings with normal  and  and the code primitive.