1

Topic: What DB to select?

There are 50 GB of text data (CSV format). To speed up access I converted it into binary files (performing the conversion from text into the needed data types), ending up with 33 GB. As a result the read speed increased by a factor of 50/33, which means the bottleneck is not the CPU work of parsing the text but purely the hard-disk I/O.
In the binary format all fields are stored together (as structures), and all fields get read as well. Hence the idea of using a DB, so as not to read superfluous fields (and converting the result set from the DB into the needed objects will most likely, as in the previous case, barely slow things down).
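Roughly what the current binary layout looks like, as a minimal sketch (the field names and types here are invented just for illustration, the real files have their own set):

#include <cstdio>
#include <cstdint>
#include <vector>

// Hypothetical record layout: every field of every row sits together on disk,
// so a sequential read pulls all fields even when only one of them is needed.
struct Record {
    int64_t id;
    double  price;
    double  volume;
    int32_t flags;
};

std::vector<Record> read_all(const char* path) {
    std::vector<Record> rows;
    if (FILE* f = std::fopen(path, "rb")) {
        Record r;
        while (std::fread(&r, sizeof r, 1, f) == 1)
            rows.push_back(r);          // the whole record comes off the disk every time
        std::fclose(f);
    }
    return rows;
}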
Which DB would you recommend for such a volume? Would dbf files be suitable for this (or are they also read in full)? For now the only criterion is maximum speed.

2

Re: What DB to select?

I should add that the DB will run on the same computer where this data is processed. In other words, low CPU load is desirable.
+ A simple solution with simple documentation would also be desirable.
Is there anything like that?

3

Re: What DB to select?

Databases with columnar storage?

4

Re: What DB to select?

Basil A. Sidorov wrote:

Databases with columnar storage?

Thanks! I always thought that since an SQL query lets you name the specific fields to select, only those fields actually get read. Imagine my surprise when I read this:

wrote:

But what happens if we select, say, only 3 fields from a table that has 50 of them? Because of the row-wise data storage in traditional DBMSs (needed, as we remember, for frequent insertion of new records in transactional systems), the whole row with all its fields is read anyway. That means it does not matter whether we need only 3 fields or all 50: all of them will be read from disk in full, passed through the disk I/O controller, and handed to the processor, which only then picks out what the query actually needs.

https://habr.com/post/95181/
I cannot verify whether this is true, and whether all this SQL programming is no more than window dressing.

5

Re: What DB to select?

AlekseySQL wrote:

I cannot verify whether this is true, and whether all this SQL programming is no more than window dressing.

It is true. Just as it is true that whoever put 47 unnecessary fields into a table should have their hands torn off and be banned from writing articles. Even on Habr.

6

Re: What DB to select?

Dimitry Sibiryakov wrote:

It is true. Just as it is true that whoever put 47 unnecessary fields into a table should have their hands torn off and be banned from writing articles. Even on Habr.

In one situation you need one set of data, in another a different one, and in a third yet another...

7

Re: What DB to select?

AlekseySQL wrote:

In one situation you need one set of data, in another a different one, and in a third yet another...

...And most likely that data should be split across different tables. You know, third normal form and all that.
P.S. SQL is not window dressing, it is a declarative programming language.

8

Re: What DB to select?

I think it is worth trying to store the data not as an array of structures but as a structure of arrays. Then, when reading, it will be possible to "skip" the unneeded data. On average one of my files takes about 700 MB and contains ~7 fields, i.e. roughly 100 MB per field. Considering that an SSD reads pages of 4 KB, skipping the unneeded data like this could give a performance gain.
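A minimal sketch of what I have in mind, with the header, element type and file layout assumed purely for illustration: each field is written as one contiguous array after a row-count header, so a reader can seek straight to the one column it needs and skip the rest:

#include <cstddef>
#include <cstdio>
#include <cstdint>
#include <vector>

// Assumed file layout: [int64 row_count][field 0 array][field 1 array]...
// To read only field k (assumed here to hold doubles), seek past the earlier columns.
std::vector<double> read_field(const char* path, int field_index,
                               const std::vector<size_t>& field_sizes) {
    std::vector<double> values;
    FILE* f = std::fopen(path, "rb");
    if (!f) return values;

    int64_t rows = 0;
    std::fread(&rows, sizeof rows, 1, f);

    long offset = static_cast<long>(sizeof rows);
    for (int i = 0; i < field_index; ++i)
        offset += static_cast<long>(rows) * static_cast<long>(field_sizes[i]);  // skip unneeded columns
    std::fseek(f, offset, SEEK_SET);

    values.resize(static_cast<size_t>(rows));
    std::fread(values.data(), sizeof(double), values.size(), f);  // read just this one column
    std::fclose(f);
    return values;
}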

9

Re: What DB to select?

AlekseySQL wrote:

I think it is worth trying to store the data not as an array of structures but as a structure of arrays.

Haven't I already said that using single-pass algorithms might make all this dancing with a tambourine unnecessary?

10

Re: What DB to select?

Dimitry Sibiryakov wrote:

skipped...
Haven't I already said that using single-pass algorithms might make all this dancing with a tambourine unnecessary?

I do not think that changing the data-processing algorithm will improve anything, since the main "brake" is the speed of reading the data from disk.

11

Re: What DB to select?

AlekseySQL wrote:

I do not think that changing the data-processing algorithm will improve anything, since the main "brake" is the speed of reading the data from disk.

For this reason, reducing the number of disk reads is the most effective optimization.

12

Re: What DB to select?

AlekseySQL wrote:

There are 50 GB of text data (CSV format). To speed up access I converted it into binary files (performing the conversion from text into the needed data types), ending up with 33 GB. As a result the read speed increased by a factor of 50/33, which means the bottleneck is not the CPU work of parsing the text but purely the hard-disk I/O.

In your first step you abandoned the textual data representation and moved to binary, and that gave a performance gain of 50/33 (five thirds?) times. On the whole this is the right direction, but only for data that is almost always NOT NULL. For a set of rows full of emptiness, such a conversion can backfire (say one of the fields is VARCHAR(255), but in reality the full size was used only once and the remaining values did not exceed 10-20 characters). Then you lose performance on disk reads of "emptiness".
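A rough illustration of that effect (the 255-byte field here stands in for the hypothetical VARCHAR(255) above): a fixed-width binary record pays for the full 255 bytes on every row, while a length-prefixed layout pays only for the characters actually present:

#include <cstdint>
#include <cstdio>
#include <string>

// Fixed-width layout: every row carries the full 255 bytes, even if the value is 10 characters.
struct FixedRow {
    int64_t id;
    char    comment[255];
};

// Length-prefixed layout: a row with a 15-character comment costs only ~25 bytes on disk.
void write_var_row(std::FILE* f, int64_t id, const std::string& comment) {
    std::fwrite(&id, sizeof id, 1, f);
    uint16_t len = static_cast<uint16_t>(comment.size());
    std::fwrite(&len, sizeof len, 1, f);        // 2-byte length prefix
    std::fwrite(comment.data(), 1, len, f);     // only the bytes actually used
}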
The second point: where you say that the bottleneck is the hard disk, that is 80% correct. I can say that about any DBMS and win the bet in most such disputes. I/O is almost always the bottleneck, except for special in-memory DBMSs.

13

Re: What DB to select?

AlekseySQL wrote:

I cannot verify whether this is true, and whether all this SQL programming is no more than window dressing.

Worse than a lie there can only be a half-truth. In fact SQL programming remains what it has always been. The "general-purpose" approaches of the 20th century have simply been rethought, and specialized approaches were added to them which dropped some relational operations (NoSQL) but in exchange added guarantees, for example that a document (MongoDB) always lies physically consolidated on disk.
As for the article on Habr: I did not read it in full. But regarding column-oriented DBMSs, they are a compromise between OLTP and analytics, skewed towards analytics. You lose performance on future inserts/updates, because a vertical data structure has to be maintained, but you get a more economical read mode, and not so much for reading a specific row as for aggregations (sum/avg/count/min/max) over groups of rows.
By the way, a similar effect can be approached if you lay out the columns more rationally in an ordinary DBMS. The principle is the same; we can only argue about the numbers: how many percent bigger or faster? You have to build a prototype and model it. One option is to put your 3 hot columns into a separate table and keep it synchronized through triggers or materialized views.
By the way, about prototypes. I have watched the Oracle DBMS in operation for many years and I am convinced that there is no universal silver bullet that always suits any data plus any workload. Therefore everyone in this thread who convincingly advises you to use their "favourite DBMS", or a DBMS they read about on Habr five minutes ago, is lying. They act irresponsibly, because it is not they who will have to deal with the problems afterwards, it is you.
Do not trust them.
What should you trust? 1) Take your DBMS and load your table into it. 2) Simulate the workload: write a simulator program that drives reads against your table as close as possible to the real pattern. 3) Measure the average time for each class of query.
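For step 3, a minimal sketch of the kind of measurement meant here (the query itself is only a placeholder; plug in whatever call your DBMS driver provides):

#include <chrono>
#include <cstdio>
#include <functional>

// Run one class of query `iterations` times and report the average latency in milliseconds.
double average_ms(const std::function<void()>& run_query, int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        run_query();                            // placeholder for the real call to your DBMS
    const auto elapsed = clock::now() - start;
    return std::chrono::duration<double, std::milli>(elapsed).count() / iterations;
}

int main() {
    // Hypothetical workload class; replace the lambda body with the real query against your table.
    std::printf("hot-3-column scan: %.3f ms\n",
                average_ms([] { /* run the analytic query here */ }, 100));
}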
After that you can start the second phase: the actual optimization. That is a difficult process, an equation with many unknowns, and it is too early to talk about it; we have to wait for numbers from you. If you have no picture of the workload at all, then, my congratulations, we cannot work out in this thread the approach that will save you.
And of course I will repeat Jonathan Lewis's words. He says you should "know" your data: understand at the business level where the internal dependencies and skews in the statistics are. For example, a single foreign-key value may fill 99% of the rows. That is never visible on the relational schema, but as the expert on the system you should know this fact and prevent useless plans that, for example, go selecting by this magic key.

14

Re: What DB to select?

mayton wrote:

In your first step you abandoned the textual data representation and moved to binary, and that gave a performance gain of 50/33 (five thirds?) times.

Yes.
mayton, thanks, everything is already done on the real data, so modeling is not needed. I simply thought I would save a lot of time because the data would no longer have to be converted, but it turned out that this part is insignificant and the bulk of the time goes to reading the data from the file.

15

Re: What DB to select?

On what"simulated.

16

Re: What DB to select?

You can build a composite index on the needed columns; then the data will be taken from the index leaves.
And in MS SQL (I do not know about the others) there is a way to add the values of non-indexed columns to an index (create index... include...).

17

Re: What DB to select?

Arm79 wrote:

You can build a composite index on the needed columns; then the data will be taken from the index leaves.
And in MS SQL (I do not know about the others) there is a way to add the values of non-indexed columns to an index (create index... include...).

The author writes:
I think it is worth trying to store the data not as an array of structures but as a structure of arrays. Then, when reading, it will be possible to "skip" the unneeded data. On average one of my files takes about 700 MB and contains ~7 fields, i.e. roughly 100 MB per field. Considering that an SSD reads pages of 4 KB, skipping the unneeded data like this could give a performance gain.

Let's consider the trivial cases.
1) The author does analytics on 1-2-3 columns (hotfield1, hotfield2, hotfield3). In that case (oh, how I love doing this... finishing the problem statement and the title for the author... it's not a joke!) you can do:

create index hotFieldsForAnalytic on huge50gbTable (hotfield1, hotfield2, hotfield3);

And in the queries hint to the optimizer that we recommend INDEX_FAST_FULL_SCAN. Then the table drops out of the plan.
Profit.
2) Make a copy, a small table:

create table smallTable as select hotfield1, hotfield2, hotfield3 from huge50gbTable;
-- Optionally: triggers after DML, or materialized views.

Profit.