#### Topic: Sampling unique values

Hello!
I am trying to build a DB from a thick doc file that contains tabular data, to simplify some tasks. The trouble is that the tables were compiled by different people, and each of them filled the tables in at their own discretion. Some of it obviously cannot be unified any more. Let me explain using cats and other dumb animals: suppose the cells contain entries such as "a cat, a dog, a parrot" or "a dog, a hamster, a canary". These values can repeat across many rows. By selecting the rows and splitting them into substrings on the comma, I get the unique values ("cat", "dog", "parrot" and so on). But some colleagues filled in the table like this: "cats, dogs", or even worse: "Dogs domestic, thoroughbred, a cat". So the table of unique animals will contain both "dog" and "dogs". One can somehow live with that, but splitting the last entry on the comma as a separator produces a certain "thoroughbred" among the animals, which does not fit anywhere at all. I cannot figure out how to teach the algorithm to understand where the "dogs thoroughbred" end and the "cat" begins. I would be grateful for any help.

#### Re: Sampling unique values

Manually, manually...

#### Re: Sampling unique values

Shm. wrote:

Hello!
.....
I cannot figure out how to teach the algorithm to understand where the "dogs thoroughbred" end and the "cat" begins. I would be grateful for any help.

Well, if the string contains the sequence "dog", you have found a dog; if it contains the sequence "cat", it is a cat... Isn't that obvious?
To make searching easier, convert everything to lower case right away, for example.
You can also step on a mine with the letter "с", because it sits on the same key as the Latin "c": lazy people who notice they are typing in the wrong layout erase only the wrong characters and carry on typing, so you end up with a Cyrillic word whose first character is Latin.
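
A minimal sketch of that normalization step, assuming the cells are plain strings. The `HOMOGLYPHS` mapping and the `normalize` helper are hypothetical names made up for illustration, and the list of look-alike letters is not exhaustive:

```python
# Latin lowercase letters that look identical to Cyrillic ones.
# Illustrative assumption: this list is not exhaustive.
HOMOGLYPHS = str.maketrans({
    "a": "\u0430",  # Cyrillic a
    "c": "\u0441",  # Cyrillic es
    "e": "\u0435",  # Cyrillic ie
    "o": "\u043e",  # Cyrillic o
    "p": "\u0440",  # Cyrillic er
    "x": "\u0445",  # Cyrillic ha
    "y": "\u0443",  # Cyrillic u
})


def is_cyrillic(ch):
    """True for lowercase Russian letters, including yo."""
    return "\u0430" <= ch <= "\u044f" or ch == "\u0451"


def normalize(token):
    """Lower-case a token and repair Latin look-alikes in Cyrillic words."""
    token = token.strip().lower()
    # Only touch tokens that already contain Cyrillic letters, so that
    # purely Latin words are left alone.
    if any(is_cyrillic(ch) for ch in token):
        token = token.translate(HOMOGLYPHS)
    return token
```

A word that mixes both alphabets on purpose would be damaged by this, so it is probably worth logging every token the function changes.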

#### Re: Sampling unique values

AndreyTarasov wrote:

skipped...
Well, if the string contains the sequence "dog", you have found a dog; if it contains the sequence "cat", it is a cat... Isn't that obvious?
To make searching easier, convert everything to lower case right away, for example.
You can also step on a mine with the letter "с", because it sits on the same key as the Latin "c": lazy people who notice they are typing in the wrong layout erase only the wrong characters and carry on typing, so you end up with a Cyrillic word whose first character is Latin.

That part is exactly what is obvious, which is why I wrote that the dogs are not so scary. But how do I get "dogs domestic", "dogs thoroughbred", "cat" from the string "dogs domestic, thoroughbred, a cat"?
It seems the only way is

Akina wrote:

Manually, manually...

#### Re: Sampling unique values

Well, or at least "dogs domestic, thoroughbred" and "cat". How do I express that "thoroughbred" refers to the dogs, while the cat is already a different animal? The thing is, I do not know in advance what other entries there will be further down the table.
One option is to select all the unique values split on the comma and then look through the list.
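
For reference, a minimal sketch of that "split on the comma and look through the list" option, using the sample cells from this thread. The `split_cell` helper is a hypothetical name, and dropping a leading "a " is an assumption made only so that "a cat" and "cat" count as the same value:

```python
rows = [
    "a cat, a dog, a parrot",
    "a dog, a hamster, a canary",
    "cats, dogs",
    "Dogs domestic, thoroughbred, a cat",
]


def split_cell(cell):
    """Split one cell on commas and tidy each piece."""
    tokens = []
    for part in cell.split(","):
        token = part.strip().lower()
        # Assumption: a leading article is noise, not part of the value.
        if token.startswith("a "):
            token = token[2:]
        if token:
            tokens.append(token)
    return tokens


# Every distinct value, sorted so that it is easy to scan by eye.
unique_values = sorted({token for row in rows for token in split_cell(row)})
for value in unique_values:
    print(value)
```

The orphaned "thoroughbred" shows up in this list as a value of its own, which is exactly the case that still has to be caught by eye.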

#### Re: Sampling unique values

Shm. wrote:

That part is exactly what is obvious, which is why I wrote that the dogs are not so scary. But how do I get "dogs domestic", "dogs thoroughbred", "cat" from the string "dogs domestic, thoroughbred, a cat"?

??? What is the problem?? Just search by a main attribute and an additional one, if your main attribute is divided into sections.
Column 1 = Dog (the keyword that was found)
Column 2 = Dogs domestic (the full text, for detailed study)
Or state your problems clearly :-)
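
A rough sketch of the two-column idea, assuming a hand-made list of keyword stems. `KEYWORDS` and `classify` are hypothetical names, and the stems are just the animals mentioned in this thread:

```python
# Stem to look for -> keyword stored in column 1.
# Assumption: this list is maintained by hand.
KEYWORDS = {
    "dog": "dog",
    "cat": "cat",
    "parrot": "parrot",
    "hamster": "hamster",
    "canary": "canary",
}


def classify(entry):
    """Return (keyword, full text); keyword is None if no stem matches."""
    text = entry.strip()
    lowered = text.lower()
    for stem, keyword in KEYWORDS.items():
        if stem in lowered:
            return keyword, text
    return None, text


for entry in ["Dogs domestic", "thoroughbred", "a cat"]:
    print(classify(entry))
```

Plain substring search is crude: a stem such as "cat" would also match inside an unrelated longer word, so a real list of stems needs some care. Entries that come back with `None` are the ones left for manual review.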

#### Re: Sampling unique values

AndreyTarasov wrote:

skipped...
Or state your problems clearly :-)

It is stated clearly.

#### Re: Sampling unique values

Shm. wrote:

One option is to select all the unique values split on the comma and then look through the list.

Another option: in the first pass, compile a dictionary of words like "thoroughbred" (by hand); in the second pass, whenever a word from the dictionary turns up, prepend the beginning of the previous entry to it.
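
A minimal sketch of the second pass, assuming the first pass has already produced the hand-made dictionary. `QUALIFIERS` and `expand` are hypothetical names, and treating the first word of the previous entry as its "beginning" is an assumption that fits the "dogs domestic" word order used in this thread:

```python
# First pass (done by hand): words that are not animals themselves
# but belong to the entry before them.
QUALIFIERS = {"thoroughbred", "domestic"}


def expand(cell):
    """Split a cell on commas and glue orphaned qualifiers to the previous head."""
    result = []
    head = None  # first word of the last entry that was not a bare qualifier
    for part in cell.split(","):
        token = part.strip().lower()
        # Assumption: a leading article is noise, not part of the value.
        if token.startswith("a "):
            token = token[2:]
        if not token:
            continue
        if token in QUALIFIERS and head is not None:
            token = f"{head} {token}"
        else:
            head = token.split()[0]
        result.append(token)
    return result


print(expand("Dogs domestic, thoroughbred, a cat"))
```

On the problem string this yields "dogs domestic", "dogs thoroughbred" and "cat". A qualifier that appears first in a cell has nothing to attach to and is kept as it is.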

#### Re: Sampling unique values

Dima T wrote:

skipped...
It is stated clearly.

It is not clear at all. What needs to be found??? All the dogs, separated from the cats?
Or find all the dogs and divide them into categories, and then do the same with the cats?
If the first, then the secondary attributes do not matter at all and there is no need to look at them, whether domestic, thoroughbred, etc.
And if the second, nothing stops you from making 2 tables and dividing what you find into sections, identifying each section by its determining attribute. But the author has not said which it is, so the problem remains unknown.

#### Re: Sampling unique values

First do a primary analysis and generate the dictionary. Provide 2 fields right away: the token as it appears in the list and the token in the dictionary. Process it manually; to begin with, simply correct the obvious variants. That already formalizes a considerable part of the data. At the second stage, work through the entries that occur only once or thereabouts.
As a result you get a table of conversions that can be used to convert the "crooked" list into a normal one automatically.
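
A sketch of how the draft dictionary for the first stage might be generated, assuming the cells are available as a list of strings. The `draft_dictionary` function and its field names are hypothetical, and the rows are the sample cells from this thread:

```python
from collections import Counter


def draft_dictionary(rows):
    """Build a draft with two token fields plus an occurrence count."""
    counts = Counter(
        part.strip().lower()
        for row in rows
        for part in row.split(",")
        if part.strip()
    )
    # "dictionary" starts as a copy of "source" and is corrected by hand.
    return [
        {"source": token, "dictionary": token, "count": n}
        for token, n in counts.most_common()
    ]


rows = [
    "a cat, a dog, a parrot",
    "a dog, a hamster, a canary",
    "cats, dogs",
    "Dogs domestic, thoroughbred, a cat",
]
draft = draft_dictionary(rows)

# Second stage: entries seen only once deserve a closer look.
rare = [entry["source"] for entry in draft if entry["count"] == 1]
print(rare)
```

The count column is only there to order the manual work: frequent tokens get corrected first, and the rare ones are left for the second stage.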

#### Re: Sampling unique values

Applied to the example given:

"a cat, a dog, a parrot"
"a dog, a hamster, a canary"
"Cats, dogs"
"Dogs domestic, thoroughbred, a cat"

The translation table comes out like this:

| Source token | Dictionary token |
| --- | --- |
| Dogs domestic, thoroughbred | dog thoroughbred |
| Dogs domestic, thoroughbred | dog domestic |
| Hamster | hamster |
| Canary | canary |
| Parrot | parrot |
| Dogs | dog |
| Cats | cat |
| Dog | dog |
| Cat | cat |

The table is built (or at any rate processed) in order of decreasing token length.
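
A sketch of why the processing order matters, assuming the translation table is kept as a mapping from a lower-cased source token to its dictionary tokens. `TRANSLATION` and `convert` are hypothetical names, and the first key is the combined dog entry as reconstructed from the example string:

```python
# Source token (lower-cased) -> dictionary tokens it stands for.
TRANSLATION = {
    "dogs domestic, thoroughbred": ["dog domestic", "dog thoroughbred"],
    "hamster": ["hamster"],
    "canary": ["canary"],
    "parrot": ["parrot"],
    "dogs": ["dog"],
    "cats": ["cat"],
    "dog": ["dog"],
    "cat": ["cat"],
}


def convert(cell):
    """Translate one cell, trying the longest source tokens first."""
    text = cell.lower()
    found = []
    for source in sorted(TRANSLATION, key=len, reverse=True):
        if source in text:
            found.extend(TRANSLATION[source])
            # Blank out the match so shorter tokens cannot hit it again.
            text = text.replace(source, " ")
    return found


print(convert("Dogs domestic, thoroughbred, a cat"))
```

If "dog" were tried before "dogs domestic, thoroughbred", the long entry would never get a chance to match as a whole, which is presumably the reason for processing by decreasing length.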