1

Topic: Optimization of search of anagrams

Received recently curious job for interview. Here the job text: Anagrams are defined as words with the same length and same number of characters in any order. Eg: bat/tab/abt are anagrams aabb/baba/bbaa/abab - anagrams abcd/abce - are not anagrams (different characters) aaab, aabb - are not anagrams (different character count) Given a list of words - {abcd, bbb, abc, bat, cat, yyyyxxxx, tab, atb, rat, cab, xxyyxxyyxx, atb} Print out the anagram pairs - {{abc, cab}, {bat, tab, atb, atb}} Assuming that the input is of type [a-z], can you optimize the solution. With anagrams all is simple - they easily are if to compare the sorted copies of lines. And here with optimization I and did not understand. I so understand, sense there in generation unique  for every line but how to make  identical to the different lines containing the same dial-up of letters, I yet did not understand. The colleague looked and told supposedly "about, here all is simple - it is necessary to use prime numbers". If I correctly understand, it is necessary to select prime numbers from this variant from 2 and more (and can and from 3, it is not assured) and to take result of multiplication. But 26 prime number in this variant will be 101, i.e. for everyone ' z ' we need to multiply  on 101. Then 6. 7 letters ' z ' give  which gets out for limits 64 bit. Prompt, what here would be the correct algorithm? And, what is more interesting - prompt, it is necessary to look at what else useful methods of hash coding better to understand this subject?

2

Re: Optimization of search of anagrams

Hello, Artem Korneev, you wrote: AK> Prompt, what here would be the correct algorithm? And, what is more interesting - prompt, it is necessary to look at what else useful methods of hash coding better to understand this subject? hash = [bit0=has (a), bit1=has (b)... bit (25) =has (z), bit (26. 31) =length]

3

Re: Optimization of search of anagrams

Hello, kov_serg, you wrote: AK>> Prompt, what here would be the correct algorithm? And, what is more interesting - prompt, it is necessary to look at what else useful methods of hash coding better to understand this subject? _> hash = [bit0=has (a), bit1=has (b)... bit (25) =has (z), bit (26. 31) =length] not, well a bit there explicitly does not suffice, the registration of an amount of characters is necessary. The bit gives identical  for "ab" and "baaab" - too many collisions.

4

Re: Optimization of search of anagrams

Hello, Artem Korneev, you wrote: AK> Hello, kov_serg, you wrote: AK>>> Prompt, what here would be the correct algorithm? And, what is more interesting - prompt, it is necessary to look at what else useful methods of hash coding better to understand this subject? _>> hash = [bit0=has (a), bit1=has (b)... bit (25) =has (z), bit (26. 31) =length] AK> not, well a bit there explicitly does not suffice, the registration of an amount of characters is necessary. The bit gives identical  for "ab" and "baaab" - too many collisions. At "ab" length=2, and at "baaab" length=5. Where collisions?

5

Re: Optimization of search of anagrams

Hello, kov_serg, you wrote: _>>> hash = [bit0=has (a), bit1=has (b)... bit (25) =has (z), bit (26. 31) =length] AK>> Not, well a bit there explicitly does not suffice, the registration of an amount of characters is necessary. The bit gives identical  for "ab" and "baaab" - too many collisions. _> at "ab" length=2, and at "baaab" length=5. Where collisions? I on length in the end of attention did not turn. Collisions will be at lines of identical length with an identical dial-up of used characters, "aaaab" and bbabb ", for example.

6

Re: Optimization of search of anagrams

7

Re: Optimization of search of anagrams

Hello, Artem Korneev, you wrote: AK> Hello, kov_serg, you wrote: _>>>> hash = [bit0=has (a), bit1=has (b)... bit (25) =has (z), bit (26. 31) =length] AK>>> Not, well a bit there explicitly does not suffice, the registration of an amount of characters is necessary. The bit gives identical  for "ab" and "baaab" - too many collisions. _>> at "ab" length=2, and at "baaab" length=5. Where collisions? AK> I on length in the end of attention did not turn. AK> collisions will be at lines of identical length with an identical dial-up of used characters, "aaaab" and bbabb ", for example. Then lead to a canonical form and consider normal . For example c an anagram"bbaabb"-> aabbbb-> string_hash (aabbbb)

8

Re: Optimization of search of anagrams

Hello, Artem Korneev, you wrote: AK> And here with optimization I and did not understand. I so understand, sense there in generation unique  for every line but how to make  identical to the different lines containing the same dial-up of letters, I yet did not understand. XOR all characters in quality , and then comparing only within received .

9

Re: Optimization of search of anagrams

Hello, Artem Korneev, you wrote: AK> Hello, kov_serg, you wrote: _>>>> hash = [bit0=has (a), bit1=has (b)... bit (25) =has (z), bit (26. 31) =length] AK>>> Not, well a bit there explicitly does not suffice, the registration of an amount of characters is necessary. The bit gives identical  for "ab" and "baaab" - too many collisions. _>> at "ab" length=2, and at "baaab" length=5. Where collisions? AK> I on length in the end of attention did not turn. AK> collisions will be at lines of identical length with an identical dial-up of used characters, "aaaab" and bbabb ", for example. The variant which offered you  than it was not pleasant? Something of a type: int code (char c) {if (c> = ' a ' && c <= ' z ') return c-'a '; if (c> = ' A ' && c <= ' Z ') return c-'A '; return 26;} int hash (const char *s) {enum {N=26}; static const int p [N+1] = {3259,3271,3299,3301,3307,3313,3319,3323,3329,3331,3343,3347,3359, 3361,3371,3373,3389,3391,3407,3413,3433,3449,3457,3461,3463,3467, 1}; int h=1; for (; *s; s ++) h * = p [code (*s)]; return h;}

10

Re: Optimization of search of anagrams

Hello, Artem Korneev, you wrote: AK> Received recently curious job for interview. Here the job text: AK> Anagrams are defined as words with the same length and same number of characters in any order. AK> Eg: AK> bat/tab/abt are anagrams AK> aabb/baba/bbaa/abab - anagrams AK> abcd/abce - are not anagrams (different characters) AK> aaab, aabb - are not anagrams (different character count) AK> Given a list of words - {abcd, bbb, abc, bat, cat, yyyyxxxx, tab, atb, rat, cab, xxyyxxyyxx, atb} AK> Print out the anagram pairs - {{abc, cab}, {bat, tab, atb, atb}} AK> Assuming that the input is of type [a-z], can you optimize the solution. AK> with anagrams all is simple - they easily are if to compare the sorted copies of lines. And here with optimization I and did not understand. I so understand, sense there in generation unique  for every line but how to make  identical to the different lines containing the same dial-up of letters, I yet did not understand. And  it is really necessary? Perhaps, simply to sort the row list, using as a predicate comparing of the sorted lines-elements?

11

Re: Optimization of search of anagrams

Hello, Artem Korneev, you wrote: AK> Prompt, what here would be the correct algorithm? And, what is more interesting - prompt, it is necessary to look at what else useful methods of hash coding better to understand this subject? And there are statements of the problem? The maximum number of repeating characters? Length of lines? With the hash table the decision is obvious. If it is necessary guaranteed n*log (n) at worst (and n on storage) it is possible to make a key a vector of counters of characters of a line and to sort by it.