Handling of Big Data on the Internet
When we handle text files that are about 4000-5000 lines long, processing becomes difficult and time-consuming: Input/Output overhead grows, the approach does not scale, hardware faults become likely, code is repeated unnecessarily, memory space is wasted, and it is hard to handle errors and prevent error propagation. The term "Big Data" applies to data with these problems.
Heuristics
- Articles (a, an, the)
- Prepositions (of, at, about, around, besides, aside, above, over)
- Conjunctions (and, between, or, because, hence, since, although, though, not only, but also, but, so, therefore)
- Adverbs (adverbs are easy to recognize as they mostly have "ly" as their suffix)
- Pronouns (I, he, she, it, we, our, their)
- This, that, there
- Suffixes
- words ending in "tion"
- words ending in "less"
- words ending in "ful"
- Prefixes
- "in"
- "dis"
- "con"
- "com"
English language characteristics
- Frequency of characters in the English Language
- Research has shown that the letter "e" has the highest frequency of all letters in English text
- Also, the vowels a, e, i, o, and u are more frequent than the consonants
- On average, about 20% of the letters in a word are vowels
- Groups of letters: certain letter groups are common, like "un", "dis", "com", "con", "st", "th", "qu"
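These frequency claims are easy to test on any sample text. Here is a small sketch (the function name is mine) that counts letter frequencies:

// Count how often each letter a-z occurs in a text
function letterFrequency(text) {
  const counts = {};
  for (const ch of text.toLowerCase()) {
    if (ch >= "a" && ch <= "z") counts[ch] = (counts[ch] || 0) + 1;
  }
  // Return letters sorted by descending frequency
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

letterFrequency("From fairest creatures we desire increase");
// [["e",8],["r",6], ...] — "e" leads, as expected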
Association Rule mapping
- I came
- Thank you
- Please help
- Take care
- Good morning, etc.
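The post does not spell out how these association rules are applied, but one plausible reading is that whole frequent phrases are mapped to single reserved tokens before word-level compression. A hedged sketch, where phrase_codes, applyPhraseRules, and the token values are all my own assumptions:

// Hypothetical phrase table: frequent phrases map to reserved tokens
const phrase_codes = new Map([
  ["thank you", "\u0001"],
  ["please help", "\u0002"],
  ["take care", "\u0003"],
  ["good morning", "\u0004"],
]);

// Replace every known phrase with its one-character token
function applyPhraseRules(text) {
  let out = text.toLowerCase();
  for (const [phrase, token] of phrase_codes) {
    out = out.split(phrase).join(token);
  }
  return out;
}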
Server architecture
The server architecture provides parallel processing and load balancing, with a backup server and a database server. The server cluster consists of one name node (the primary server) and three or more (n >= 3) child nodes (secondary servers).
The authority is with the name node i.e., the primary server.
The primary server will distribute the load to the secondary servers.
The first secondary process will scan the lines of stanza 1.
Asynchronously, the second secondary process will try to process stanza 2.
Similarly, the third secondary server will poll the first and second servers and when completed, will merge the two outputs.
This process continues till the end of the file is reached.
Suppose I have a poem written by William Shakespeare that has 400 stanzas, where each stanza has 8 to 10 lines and each line has 100 characters; that is a lot of data to process.
The starting point for compressing the above file is to declare a dictionary of the most commonly used words, like
- because
- but
- about
- have
What the name node will do is divide the big data into two stanzas at a time; the first two secondary nodes compress their stanzas in parallel and send the results to the third secondary node. This third secondary node will merge the two results and compress them using the same algorithm, then send the result to the primary server. The primary server will repeat the same steps until all the stanzas are finished. The important point to remember is to assign a third server for the merge step.
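Here is a minimal sketch of that control flow as the name node might run it. compressOnNode and mergeOnThirdNode stand in for remote calls to the secondary servers; both names are placeholders of mine, not APIs from the post.

// Name-node loop: hand out two stanzas at a time, compress them in
// parallel on secondaries 1 and 2, then merge on the third secondary.
async function processPoem(stanzas) {
  let merged = "";
  for (let i = 0; i < stanzas.length; i += 2) {
    const [c1, c2] = await Promise.all([
      compressOnNode(1, stanzas[i]),
      compressOnNode(2, stanzas[i + 1] ?? ""),
    ]);
    // The third node concatenates the pair, re-compresses, and returns
    merged = await mergeOnThirdNode(merged, c1, c2);
  }
  return merged; // final compressed output back at the primary server
}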
Storing commonly used words in an array in alphabetical order (from a to z)
let common_words_list = [
"a","about","above","after","again",
"against","all","am","an","and",
"any","are","aren't","as","at",
"be","because","been","before","being",
"below","between","both","but","by",
"can't","cannot","could","couldn't",
"did","didn't","do","does","doesn't",
"doing","don't","down","during","each",
"few","for","from","further","good",
"had","hadn't","has","hasn't","have",
"haven't","having","he","he'd","he'll",
"he's","her","here","here's","hers",
"herself","him","himself","his","how",
"how's","i","i'd","i'll","i'm",
"i've","if","in","into","is",
"isn't","it","it's","its","itself",
"let's","me","more","most","mustn't",
"my","myself","no","nor","not",
"of","off","on","once","only",
"or","other","ought","our","ours",
"ourselves","out","over","own","same",
"shan't","she","she'd","she'll","she's",
"should","shouldn't","so","some","such",
"than","that","that's","the","their",
"theirs","them","themselves","then","there",
"there's","these","they","they'd","they'll",
"they're","they've","this","those","through",
"to","too","under","until","up",
"very","was","wasn't","we","we'd",
"we'll","we're","we've","were","weren't",
"what","what's","when","when's","where",
"where's","which","while","who","who's",
"whom","why","why's","with","won't",
"would","wouldn't","you","you'd","you'll",
"you're","you've","your","yours","yourself",
"yourselves" ];
The sample input to the first two secondary servers is:
Stanza #1:-
Stanza #2:-
Algorithm | Part 1
Steps:
- Find the index of each word of the stanza in the common words list and replace the word with its code
- Also, store the frequency of every word in your common words list
- The most frequent word will be assigned the code with the fewest bits
- All codes should be prefix-free, meaning that no code should appear as the prefix of another code, to avoid ambiguity while decoding the file
// frequency[i] counts occurrences of common_words_list[i]
let frequency = new Array(common_words_list.length).fill(0);
let x = common_words_list.indexOf(word); // -1 if the word is not in the list
if (x !== -1) frequency[x]++;
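In context this runs once per word, so for a whole stanza the counting loop looks roughly like the following sketch (stanza is a hypothetical string variable holding the stanza's text):

// Tally every word of the stanza against the dictionary
for (const word of stanza.toLowerCase().split(/\s+/)) {
  const x = common_words_list.indexOf(word);
  if (x !== -1) frequency[x]++;
}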
Remaining Steps:
- Sort the frequency in descending order
- The most popular word will be given the least number of bits in the binary code
- After processing the words, count the frequency of each letter of the remaining (non-dictionary) words and store it in frequency_alphabet[x]
- Then sort frequency_alphabet in descending order as well
- The most popular letter will be given the shortest binary code
- Both the servers will send their output to the third secondary server
- It will concatenate the two results and apply the same compressor algorithm used by the other secondary servers
- It will send the result to the primary server
- This process will continue till the end of the file is reached
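The coding scheme these steps describe — the most frequent symbol gets the fewest bits, and no code is a prefix of another — is essentially Huffman coding. Below is a minimal sketch of such a code builder; buildPrefixFreeCodes is my own illustration, not code from the post.

// Build a prefix-free code table: most frequent word → shortest bit string
function buildPrefixFreeCodes(frequency, words) {
  // One leaf node per word that actually occurred
  let nodes = words
    .map((w, i) => ({ word: w, freq: frequency[i] }))
    .filter(n => n.freq > 0);
  if (!nodes.length) return {};
  // Repeatedly merge the two least frequent nodes (Huffman's algorithm)
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.freq - b.freq);
    const [a, b] = nodes.splice(0, 2);
    nodes.push({ left: a, right: b, freq: a.freq + b.freq });
  }
  // Walk the tree: a left edge adds "0", a right edge adds "1"
  const codes = {};
  (function walk(node, code) {
    if (node.word !== undefined) { codes[node.word] = code || "0"; return; }
    walk(node.left, code + "0");
    walk(node.right, code + "1");
  })(nodes[0], "");
  return codes;
}

buildPrefixFreeCodes([5, 2, 1], ["the", "we", "from"]);
// { the: "1", we: "01", from: "00" } — most frequent word, shortest code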
For example, take the line "From fairest creatures we desire increase". "From" and "we" are in the dictionary, so we store frequency["from"] = 1 and frequency["we"] = 1. This process is repeated for all the lines to produce our intermediate code.