Handling of Big Data on the Internet
When we handle text files that are about 4000-5000 lines long, processing becomes difficult and time-consuming: Input/Output overhead grows, the approach does not scale, hardware faults become likely, code is repeated unnecessarily, memory space is wasted, and it is hard to handle errors and prevent error propagation. The term "Big Data" applies to data with these problems.
Heuristics
- Articles (a, an, the)
- Prepositions (of, at, about, around, besides, aside, above, over)
- Conjunctions (and, between, or, because, hence, since, although, though, not only, but also, but, so, therefore)
- Adverbs (adverbs are easy to recognize as they mostly have "ly" as their suffix)
- Pronouns (I, he, she, it, we, our, their)
- This, that, there
- Suffixes
- words ending in "tion"
- words ending in "less"
- words ending in "ful"
- Prefixes
- "in"
- "dis"
- "con"
- "com"
English language characteristics
- Frequency of characters in the English Language
- Research has shown that the letter "e" has the highest frequency of all letters in English text
- Also, the vowels a, e, i, o, and u are more frequent than the consonants
- On average, about 20% of the letters in a word are vowels
- Groups of letters: certain letter groups are common, like "un", "dis", "com", "con", "st", "th", "qu"
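These frequency claims are easy to test on any sample text. Here is a small sketch (the function name is mine) that counts letter frequencies:

// Count how often each letter a-z occurs in a text
function letterFrequency(text) {
  const counts = {};
  for (const ch of text.toLowerCase()) {
    if (ch >= "a" && ch <= "z") counts[ch] = (counts[ch] || 0) + 1;
  }
  // Return letters sorted by descending frequency
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

letterFrequency("From fairest creatures we desire increase");
// [["e",8],["r",6], ...] — "e" leads, as expected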
Association Rule mapping
- I came
- Thank you
- Please help
- Take care
- Good morning, etc.
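The post does not spell out how these association rules are applied, but one plausible reading is that whole frequent phrases are mapped to single reserved tokens before word-level compression. A hedged sketch, where phrase_codes, applyPhraseRules, and the token values are all my own assumptions:

// Hypothetical phrase table: frequent phrases map to reserved tokens
const phrase_codes = new Map([
  ["thank you", "\u0001"],
  ["please help", "\u0002"],
  ["take care", "\u0003"],
  ["good morning", "\u0004"],
]);

// Replace every known phrase with its one-character token
function applyPhraseRules(text) {
  let out = text.toLowerCase();
  for (const [phrase, token] of phrase_codes) {
    out = out.split(phrase).join(token);
  }
  return out;
}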
Server architecture
The server architecture provides parallel processing and load balancing, with a backup server and a database server. The server cluster consists of one name node (the primary server) and three or more (n >= 3) child nodes (secondary servers).
The authority is with the name node i.e., the primary server.
The primary server will distribute the load to the secondary servers.
The first secondary process will scan the lines of stanza 1.
Asynchronously, the second secondary process will try to process stanza 2.
Similarly, the third secondary server will poll the first and second servers and when completed, will merge the two outputs.
This process continues till the end of the file is reached.
Suppose I have a poem written by William Shakespeare that has 400 stanzas, where each stanza has 8 to 10 lines and each line has 100 characters; that is a lot of data to process.
The starting point for compressing the above file is to declare a dictionary of the most commonly used words, like
- because
- but
- about
- have
What the name node will do is divide the big data into two stanzas at a time; the first two secondary nodes compress their stanzas in parallel and send the results to the third secondary node. This third secondary node will merge the two results and compress them using the same algorithm, then send the result to the primary server. The primary server will repeat the same steps until all the stanzas are finished. The important point to remember is to assign a third server for the merge step.
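Here is a minimal sketch of that control flow as the name node might run it. compressOnNode and mergeOnThirdNode stand in for remote calls to the secondary servers; both names are placeholders of mine, not APIs from the post.

// Name-node loop: hand out two stanzas at a time, compress them in
// parallel on secondaries 1 and 2, then merge on the third secondary.
async function processPoem(stanzas) {
  let merged = "";
  for (let i = 0; i < stanzas.length; i += 2) {
    const [c1, c2] = await Promise.all([
      compressOnNode(1, stanzas[i]),
      compressOnNode(2, stanzas[i + 1] ?? ""),
    ]);
    // The third node concatenates the pair, re-compresses, and returns
    merged = await mergeOnThirdNode(merged, c1, c2);
  }
  return merged; // final compressed output back at the primary server
}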
Storing commonly used words in an array in alphabetical order (from a to z)
let common_words_list = [
"a","about","above","after","again",
"against","all","am","an","and",
"any","are","aren't","as","at",
"be","because","been","before","being",
"below","between","both","but","by",
"can't","cannot","could","couldn't",
"did","didn't","do","does","doesn't",
"doing","don't","down","during","each",
"few","for","from","further","good",
"had","hadn't","has","hasn't","have",
"haven't","having","he","he'd","he'll",
"he's","her","here","here's","hers",
"herself","him","himself","his","how",
"how's","i","i'd","i'll","i'm",
"i've","if","in","into","is",
"isn't","it","it's","its","itself",
"let's","me","more","most","mustn't",
"my","myself","no","nor","not",
"of","off","on","once","only",
"or","other","ought","our","ours",
"ourselves","out","over","own","same",
"shan't","she","she'd","she'll","she's",
"should","shouldn't","so","some","such",
"than","that","that's","the","their",
"theirs","them","themselves","then","there",
"there's","these","they","they'd","they'll",
"they're","they've","this","those","through",
"to","too","under","until","up",
"very","was","wasn't","we","we'd",
"we'll","we're","we've","were","weren't",
"what","what's","when","when's","where",
"where's","which","while","who","who's",
"whom","why","why's","with","won't",
"would","wouldn't","you","you'd","you'll",
"you're","you've","your","yours","yourself",
"yourselves" ];
The sample input to the first two secondary servers is:
Stanza #1:-
Stanza #2:-
Algorithm | Part 1
Steps:
- Find the index of each word of the stanza in the common words list and replace the word with its code
- Also, store the frequency of every word in your common words list
- The most frequent word will be assigned the code with the fewest bits
- All codes should be prefix-free, meaning that no code should appear as the prefix of another code, to avoid ambiguity while decoding the file
// frequency[i] counts occurrences of common_words_list[i]
let frequency = new Array(common_words_list.length).fill(0);
let x = common_words_list.indexOf(word); // -1 if the word is not in the list
if (x !== -1) frequency[x]++;
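In context this runs once per word, so for a whole stanza the counting loop looks roughly like the following sketch (stanza is a hypothetical string variable holding the stanza's text):

// Tally every word of the stanza against the dictionary
for (const word of stanza.toLowerCase().split(/\s+/)) {
  const x = common_words_list.indexOf(word);
  if (x !== -1) frequency[x]++;
}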
Remaining Steps:
- Sort the frequency in descending order
- The most popular word will be given the least number of bits in the binary code
- After processing the words, count the frequency of each letter of the remaining (non-dictionary) words and store it in frequency_alphabet[x]
- Then sort frequency_alphabet in descending order as well
- The most popular letter will be given the shortest binary code
- Both the servers will send their output to the third secondary server
- It will concatenate the two results and apply the same compressor algorithm used by the other secondary servers
- It will send the result to the primary server
- This process will continue till the end of the file is reached
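The coding scheme these steps describe — the most frequent symbol gets the fewest bits, and no code is a prefix of another — is essentially Huffman coding. Below is a minimal sketch of such a code builder; buildPrefixFreeCodes is my own illustration, not code from the post.

// Build a prefix-free code table: most frequent word → shortest bit string
function buildPrefixFreeCodes(frequency, words) {
  // One leaf node per word that actually occurred
  let nodes = words
    .map((w, i) => ({ word: w, freq: frequency[i] }))
    .filter(n => n.freq > 0);
  if (!nodes.length) return {};
  // Repeatedly merge the two least frequent nodes (Huffman's algorithm)
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.freq - b.freq);
    const [a, b] = nodes.splice(0, 2);
    nodes.push({ left: a, right: b, freq: a.freq + b.freq });
  }
  // Walk the tree: a left edge adds "0", a right edge adds "1"
  const codes = {};
  (function walk(node, code) {
    if (node.word !== undefined) { codes[node.word] = code || "0"; return; }
    walk(node.left, code + "0");
    walk(node.right, code + "1");
  })(nodes[0], "");
  return codes;
}

buildPrefixFreeCodes([5, 2, 1], ["the", "we", "from"]);
// { the: "1", we: "01", from: "00" } — most frequent word, shortest code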
For example, take the line "From fairest creatures we desire increase". "From" and "we" are in the dictionary, so we store frequency["from"] = 1 and frequency["we"] = 1. This process is repeated for all the lines to produce our intermediate code.