Einstein had a saying that people always quote, the one about explaining things to a six-year-old.
I have no clue why he said it, or whether he said it at all, because a six-year-old who knows Einstein is no longer six.
But I will try my best.
Everything starts with a tale
Once upon a time, there were three brothers named Generator, Sender and Receiver (GRS for short). They were so clever that the chief simply hated them.
One day the chief came to the GRS brothers and gave them a task. They had to synchronize libraries A and B in the town, so that the books in B were exactly the same as those in A, within 24 hours. Otherwise they would be exiled.
That was rather evil of the chief, because it was an impossible mission! Imagine: each library had thousands of books and millions of pages. The easiest way would be to throw away all the books in B and bring over copies of the books from A, but that would definitely be slow. Let’s say the brothers could print 10 pages per minute: it would take a few months to synchronize everything, and they only had 24 hours.
The chief thought this would be the end of the GRS brothers, but to his surprise, by the next day library B had been completely and perfectly synchronized.
How did the GRS brothers do that?
The brothers realized that the books in both libraries were almost the same; only a few pages were too old to be readable. So this is how they got the task done.
First, Generator looks through all the books in library A and, for each of them, sends a message to Receiver, who is in library B at that moment.
A book that B has but A does not => Generator asks Receiver to throw it away.
A book that A has but B does not => Generator asks Receiver to make a copy of the one from A.
A book that both have => Generator asks Receiver to do the math based on The Algo and jot the results down as something called checksums, as below.
1-10: 99
2-11: 98
3-12: 55
4-13: 102
...
91-100: 123
Then he sends them to Sender.
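Generator's three cases can be sketched as plain set operations over book titles. This is only an illustration of the tale: the function name `plan_sync` and the sample titles are invented for this example and are not part of rsync.

```python
def plan_sync(titles_in_a, titles_in_b):
    """Split titles into the three cases Generator distinguishes."""
    a, b = set(titles_in_a), set(titles_in_b)
    return {
        "throw_away": sorted(b - a),  # B has it, A does not
        "copy_whole": sorted(a - b),  # A has it, B does not
        "checksum": sorted(a & b),    # both have it: compare via checksums
    }


# Hypothetical shelves, made up for the example.
plan = plan_sync(["Dune", "Emma", "Ulysses"], ["Emma", "Ulysses", "Walden"])
print(plan)
```

Only the titles in the `checksum` group need the page-by-page work described next; the other two groups are handled by a plain delete or a plain copy.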
After receiving the checksums, Sender can use them and The Algo to find the differences between his book and the one in B.
First, for pages 1-10, Sender quickly counts 90 words in the segment and compares that with Receiver’s checksum for pages 1-10 (which is 99). They obviously differ, so he writes the number 1 in his draft and makes a copy of page 1.
Then he repeats the work for pages 2-11. The word counts differ again, so he writes 2 in his draft and also makes a copy of page 2.
The same happens for pages 3-12, 4-13 and 5-14, so Sender writes 3, 4, 5 in his draft and, as you might already guess, makes copies of those pages.
But for the segment of pages 6-15, Sender finds that the number of words is equal to Receiver’s. He therefore jumps straight to pages 16-25 instead of moving on to pages 7-16 as in the previous steps, without writing anything down or making any copy.
And so on: the same step is repeated until the end of the book. Here is what he has in his draft.
1, 2, 3, 4, 5, 16, 101, 102, 103
along with the copies, of course. He sends them to Receiver and moves on to the other books.
After receiving the number list from Sender, Receiver will start synchronizing the book.
The list shows that pages 1, 2, 3, 4, 5, 16, 101, 102, 103 have changed, so Receiver removes those pages from his book and replaces them with the corresponding copies from Sender.
So, a book has been successfully synchronized without having to reprint everything.
Back to real life
Yes, this is exactly how rsync works.
Libraries A and B are the source and destination directories, books are files, and page segments are byte buffers.
The Algo is the rolling checksum, which is the heart of rsync. Of course the real checksum algorithm is not as simple as counting words, nor should it be. It is much more complex, as you can see here.
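To give a feel for what "rolling" means, here is a minimal sketch modeled on the two-sum weak checksum described in the rsync technical report. It is a simplification for illustration, not rsync's actual source code, and the names `weak_checksum` and `roll` are my own. The point is that sliding the window by one byte costs a couple of additions instead of a full recomputation.

```python
M = 1 << 16  # both running sums are kept modulo 2**16


def weak_checksum(block: bytes) -> tuple[int, int]:
    """Compute the pair (a, b) for a whole block from scratch."""
    n = len(block)
    a = sum(block) % M
    # The first byte is weighted n, the last byte is weighted 1.
    b = sum((n - i) * byte for i, byte in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, old_byte: int, new_byte: int, n: int) -> tuple[int, int]:
    """Slide a window of length n one byte to the right in O(1)."""
    a = (a - old_byte + new_byte) % M
    b = (b - n * old_byte + a) % M
    return a, b


data = b"the quick brown fox jumps over the lazy dog"
n = 8
a, b = weak_checksum(data[:n])
rolled_ok = True
for start in range(1, len(data) - n + 1):
    # Drop the byte leaving the window, add the byte entering it.
    a, b = roll(a, b, data[start - 1], data[start + n - 1], n)
    rolled_ok = rolled_ok and (a, b) == weak_checksum(data[start:start + n])

print(rolled_ok)  # True when rolling and recomputing agree on every window
```

In the tale's terms, Sender does not recount all ten pages each time he moves one page forward; he subtracts the page that left the segment and adds the page that entered it.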
There are two types of checksum in rsync: the weak and the strong. If the weak one matches, the sender calculates the strong one to make sure the buffers are really equal.
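The weak-then-strong check can be sketched as follows. Unlike the tale, the destination here is cut into fixed, non-overlapping blocks and the sender slides a window over its own file one byte at a time, which is closer to how the rsync technical report describes it. Everything else is an assumption made for brevity: `zlib.adler32` stands in for the weak checksum (and is recomputed per window rather than rolled), MD5 stands in for the strong one, the block size is tiny, and the function names are invented.

```python
import hashlib
import zlib

BLOCK = 4  # unrealistically small, so the example is easy to follow


def weak(chunk: bytes) -> int:
    return zlib.adler32(chunk)  # cheap stand-in, may collide


def strong(chunk: bytes) -> bytes:
    return hashlib.md5(chunk).digest()  # expensive stand-in, confirms a match


def signatures(old: bytes) -> dict[int, list[tuple[bytes, int]]]:
    """Destination side: checksum every full block of the old file."""
    table: dict[int, list[tuple[bytes, int]]] = {}
    for index in range(len(old) // BLOCK):
        chunk = old[index * BLOCK:(index + 1) * BLOCK]
        table.setdefault(weak(chunk), []).append((strong(chunk), index))
    return table


def make_delta(new: bytes, table) -> list:
    """Source side: emit block indexes for matches, raw bytes otherwise."""
    delta, literal, pos = [], bytearray(), 0
    while pos + BLOCK <= len(new):
        window = new[pos:pos + BLOCK]
        match = None
        for strong_sum, index in table.get(weak(window), []):
            if strong_sum == strong(window):  # only computed on a weak hit
                match = index
                break
        if match is None:
            literal.append(new[pos])  # no match: ship this byte, slide by one
            pos += 1
        else:
            if literal:
                delta.append(bytes(literal))
                literal = bytearray()
            delta.append(match)  # match: reference the block, jump ahead
            pos += BLOCK
    literal += new[pos:]
    if literal:
        delta.append(bytes(literal))
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Destination side: rebuild the new file from old blocks and literals."""
    out = bytearray()
    for item in delta:
        out += old[item * BLOCK:(item + 1) * BLOCK] if isinstance(item, int) else item
    return bytes(out)


old = b"aaaabbbbccccdddd"
new = b"aaaaXbbbbccccdddZ"
delta = make_delta(new, signatures(old))
rebuilt = apply_delta(old, delta)
print(delta, rebuilt == new)
```

The delta contains block numbers wherever the destination already holds the data and raw bytes only where it does not, which is the digital version of Sender's draft plus his page copies.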
In practice, the receiver does not modify its local file directly; it writes to a temporary file instead. This keeps the local file undamaged if an error occurs. Once syncing is complete, the temporary file is renamed to replace the old one.
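The write-to-temp-then-rename pattern is easy to reproduce. The sketch below is a generic Python version of the idea, not rsync's own code; the helper name `replace_atomically` and the `.sync-` prefix are made up for the example.

```python
import os
import tempfile


def replace_atomically(path: str, data: bytes) -> None:
    """Write data to a temp file next to path, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # Same directory as the target, so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".sync-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # the old file is swapped out in one step
    except BaseException:
        # On any failure the original file is untouched; drop the temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "book.txt")
    with open(target, "wb") as f:
        f.write(b"old pages")
    replace_atomically(target, b"new pages")
    with open(target, "rb") as f:
        result = f.read()
    leftovers = os.listdir(workdir)

print(result, leftovers)
```

If the write fails halfway, readers of `book.txt` still see the complete old version, never a half-written file.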
So, as you can see, this architecture is what makes rsync efficient.
- Partial file synchronization: make two files equal without copying everything.
- Each of the three GRS roles works separately without blocking the others; data moves between them through message passing.
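The pipeline idea in the last point can be sketched with three workers connected by queues. This is a toy model under clear assumptions: threads and `queue.Queue` stand in for rsync's separate processes and their communication channels, the payloads are placeholder strings, and all names are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel telling the next stage that no more work is coming


def generator(names, to_sender):
    for name in names:
        to_sender.put((name, f"checksums-of-{name}"))  # placeholder payload
    to_sender.put(DONE)


def sender(to_sender, to_receiver):
    while (item := to_sender.get()) is not DONE:
        name, _checksums = item
        to_receiver.put((name, f"delta-for-{name}"))  # placeholder payload
    to_receiver.put(DONE)


def receiver(to_receiver, synced):
    while (item := to_receiver.get()) is not DONE:
        name, _delta = item
        synced.append(name)  # a real receiver would rebuild the file here


to_sender, to_receiver, synced = queue.Queue(), queue.Queue(), []
workers = [
    threading.Thread(target=generator, args=(["a.txt", "b.txt", "c.txt"], to_sender)),
    threading.Thread(target=sender, args=(to_sender, to_receiver)),
    threading.Thread(target=receiver, args=(to_receiver, synced)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(synced)
```

Because each stage only waits on its own queue, Generator can already be working on the third file while Sender handles the second and Receiver finishes the first.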