[GIVEAWAY] Content crawler Node.js

rafongol

Member
Dec 7, 2020
Hello Babiato community, this is my first contribution here and I hope I'm in the right forum 🙏😂 Anyway, today I'll share a simple script that I created to scrape content from websites, especially ones without .htaccess restrictions. I'll also try to explain exactly how it works so it's easy for everyone to follow.

First of all, I want to mention that this method does the same job as the wget method we usually use to crawl content from a website, but this script makes it easier.

What do I need to know before I use the script?
Absolutely nothing!

What do I need to start scraping web pages?
You only need to download Node.js 👈👈

After downloading and installing Node.js, download the script from the attached files.
Then save the script in a directory. We'll call this directory nodeAPP, for example, so we will have this tree:

|--nodeAPP (this is the directory that we created)
|----index.js (this is the script that we downloaded)

After that, open CMD (the default command-line interpreter on Windows; use the terminal if you are on Linux).
Then go to your nodeAPP directory in CMD and type there: npm install website-scraper

Once the module has finished installing, open the script in any text editor and change line 4:
urls: ['https://www.hereyouputyoururl.com'], 👈 put the website that you want to crawl here.

Then change line 5:
directory: './directory_name', 👈 put the name of the directory you want the script to create here. It will hold all the content that the script downloads.

After that, open CMD again, navigate to the nodeAPP directory, and type: node index.js
Wait a little bit, then.. VOILA! You will have your crawled content inside nodeAPP, in a directory with the name you set on line 5.
So, for example, if we put directory: './xwebsite' on line 5, then the tree will look exactly like this:

|--nodeAPP (this is the directory that we created)
|----node_modules (this directory is created automatically when you install the npm module)
|----xwebsite (this directory was created by the script and contains all the crawled content)
|----index.js (this is the script that we downloaded from Babiato)

That's it! I hope it was clear. It's not complicated at all, it's super easy, so give it a try. Cheers, mates!

Here you can find the website-scraper package on npm if you want to read about it.
 

Attachments

  • index.zip
    322 bytes · Views: 33

rafongol

Member
Dec 7, 2020
"Thanks for sharing. But it looks like a kind of website downloader/cloner, not a content crawler, is that right?"

If by content crawler you mean a tool that extracts the data from the website, then you're right, this script doesn't do that. But if you mean that you cannot get all of the website's files and scripts with it, that's wrong.
 
